Re: Support for header injection at agent(Final Hop in pipeline)

Mike Percy Fri, 13 Apr 2012 01:59:54 -0700

Funny, I thought it was doing the opposite thing. :)

It should be very easy to implement what you are describing and seems like a 
common use case. We just need some decent syntax or a configuration setting to 
indicate which timestamp we are talking about.
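
To make the two options concrete, here is a rough sketch of the choice being discussed: bucket the path by the event's "timestamp" header (set at the first hop) or by the clock of the agent running the sink. This is not Flume code. The function name, the `use_local_time` switch and the `now_ms` parameter are made up for illustration, and it assumes the header holds milliseconds since the epoch and that paths are formatted in UTC.

```python
import time
from datetime import datetime, timezone


def resolve_bucket_path(pattern, headers, use_local_time=False, now_ms=None):
    """Expand strftime-style escapes (%Y, %m, %d, %H, %M) in a sink path.

    Hypothetical helper, not part of Flume. Assumes the "timestamp" header
    carries milliseconds since the epoch and that buckets are named in UTC.
    """
    if use_local_time:
        # Use the clock of the agent running the sink (final hop).
        ts_ms = now_ms if now_ms is not None else int(time.time() * 1000)
    else:
        # Use the timestamp injected upstream, e.g. by the client library.
        ts_ms = int(headers["timestamp"])
    moment = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    return moment.strftime(pattern)


if __name__ == "__main__":
    event_headers = {"timestamp": "1334307594000"}  # set at the first hop
    pattern = "/flume-data/%Y/%m/%d/%H/%M"
    print(resolve_bucket_path(pattern, event_headers))
    print(resolve_bucket_path(pattern, event_headers, use_local_time=True))
```

With the switch on, a late or clock-skewed event lands in the directory for the minute it reached the sink, which is what the minute-directory contract needs.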


Mike

On Apr 13, 2012, at 1:41 AM, Inder Pall wrote:

> Mike,
> 
> Concisely put, I want YYYY/mm/DD/HH/MM (in the path) to be the
> timestamp of the agent running the HDFSEventSink.
> The current code uses the timestamp header injected by the client lib
> (running on a different box), which doesn't work for me.
> 
> - inder
> 
> On Fri, Apr 13, 2012 at 12:45 PM, Mike Percy wrote:
> 
>> Hi Inder,
>> Can you briefly summarize what you want to do and what is missing from
>> flume for you to do it?
>> 
>> Seems like you could store in a structure like this with static configs:
>> flume-data/YYYY/mm/DD/HH/MM/<streamName>.<collectorName>.<filename>.gz
>> 
>> As you mentioned, you would have to use one HDFS sink per stream/collector
>> pair to define this statically in the config files.
>> 
>> Is the problem that you want events to be strictly contained in a log file
>> named according to their internal timestamp? Is it not acceptable to go by
>> event delivery time at the agent?
>> 
>> Best,
>> Mike
>> 
>> On Apr 12, 2012, at 12:33 AM, Inder Pall wrote:
>> 
>>> Mike and Hari,
>>> 
>>> Appreciate your prompt and detailed responses.
>>> 
>>> 1. For timestamp header at agent - OOZIE-based consumer workflows wait
>>> for data in a directory structure like /flume-data/YYYY/MM/DD/HH/MN/.*.
>>> We have a *contract* - a minute-level directory can be consumed (is
>>> *immutable*) once the next minute directory is available. If the
>>> "timestamp" injected by clientLib is used, it's difficult to guarantee
>>> this contract (messages coming late, clocks not synchronized, etc.).
>>> 
>>> Mike, specifically the configuration i was planning for is
>>> ((N) ClientLib *=>* (N) event-generating agents (Avro Source + AvroSink))
>>> => (M < N) collector agents (AvroSource + HDFSEventSink)
>>> 
>>> 2. I agree that the agentPath use-case can be supported without headers
>>> through a separate HDFSEventSink configuration. This will ensure that
>>> different agents write to different paths (thereby avoiding any
>>> critical-section issues).
>>> 
>>> Reason for asking was to avoid directory structure like -
>>> */flume-data/<collector1>/YYYY/MM/DD/HH/MN*
>>> */flume-data/<collector1>/YYYY/MM/DD/HH/MN*
>>> ......................................
>>> ......................................
>>> */flume-data/<collector1>/YYYY/MM/DD/HH/MN*
>>> and instead have
>>> */flume-data/YYYY/MM/DD/HH/MN*/<collectorName>-<streamName>-<FileName>.gz
>>> (Adding the collectorName avoids the issue of multiple writers writing to
>>> the same file.)
>>> 
>>> However, it's tough to obey the above-mentioned contract - collector1 has
>>> moved forward to a new directory and collector2 is still writing to the old
>>> minute directory. Just wanted to avoid the additional hop of moving data
>>> from collector-specific directories to one unified location, though I can
>>> live with it.
>>> 
>>> I don't want to do something specific here and end up maintaining a
>>> different version of FLUME :(
>>> Let me know what you guys think, i believe as the adoption grows so will
>>> use-cases which require adding/modifying headers at avroSource.
>>> 
>>> Looking forward to hearing from you folks
>>> 
>>> 
>>> - Inder
>>> On Wed, Apr 11, 2012 at 11:43 PM, Mike Percy wrote:
>>> 
>>>> Well, we almost support #1, although the way to do it is pass a
>>>> "timestamp" header at the first hop. Then you can use the BucketPath
>>>> shorthand stuff to name the hdfs.path according to this spec (except for
>>>> the agent-hostname thing).
>>>> 
>>>> With #2 it seems reasonable to add support for an arbitrary "tag" header
>>>> or something like that which one could use in the hdfs.path as well. But it
>>>> would have to come from the first-hop agent at this point. The tag could
>>>> take the place of the hostname.
>>>> 
>>>> Something that might get Flume closer to the below vision without hacking
>>>> the core is adding support for a plugin interface to AvroSource which can
>>>> annotate headers. However I worry that people might take this and try to do
>>>> all kinds of parsing and whatnot. So I think the first cut should only
>>>> support reading & setting headers. This is basically a "routing" feature
>>>> which I would argue Flume needs to be good at and flexible for.
>>>> 
>>>> Just in case I misinterpreted the use case, I want to make sure we are not
>>>> trying to have multiple HDFSEventSink agents append to the same HDFS file
>>>> simultaneously, since I am pretty sure Hadoop doesn't support that.
>>>> 
>>>> Inder, just to clarify, is this what you are doing?
>>>> 
>>>> (N) event-generating agents (Custom Source + AvroSink) => (M < N)
>>>> collector agents (AvroSource + AvroSink) => Load-Balancing VIP =>
>>>> (AvroSource + HDFSEventSink) => HDFS
>>>> 
>>>> Best,
>>>> Mike
>>>> 
>>>> On Apr 11, 2012, at 9:55 AM, Hari Shreedharan wrote:
>>>> 
>>>>> Hi Inder,
>>>>> 
>>>>> I think these use cases are quite specific to your requirements. Even
>>>>> though I did not clearly understand (2), I think that can be addressed
>>>>> through configuration, and you would not need to add any new code for that.
>>>>> I don't understand why you would want to inject a header in that case. You
>>>>> can simply have different configurations for each of the agents, with
>>>>> different sink paths. So agent A would have a sink configured to write to
>>>>> /flume-data/agenta/.… and so on.
>>>>> 
>>>>> I don't think we have support for something like (1) as of now. It does
>>>>> not look like something which is very generic, and I have not heard of
>>>>> someone else having such a requirement. If you want this, the only way I
>>>>> can see is to pick up AvroSource and add this support, making it
>>>>> configurable (an on/off switch in the conf).
>>>>> 
>>>>> Thanks
>>>>> Hari
>>>>> 
>>>>> --
>>>>> Hari Shreedharan
>>>>> 
>>>>> 
>>>>> On Wednesday, April 11, 2012 at 4:26 AM, Inder Pall wrote:
>>>>> 
>>>>>> Folks,
>>>>>> 
>>>>>> I have two use-cases and both seem to be landing under this requirement
>>>>>> 
>>>>>> 1. Support to publish files in HDFS in /flume-data/YYYY/MM/DD/HH/MN.
>>>>>> Timestamp is the arrival time on this agent.
>>>>>> Can be addressed by passing "timestamp" in the HEADERS of the event.
>>>>>> Caveat is I want to pass this header at the final agent in the pipeline.
>>>>>> 2. Have multiple flume agents configured behind a VIP writing to the same
>>>>>> HDFS sink path.
>>>>>> One of the ways is to have the path like -
>>>>>> /flume-data/<flume-agent-hostname>/YYYY/MM/DD/HH/MN
>>>>>> Again can be addressed by passing a header "hostname" at flume agent and
>>>>>> configuring the sink path appropriately.
>>>>>> 
>>>>>> Would appreciate any help on how to address this in a generic way in
>>>>>> FLUME. Seems to be a generic use-case for anyone planning to take FLUME
>>>>>> to production.
>>>>>> 
>>>>>> --
>>>>>> Thanks,
>>>>>> - Inder
>>>>>> Tech Platforms @Inmobi
>>>>>> Linkedin - http://goo.gl/eR4Ub
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> --
>>> Thanks,
>>> - Inder
>>> Tech Platforms @Inmobi
>>> Linkedin - http://goo.gl/eR4Ub
>> 
>> 
> 
> 
> -- 
> Thanks,
> - Inder
>  Tech Platforms @Inmobi
>  Linkedin - http://goo.gl/eR4Ub
