Hi Inder,
Can you briefly summarize what you want to do and what is missing from Flume
for you to do it?

It seems like you could store the data in a structure like this with static configs:
flume-data/YYYY/MM/DD/HH/MN/<streamName>.<collectorName>.<filename>.gz

As you mentioned, you would have to use one HDFS sink per stream/collector pair 
to define this statically in the config files.
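
A minimal sketch of what one such static sink definition might look like, for a
single stream/collector pair. The agent, channel, and sink names ("collector1",
"ch1", "streamA") are hypothetical, and the compression settings are assumed
from the .gz suffix in the layout above. The time escapes in hdfs.path rely on a
"timestamp" header being present on each event, and the file name Flume actually
generates may not match the layout above exactly.

```
# Sketch only: "collector1", "ch1" and "streamA" are hypothetical names.
collector1.sinks = streamA
collector1.sinks.streamA.type = hdfs
collector1.sinks.streamA.channel = ch1
# Time escapes are resolved from the event's "timestamp" header.
collector1.sinks.streamA.hdfs.path = /flume-data/%Y/%m/%d/%H/%M
# Static prefix standing in for <streamName>.<collectorName>.
collector1.sinks.streamA.hdfs.filePrefix = streamA.collector1
# Assumed settings for gzip-compressed output.
collector1.sinks.streamA.hdfs.fileType = CompressedStream
collector1.sinks.streamA.hdfs.codeC = gzip
```

Each additional stream/collector pair would get its own sink block with a
different filePrefix, so no two sinks write to the same file.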

Is the problem that you want events to be strictly contained in a log file 
named according to their internal timestamp? Is it not acceptable to go by 
event delivery time at the agent?

Best,
Mike

On Apr 12, 2012, at 12:33 AM, Inder Pall wrote:

> Mike and Hari,
> 
> Appreciate your prompt and detailed responses.
> 
> 1. For the timestamp header at the agent: OOZIE-based consumer workflows wait
> for data in a directory structure like */flume-data/YYYY/MM/DD/HH/MN/*. We
> have a *contract*: a minute-level directory can be consumed (is *immutable*)
> once the next minute directory is available. If the "*timestamp*" injected by
> the *clientLib* is used, it's difficult to guarantee this contract *(messages
> coming late, clocks not synchronized, etc.)*.
> 
> Mike, specifically the configuration i was planning for is
> ((N) ClientLib *=>* (N) event-generating agents (Avro Source + AvroSink))
> => (M < N) collector agents (AvroSource + HDFSEventSink)
> 
> 2. I agree that the agentPath use case can be supported without headers
> through a separate HDFSEventSink configuration. This will ensure that
> different agents write to different paths *(thereby avoiding any
> critical-section issues)*.
> 
> Reason for asking was to avoid directory structure like -
> */flume-data/<collector1>/YYYY/MM/DD/HH/MN*
> */flume-data/<collector2>/YYYY/MM/DD/HH/MN*
> ......................................
> ......................................
> */flume-data/<collectorN>/YYYY/MM/DD/HH/MN*
> and instead have
> */flume-data/YYYY/MM/DD/HH/MN*/<collectorName>-<streamName>-<FileName>.gz
> (*Adding the collectorName avoids the issue of multiple folks writing to the
> same file*)
> 
> However it's tough to obey the above mentioned contract - collector1 has
> moved forward to a new directory and collector2 is still writing to the old
> minute directory. Just wanted to avoid the additional hop of moving data
> from collector specific directories to one unified location, though i can
> live with it.
> 
> I don't want to do something specific here and end up maintaining a different
> version of FLUME :(
> Let me know what you guys think. I believe that as adoption grows, so will
> the use cases that require adding/modifying headers at avroSource.
> 
> Looking forward to hearing from you folks
> 
> 
> - Inder
> On Wed, Apr 11, 2012 at 11:43 PM, Mike Percy <[email protected]> wrote:
> 
>> Well, we almost support #1, although the way to do it is pass a
>> "timestamp" header at the first hop. Then you can use the BucketPath
>> shorthand stuff to name the hdfs.path according to this spec (except for
>> the agent-hostname thing).
>> 
>> With #2 it seems reasonable to add support for an arbitrary "tag" header
>> or something like that which one could use in the hdfs.path as well. But it
>> would have to come from the first-hop agent at this point. The tag could
>> take the place of the hostname.
>> 
>> Something that might get Flume closer to the below vision without hacking
>> the core is adding support for a plugin interface to AvroSource which can
>> annotate headers. However I worry that people might take this and try to do
>> all kinds of parsing and whatnot. So I think the first cut should only
>> support reading & setting headers. This is basically a "routing" feature
>> which I would argue Flume needs to be good at and flexible for.
>> 
>> Just in case I misinterpreted the use case, I want to make sure we are not
>> trying to have multiple HDFSEventSink agents append to the same HDFS file
>> simultaneously, since I am pretty sure Hadoop doesn't support that.
>> 
>> Inder, just to clarify, is this what you are doing?
>> 
>> (N) event-generating agents (Custom Source + AvroSink) => (M < N)
>> collector agents (AvroSource + AvroSink) => Load-Balancing VIP =>
>> (AvroSource + HDFSEventSink) => HDFS
>> 
>> Best,
>> Mike
>> 
>> On Apr 11, 2012, at 9:55 AM, Hari Shreedharan wrote:
>> 
>>> Hi Inder,
>>> 
>>> I think these use cases are quite specific to your requirements. Even
>> though I did not clearly understand (2), I think that can be addressed
>> through configuration, and you would not need to add any new code for that.
>> I don't understand why you would want to inject a header in that case. You
>> can simply have different configurations for each of the agents, with
>> different sink paths. So agent A would have a sink configured to write to
>> /flume-data/agenta/.… and so on.
>>> 
>>> I don't think we have support for something like (1) as of now. It does
>> not look like something which is very generic, and I have not heard of
>> anyone else having such a requirement. If you want this, the only way I
>> can see is to pick up AvroSource, add this support, and make it
>> configurable (on/off switch in the conf).
>>> 
>>> Thanks
>>> Hari
>>> 
>>> --
>>> Hari Shreedharan
>>> 
>>> 
>>> On Wednesday, April 11, 2012 at 4:26 AM, Inder Pall wrote:
>>> 
>>>> Folks,
>>>> 
>>>> i have two use-cases and both seem to be landing under this requirement
>>>> 
>>>> 1. Support to publish files in HDFS in /flume-data/YYYY/MM/DD/HH/MN.
>>>> Timestamp is the arrival time on this agent.
>>>> Can be addressed by passing "timestamp" in HEADERS of event. Caveat is I
>>>> want to pass this header at the final agent in pipeline.
>>>> 2. Have multiple flume agents configured behind a VIP writing to the same
>>>> HDFS sink path.
>>>> One of the ways is to have the path like -
>>>> /flume-data/<flume-agent-hostname>/YYYY/MM/DD/HH/MN
>>>> Again can be addressed by passing a header "hostname" at flume agent and
>>>> configuring the sink path appropriately.
>>>> 
>>>> Would appreciate any help on how to address this in a generic way in FLUME.
>>>> Seems to be a generic use-case for anyone planning to take FLUME to
>>>> production.
>>>> 
>>>> --
>>>> Thanks,
>>>> - Inder
>>>> Tech Platforms @Inmobi
>>>> Linkedin - http://goo.gl/eR4Ub
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
> 
> 
> -- 
> Thanks,
> - Inder
>  Tech Platforms @Inmobi
>  Linkedin - http://goo.gl/eR4Ub
