Mike and Hari, Appreciate your prompt and detailed responses.
1. For the timestamp header at the agent: OOZIE-based consumer workflows wait for data in a directory structure like /flume-data/YYYY/MM/DD/HH/MN/. We have a contract: a minute-level directory can be consumed (is immutable) once the next minute directory is available. If the "timestamp" injected by the clientLib is used, it's difficult to guarantee this contract (messages coming late, clocks not synchronized, etc.).

Mike, specifically, the configuration I was planning is

((N) ClientLib => (N) event-generating agents (AvroSource + AvroSink)) => (M < N) collector agents (AvroSource + HDFSEventSink)

2. I agree that the agentPath use case can be supported without headers through a separate HDFSEventSink configuration per agent. This will ensure that different agents write to different paths, thereby avoiding any critical-section issues. My reason for asking was to avoid a directory structure like

/flume-data/<collector1>/YYYY/MM/DD/HH/MN
/flume-data/<collector2>/YYYY/MM/DD/HH/MN
......................................
/flume-data/<collectorN>/YYYY/MM/DD/HH/MN

and instead have

/flume-data/YYYY/MM/DD/HH/MN/<collectorName>-<streamName>-<FileName>.gz

(adding the collectorName avoids the issue of multiple writers writing to the same file). However, it's tough to obey the contract above with this layout: collector1 may have moved forward to a new directory while collector2 is still writing to the old minute directory.

I just wanted to avoid the additional hop of moving data from collector-specific directories to one unified location, though I can live with it. I don't want to do something specific here and end up maintaining a different version of FLUME :(

Let me know what you guys think. I believe that as adoption grows, so will the use cases that require adding/modifying headers at AvroSource.
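To make the contract in (1) and the multi-collector problem in (2) concrete, here is a minimal Python sketch. The function name and the input shape (collector name mapped to the minute directories it has written so far) are hypothetical, not anything FLUME or OOZIE provides. It also assumes every collector is known up front, that each collector writes minutes in order, and that "the next minute directory is available" can be read as "a later directory exists for that collector".

```python
def consumable_minutes(dirs_by_collector):
    """Return the minute directories that are safe to consume.

    dirs_by_collector maps a (hypothetical) collector name to the
    'YYYY/MM/DD/HH/MN' directories that collector has written to.
    A minute is treated as immutable only once *every* collector
    has moved on to a later minute.
    """
    # If a collector has written nothing yet, nothing can be declared safe.
    if not dirs_by_collector or any(not dirs for dirs in dirs_by_collector.values()):
        return []

    # Zero-padded YYYY/MM/DD/HH/MN strings sort chronologically, so plain
    # string comparison is enough. The slowest collector's newest minute
    # is the low-water mark: it may still be receiving writes.
    low_water = min(max(dirs) for dirs in dirs_by_collector.values())

    all_dirs = set().union(*dirs_by_collector.values())
    return sorted(d for d in all_dirs if d < low_water)
```

With collector1 at 2012/04/11/23/43 and collector2 still at 2012/04/11/23/42, only minutes before 23/42 come back as consumable, which is exactly the case where a consumer that looks only at the shared directory tree would pick up 23/42 too early.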
Looking forward to hearing from you folks

- Inder

On Wed, Apr 11, 2012 at 11:43 PM, Mike Percy <[email protected]> wrote:

> Well, we almost support #1, although the way to do it is pass a
> "timestamp" header at the first hop. Then you can use the BucketPath
> shorthand stuff to name the hdfs.path according to this spec (except for
> the agent-hostname thing).
>
> With #2 it seems reasonable to add support for an arbitrary "tag" header
> or something like that which one could use in the hdfs.path as well. But it
> would have to come from the first-hop agent at this point. The tag could
> take the place of the hostname.
>
> Something that might get Flume closer to the below vision without hacking
> the core is adding support for a plugin interface to AvroSource which can
> annotate headers. However I worry that people might take this and try to do
> all kinds of parsing and whatnot. So I think the first cut should only
> support reading & setting headers. This is basically a "routing" feature
> which I would argue Flume needs to be good at and flexible for.
>
> Just in case I misinterpreted the use case, I want to make sure we are not
> trying to have multiple HDFSEventSink agents append to the same HDFS file
> simultaneously, since I am pretty sure Hadoop doesn't support that.
>
> Inder, just to clarify, is this what you are doing?
>
> (N) event-generating agents (Custom Source + AvroSink) => (M < N)
> collector agents (AvroSource + AvroSink) => Load-Balancing VIP =>
> (AvroSource + HDFSEventSink) => HDFS
>
> Best,
> Mike
>
> On Apr 11, 2012, at 9:55 AM, Hari Shreedharan wrote:
>
> > Hi Inder,
> >
> > I think these use cases are quite specific to your requirements. Even
> > though I did not clearly understand (2), I think that can be addressed
> > through configuration, and you would not need to add any new code for
> > that. I don't understand why you would want to inject a header in that
> > case. You can simply have different configurations for each of the
> > agents, with different sink paths. So agent A would have a sink
> > configured to write to /flume-data/agenta/.… and so on.
> >
> > I don't think we have support for something like (1) as of now. It does
> > not look like something which is very generic, and have not heard of
> > someone else having such a requirement. If you want this, the only way I
> > can see it, is to pick up AvroSource and add this support, and make it
> > configurable(on/off switch in the conf).
> >
> > Thanks
> > Hari
> >
> > --
> > Hari Shreedharan
> >
> > On Wednesday, April 11, 2012 at 4:26 AM, Inder Pall wrote:
> >
> >> Folks,
> >>
> >> i have two use-cases and both seem to be landing under this requirement
> >>
> >> 1. Support to publish files in HDFS in /flume-data/YYYY/MM/DD/HH/MN.
> >> Timestamp is the arrival time on this agent.
> >> Can be addressed by passing "timestamp" in HEADERS of event. Caveat is i
> >> want to pass this header at the final agent in pipeline.
> >> 2. Have multiple flume agents configured behind a VIP writing to the same
> >> HDFS sink path.
> >> One of the way's is to have the path like -
> >> /flume-data/<flume-agent-hostname>/YYYY/MM/DD/HH/MN
> >> Again can be addressed by passing a header "hostname" at flume agent and
> >> configuring the sink path appropriately.
> >>
> >> Would appreciate any help on how to address this in a generic way in FLUME.
> >> Seems to be a generic use-case for anyone planning to take FLUME to
> >> production.
> >>
> >> --
> >> Thanks,
> >> - Inder
> >> Tech Platforms @Inmobi
> >> Linkedin - http://goo.gl/eR4Ub

--
Thanks,
- Inder
Tech Platforms @Inmobi
Linkedin - http://goo.gl/eR4Ub
