Mike, concisely put, i want YYYY/mm/DD/HH/MM (in the path) to be the timestamp of the agent running the HDFSEventSink. The current code uses the "timestamp" header, which is injected by the client lib (running on a different box), and that doesn't work for me.
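What Inder asks for here is what later Flume 1.x releases expose directly: a timestamp interceptor bound to the terminal agent's source overwrites the "timestamp" header with the event's arrival time at that agent, and the HDFS sink's path escapes then bucket by that local clock. A minimal sketch of such a collector config; the agent and component names (collector1, avro-in, ch1, hdfs-out) are illustrative assumptions:

```properties
# Hypothetical collector agent; names are examples, not from the thread.
collector1.sources = avro-in
collector1.channels = ch1
collector1.sinks = hdfs-out

collector1.sources.avro-in.type = avro
collector1.sources.avro-in.bind = 0.0.0.0
collector1.sources.avro-in.port = 4141
collector1.sources.avro-in.channels = ch1
# The timestamp interceptor stamps each event with arrival time at THIS
# agent, so bucketing reflects the HDFSEventSink box's clock rather than
# whatever the client lib injected upstream.
collector1.sources.avro-in.interceptors = ts
collector1.sources.avro-in.interceptors.ts.type = timestamp

collector1.sinks.hdfs-out.type = hdfs
collector1.sinks.hdfs-out.channel = ch1
# %Y/%m/%d/%H/%M expand from the "timestamp" header set above.
collector1.sinks.hdfs-out.hdfs.path = /flume-data/%Y/%m/%d/%H/%M
```

Later releases also add `hdfs.useLocalTimeStamp = true` on the HDFS sink, which makes the sink use its own clock for the path escapes without needing any interceptor at all.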
- inder

On Fri, Apr 13, 2012 at 12:45 PM, Mike Percy <[email protected]> wrote:
> Hi Inder,
> Can you briefly summarize what you want to do and what is missing from
> flume for you to do it?
>
> Seems like you could store in a structure like this with static configs:
> flume-data/YYYY/mm/DD/HH/MM/<streamName>.<collectorName>.<filename>.gz
>
> As you mentioned, you would have to use one HDFS sink per stream/collector
> pair to define this statically in the config files.
>
> Is the problem that you want events to be strictly contained in a log file
> named according to their internal timestamp? Is it not acceptable to go by
> event delivery time at the agent?
>
> Best,
> Mike
>
> On Apr 12, 2012, at 12:33 AM, Inder Pall wrote:
>
> > Mike and Hari,
> >
> > Appreciate your prompt and detailed responses.
> >
> > 1. For the timestamp header at the agent - OOZIE-based consumer
> > workflows wait for data in a directory structure like
> > /flume-data/YYYY/MM/DD/HH/MN/. We have a *contract*: a minute-level
> > directory can be consumed (is *immutable*) once the next minute
> > directory is available. If the "timestamp" injected by the clientLib is
> > used, it's difficult to guarantee this contract (messages coming late,
> > clocks not synchronized, etc).
> >
> > Mike, specifically the configuration i was planning for is
> > ((N) ClientLib => (N) event-generating agents (Avro Source + AvroSink))
> > => (M < N) collector agents (AvroSource + HDFSEventSink)
> >
> > 2. I agree that the agentPath use-case can be supported without headers
> > through a separate HDFSEventSink configuration. This will ensure
> > different agents write to different paths (thereby avoiding any
> > critical-section issues).
> >
> > Reason for asking was to avoid a directory structure like:
> > /flume-data/<collector1>/YYYY/MM/DD/HH/MN
> > /flume-data/<collector2>/YYYY/MM/DD/HH/MN
> > ...
> > /flume-data/<collectorN>/YYYY/MM/DD/HH/MN
> > and instead have
> > /flume-data/YYYY/MM/DD/HH/MN/<collectorName>-<streamName>-<FileName>.gz
> > (Adding the collectorName avoids the issue of multiple folks writing to
> > the same file.)
> >
> > However, it's tough to obey the above-mentioned contract - collector1
> > has moved forward to a new directory and collector2 is still writing to
> > the old minute directory. Just wanted to avoid the additional hop of
> > moving data from collector-specific directories to one unified
> > location, though i can live with it.
> >
> > I don't want to do something specific here and end up maintaining a
> > different version of FLUME :(
> > Let me know what you guys think. i believe as the adoption grows, so
> > will use-cases which require adding/modifying headers at AvroSource.
> >
> > Looking forward to hearing from you folks
> >
> > - Inder
> >
> > On Wed, Apr 11, 2012 at 11:43 PM, Mike Percy <[email protected]> wrote:
> >
> >> Well, we almost support #1, although the way to do it is to pass a
> >> "timestamp" header at the first hop. Then you can use the BucketPath
> >> shorthand stuff to name the hdfs.path according to this spec (except
> >> for the agent-hostname thing).
> >>
> >> With #2 it seems reasonable to add support for an arbitrary "tag"
> >> header or something like that which one could use in the hdfs.path as
> >> well. But it would have to come from the first-hop agent at this
> >> point. The tag could take the place of the hostname.
> >>
> >> Something that might get Flume closer to the below vision without
> >> hacking the core is adding support for a plugin interface to
> >> AvroSource which can annotate headers. However I worry that people
> >> might take this and try to do all kinds of parsing and whatnot. So I
> >> think the first cut should only support reading & setting headers.
> >> This is basically a "routing" feature which I would argue Flume needs
> >> to be good at and flexible for.
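The read-and-set-headers hook Mike sketches above is essentially what Flume 1.x's interceptor mechanism later provided. As one hedged example, a static interceptor bound to a first-hop source can attach an arbitrary "tag" header, which the HDFS sink can then reference with `%{...}` in `hdfs.path`; the agent, source, and header names below are illustrative assumptions:

```properties
# Hypothetical first-hop agent snippet; agent1/src1/tag are examples.
# The static interceptor attaches a fixed key/value header to every event.
agent1.sources.src1.interceptors = tag1
agent1.sources.src1.interceptors.tag1.type = static
agent1.sources.src1.interceptors.tag1.key = tag
agent1.sources.src1.interceptors.tag1.value = stream-a

# Downstream, the HDFS sink can route on that header, e.g.:
# collector1.sinks.hdfs-out.hdfs.path = /flume-data/%{tag}/%Y/%m/%d/%H/%M
```

This stays within the "only read & set headers" boundary Mike argues for: no payload parsing, just routing metadata.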
> >>
> >> Just in case I misinterpreted the use case, I want to make sure we
> >> are not trying to have multiple HDFSEventSink agents append to the
> >> same HDFS file simultaneously, since I am pretty sure Hadoop doesn't
> >> support that.
> >>
> >> Inder, just to clarify, is this what you are doing?
> >>
> >> (N) event-generating agents (Custom Source + AvroSink) => (M < N)
> >> collector agents (AvroSource + AvroSink) => Load-Balancing VIP =>
> >> (AvroSource + HDFSEventSink) => HDFS
> >>
> >> Best,
> >> Mike
> >>
> >> On Apr 11, 2012, at 9:55 AM, Hari Shreedharan wrote:
> >>
> >>> Hi Inder,
> >>>
> >>> I think these use cases are quite specific to your requirements.
> >>> Even though I did not clearly understand (2), I think that can be
> >>> addressed through configuration, and you would not need to add any
> >>> new code for that. I don't understand why you would want to inject a
> >>> header in that case. You can simply have different configurations
> >>> for each of the agents, with different sink paths. So agent A would
> >>> have a sink configured to write to /flume-data/agenta/... and so on.
> >>>
> >>> I don't think we have support for something like (1) as of now. It
> >>> does not look like something which is very generic, and I have not
> >>> heard of someone else having such a requirement. If you want this,
> >>> the only way I can see it is to pick up AvroSource and add this
> >>> support, and make it configurable (on/off switch in the conf).
> >>>
> >>> Thanks
> >>> Hari
> >>>
> >>> --
> >>> Hari Shreedharan
> >>>
> >>> On Wednesday, April 11, 2012 at 4:26 AM, Inder Pall wrote:
> >>>
> >>>> Folks,
> >>>>
> >>>> i have two use-cases and both seem to be landing under this
> >>>> requirement
> >>>>
> >>>> 1. Support to publish files in HDFS in /flume-data/YYYY/MM/DD/HH/MN.
> >>>> Timestamp is the arrival time at this agent.
> >>>> Can be addressed by passing "timestamp" in HEADERS of the event.
> >>>> Caveat is i want to pass this header at the final agent in the
> >>>> pipeline.
> >>>> 2. Have multiple flume agents configured behind a VIP writing to
> >>>> the same HDFS sink path.
> >>>> One of the ways is to have a path like
> >>>> /flume-data/<flume-agent-hostname>/YYYY/MM/DD/HH/MN
> >>>> Again, can be addressed by passing a header "hostname" at the flume
> >>>> agent and configuring the sink path appropriately.
> >>>>
> >>>> Would appreciate any help on how to address this in a generic way
> >>>> in FLUME. Seems to be a generic use-case for anyone planning to
> >>>> take FLUME to production.
> >>>>
> >>>> --
> >>>> Thanks,
> >>>> - Inder
> >>>> Tech Platforms @Inmobi
> >>>> Linkedin - http://goo.gl/eR4Ub
> >
> > --
> > Thanks,
> > - Inder
> > Tech Platforms @Inmobi
> > Linkedin - http://goo.gl/eR4Ub

--
Thanks,
- Inder
Tech Platforms @Inmobi
Linkedin - http://goo.gl/eR4Ub
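For the hostname use-case in (2), Flume 1.x's host interceptor later made it possible to inject the agent's own hostname as a header without custom code, and that header can go either into the directory path or only into the file name, keeping one shared minute directory as Inder prefers. A sketch under those assumptions; agent and component names are illustrative, and the filePrefix variant assumes a release where `hdfs.filePrefix` supports escape sequences:

```properties
# Hypothetical agent config: the host interceptor sets a "host" header
# containing this agent's hostname (useIP = false keeps the name, not the IP).
agent1.sources.src1.interceptors = h1
agent1.sources.src1.interceptors.h1.type = host
agent1.sources.src1.interceptors.h1.useIP = false

# Option A: per-agent directories, as in Hari's suggestion
agent1.sinks.sink1.hdfs.path = /flume-data/%{host}/%Y/%m/%d/%H/%M

# Option B: shared minute directories, collector name only in the file name,
# matching Inder's preferred layout (avoids the extra consolidation hop)
# agent1.sinks.sink1.hdfs.path = /flume-data/%Y/%m/%d/%H/%M
# agent1.sinks.sink1.hdfs.filePrefix = %{host}
```

Option B also sidesteps the concurrent-append concern Mike raises: each sink writes its own distinctly named file, so no two HDFSEventSink agents ever append to the same HDFS file.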
