Funny, I thought it was doing the opposite thing. :) It should be very easy to implement what you are describing, and it seems like a common use case. We just need some decent syntax or a configuration setting to indicate which timestamp we are talking about: the one in the event's timestamp header or the local time at the agent running the sink.

Mike

On Apr 13, 2012, at 1:41 AM, Inder Pall wrote:

> Mike,
>
> concisely put i want to have YYYY/mm/DD/HH/MM(in the path) to be the
> time-stamp of agent running the HDFSEventSink.
> Current code uses timestamp header which is injected by client lib(running
> on a different box) which doesn't work for me.
>
> - inder
>
> On Fri, Apr 13, 2012 at 12:45 PM, Mike Percy <[email protected]> wrote:
>
>> Hi Inder,
>> Can you briefly summarize what you want to do and what is missing from
>> flume for you to do it?
>>
>> Seems like you could store in a structure like this with static configs:
>> flume-data/YYYY/mm/DD/HH/MM/<streamName>.<collectorName>.<filename>.gz
>>
>> As you mentioned, you would have to use one HDFS sink per stream/collector
>> pair to define this statically in the config files.
>>
>> Is the problem that you want events to be strictly contained in a log file
>> named according to their internal timestamp? Is it not acceptable to go by
>> event delivery time at the agent?
>>
>> Best,
>> Mike
>>
>> On Apr 12, 2012, at 12:33 AM, Inder Pall wrote:
>>
>>> Mike and Hari,
>>>
>>> Appreciate your prompt and detailed responses.
>>>
>>> 1. For timestamp header at agent - OOZIE based consumer work flows wait
>>> for data in a directory structure like - */flume-data/YYYY/MM/DD/HH/MN/.*. we
>>> have a *contract* - a minute level directory can be consumed(is *immutable*)
>>> once the next minute directory is available. If "*timestamp*" injected by
>>> *clientLib* is used it's difficult to guarantee this contract *(messages
>>> coming late, clocks not synchronized, etc)*.
>>>
>>> Mike, specifically the configuration i was planning for is
>>> ((N) ClientLib *=>* (N) event-generating agents (Avro Source + AvroSink))
>>> => (M < N) collector agents (AvroSource + HDFSEventSink)
>>>
>>> 2. I agree that the agentPath use-case can be supported without headers
>>> through a separate HDFSEventSink configuration.
>>> This will ensure different
>>> agent's write to different path's (*thereby avoiding any critical
>>> section*) issues.
>>>
>>> Reason for asking was to avoid directory structure like -
>>> */flume-data/<collector1>/YYYY/MM/DD/HH/MN*
>>> */flume-data/<collector1>/YYYY/MM/DD/HH/MN*
>>> ......................................
>>> ......................................
>>> */flume-data/<collector1>/YYYY/MM/DD/HH/MN*
>>> and instead have
>>> */flume-data/YYYY/MM/DD/HH/MN*/<collectorName>-<streamName>-<FileName>.gz
>>> (*Adding
>>> the collectorName avoids multiple folks writing to the same file issue*)
>>>
>>> However it's tough to obey the above mentioned contract - collector1 has
>>> moved forward to a new directory and collector2 is still writing to the old
>>> minute directory. Just wanted to avoid the additional hop of moving data
>>> from collector specific directories to one unified location, though i can
>>> live with it.
>>>
>>> I don't want to do something specific here and end maintaining a different
>>> version of FLUME :(
>>> Let me know what you guys think, i believe as the adoption grows so will
>>> use-cases which require adding/modifying headers at avroSource.
>>>
>>> Looking forward to hearing from you folks
>>>
>>> - Inder
>>>
>>> On Wed, Apr 11, 2012 at 11:43 PM, Mike Percy <[email protected]> wrote:
>>>
>>>> Well, we almost support #1, although the way to do it is pass a
>>>> "timestamp" header at the first hop. Then you can use the BucketPath
>>>> shorthand stuff to name the hdfs.path according to this spec (except for
>>>> the agent-hostname thing).
>>>>
>>>> With #2 it seems reasonable to add support for an arbitrary "tag" header
>>>> or something like that which one could use in the hdfs.path as well. But it
>>>> would have to come from the first-hop agent at this point. The tag could
>>>> take the place of the hostname.
>>>>
>>>> Something that might get Flume closer to the below vision without hacking
>>>> the core is adding support for a plugin interface to AvroSource which can
>>>> annotate headers. However I worry that people might take this and try to do
>>>> all kinds of parsing and whatnot. So I think the first cut should only
>>>> support reading & setting headers. This is basically a "routing" feature
>>>> which I would argue Flume needs to be good at and flexible for.
>>>>
>>>> Just in case I misinterpreted the use case, I want to make sure we are not
>>>> trying to have multiple HDFSEventSink agents append to the same HDFS file
>>>> simultaneously, since I am pretty sure Hadoop doesn't support that.
>>>>
>>>> Inder, just to clarify, is this what you are doing?
>>>>
>>>> (N) event-generating agents (Custom Source + AvroSink) => (M < N)
>>>> collector agents (AvroSource + AvroSink) => Load-Balancing VIP =>
>>>> (AvroSource + HDFSEventSink) => HDFS
>>>>
>>>> Best,
>>>> Mike
>>>>
>>>> On Apr 11, 2012, at 9:55 AM, Hari Shreedharan wrote:
>>>>
>>>>> Hi Inder,
>>>>>
>>>>> I think these use cases are quite specific to your requirements. Even
>>>>> though I did not clearly understand (2), I think that can be addressed
>>>>> through configuration, and you would not need to add any new code for that.
>>>>> I don't understand why you would want to inject a header in that case. You
>>>>> can simply have different configurations for each of the agents, with
>>>>> different sink paths. So agent A would have a sink configured to write to
>>>>> /flume-data/agenta/.… and so on.
>>>>>
>>>>> I don't think we have support for something like (1) as of now. It does
>>>>> not look like something which is very generic, and have not heard of
>>>>> someone else having such a requirement. If you want this, the only way I
>>>>> can see it, is to pick up AvroSource and add this support, and make it
>>>>> configurable(on/off switch in the conf).
>>>>>
>>>>> Thanks
>>>>> Hari
>>>>>
>>>>> --
>>>>> Hari Shreedharan
>>>>>
>>>>> On Wednesday, April 11, 2012 at 4:26 AM, Inder Pall wrote:
>>>>>
>>>>>> Folks,
>>>>>>
>>>>>> i have two use-cases and both seem to be landing under this requirement
>>>>>>
>>>>>> 1. Support to publish files in HDFS in /flume-data/YYYY/MM/DD/HH/MN.
>>>>>> Timestamp is the arrival time on this agent.
>>>>>> >> Can be addressed by passing "timestamp" in HEADERS of event. Caveat is i
>>>>>> want to pass this header at the final agent in pipeline.
>>>>>> 2. Have multiple flume agents configured behind a VIP writing to the same
>>>>>> HDFS sink path.
>>>>>> >> One of the way's is to have the path like -
>>>>>> /flume-data/<flume-agent-hostname>/YYYY/MM/DD/HH/MN
>>>>>> Again can be addressed by passing a header "hostname" at flume agent and
>>>>>> configuring the sink path appropriately.
>>>>>>
>>>>>> Would appreciate any help on how to address this in a generic way in FLUME.
>>>>>> Seems to be a generic use-case for anyone planning to take FLUME to
>>>>>> production.
>>>>>>
>>>>>> --
>>>>>> Thanks,
>>>>>> - Inder
>>>>>> Tech Platforms @Inmobi
>>>>>> Linkedin - http://goo.gl/eR4Ub
>>>
>>> --
>>> Thanks,
>>> - Inder
>>> Tech Platforms @Inmobi
>>> Linkedin - http://goo.gl/eR4Ub
>
> --
> Thanks,
> - Inder
> Tech Platforms @Inmobi
> Linkedin - http://goo.gl/eR4Ub
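
For readers following the thread, the two bucketing choices under discussion (the "timestamp" header injected upstream versus the clock of the agent running HDFSEventSink) can be sketched roughly as below. This is plain Python for illustration only, not Flume code: choose_timestamp, bucket_path, to_millis and the use_agent_time switch are hypothetical names, and the sketch assumes the header holds epoch milliseconds as a string and that paths are formatted in UTC.

```python
from datetime import datetime, timezone


def choose_timestamp(headers, agent_now_millis, use_agent_time):
    """Pick the epoch-millis value that decides the output directory.

    use_agent_time is a hypothetical on/off switch standing in for the
    "configuration setting" mentioned in the thread.
    """
    if use_agent_time:
        return agent_now_millis
    # Assumption: the header carries epoch milliseconds as a string.
    return int(headers["timestamp"])


def bucket_path(epoch_millis, collector, stream, filename, root="/flume-data"):
    """Build root/YYYY/MM/DD/HH/MN/<collector>-<stream>-<filename>.gz (UTC)."""
    ts = datetime.fromtimestamp(epoch_millis / 1000, tz=timezone.utc)
    return f"{root}/{ts:%Y/%m/%d/%H/%M}/{collector}-{stream}-{filename}.gz"


def to_millis(dt):
    """Convert an aware datetime to epoch milliseconds."""
    return int(dt.timestamp() * 1000)


# An event stamped by the client at 01:41:30 that reaches the collector at 01:43:10.
sent = to_millis(datetime(2012, 4, 13, 1, 41, 30, tzinfo=timezone.utc))
arrived = to_millis(datetime(2012, 4, 13, 1, 43, 10, tzinfo=timezone.utc))
headers = {"timestamp": str(sent)}

by_header = bucket_path(choose_timestamp(headers, arrived, False),
                        "collector1", "streamA", "events")
by_agent = bucket_path(choose_timestamp(headers, arrived, True),
                       "collector1", "streamA", "events")

print(by_header)  # lands in the 01/41 directory
print(by_agent)   # lands in the 01/43 directory
```

With the header timestamp, the late event is written into the 01/41 directory after 01/42 and 01/43 already exist, which is the case that breaks the "a minute directory is immutable once the next one appears" contract. With the agent clock, each collector only ever writes to its current minute, though two collectors with unsynchronized clocks can still disagree about which minute that is.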
