How about supporting something like
"host2.sources.src1.header.timestamp=true" as config?
This would override the timestamp header on host2->src1 (Avro source) for all events.
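
To make the proposal concrete, here is a rough sketch of the intended behaviour, not actual Flume code. The function names (override_enabled, stamp_on_arrival, bucket_path) are hypothetical, the config key is the one proposed above and is not an existing Flume setting, and UTC is assumed for the directory layout.

```python
from datetime import datetime, timezone


def override_enabled(config, agent, source):
    """Return True if the proposed (hypothetical) override key is set to true."""
    key = f"{agent}.sources.{source}.header.timestamp"
    return config.get(key, "false").strip().lower() == "true"


def stamp_on_arrival(headers, enabled, arrival_ms):
    """Return a copy of the event headers; when the override is enabled,
    replace any client-supplied "timestamp" with the arrival time at this agent."""
    stamped = dict(headers)
    if enabled:
        stamped["timestamp"] = str(arrival_ms)
    return stamped


def bucket_path(headers, prefix="/flume-data"):
    """Build the minute-level directory from the "timestamp" header
    (milliseconds since the epoch). UTC is an assumption of this sketch."""
    ts = datetime.fromtimestamp(int(headers["timestamp"]) / 1000, tz=timezone.utc)
    return f"{prefix}/{ts:%Y/%m/%d/%H/%M}"


# Example: the client stamped the event earlier than it arrived at the collector.
config = {"host2.sources.src1.header.timestamp": "true"}
client_headers = {"timestamp": "1334300400000"}
arrival_ms = 1334305740000

enabled = override_enabled(config, "host2", "src1")
collector_headers = stamp_on_arrival(client_headers, enabled, arrival_ms)
path = bucket_path(collector_headers)
```

With the override on, the path is derived from the collector's arrival time rather than the client's clock, which is what the minute-directory contract needs. With it off, the client's header passes through unchanged.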

- inder


On Fri, Apr 13, 2012 at 2:29 PM, Mike Percy <[email protected]> wrote:

> Funny, I thought it was doing the opposite thing. :)
>
> It should be very easy to implement what you are describing and seems like
> a common use case. We just need some decent syntax or a configuration
> setting to indicate which timestamp we are talking about.
>
> Mike
>
> On Apr 13, 2012, at 1:41 AM, Inder Pall wrote:
>
> > Mike,
> >
> > Concisely put, I want YYYY/mm/DD/HH/MM (in the path) to be the
> > timestamp of the agent running the HDFSEventSink.
> > The current code uses the timestamp header injected by the client lib
> > (running on a different box), which doesn't work for me.
> >
> > - inder
> >
> > On Fri, Apr 13, 2012 at 12:45 PM, Mike Percy <[email protected]>
> wrote:
> >
> >> Hi Inder,
> >> Can you briefly summarize what you want to do and what is missing from
> >> flume for you to do it?
> >>
> >> Seems like you could store in a structure like this with static configs:
> >> flume-data/YYYY/mm/DD/HH/MM/<streamName>.<collectorName>.<filename>.gz
> >>
> >> As you mentioned, you would have to use one HDFS sink per
> stream/collector
> >> pair to define this statically in the config files.
> >>
> >> Is the problem that you want events to be strictly contained in a log
> file
> >> named according to their internal timestamp? Is it not acceptable to go
> by
> >> event delivery time at the agent?
> >>
> >> Best,
> >> Mike
> >>
> >> On Apr 12, 2012, at 12:33 AM, Inder Pall wrote:
> >>
> >>> Mike and Hari,
> >>>
> >>> Appreciate your prompt and detailed responses.
> >>>
> >>> 1. For the timestamp header at the agent: Oozie-based consumer workflows
> >>> wait for data in a directory structure like /flume-data/YYYY/MM/DD/HH/MN/.
> >>> We have a *contract*: a minute-level directory can be consumed (is
> >>> *immutable*) once the next minute directory is available. If the
> >>> "timestamp" injected by the clientLib is used, it's difficult to guarantee
> >>> this contract (messages coming late, clocks not synchronized, etc.).
> >>>
> >>> Mike, specifically the configuration i was planning for is
> >>> ((N) ClientLib *=>* (N) event-generating agents (Avro Source +
> AvroSink))
> >>> => (M < N) collector agents (AvroSource + HDFSEventSink)
> >>>
> >>> 2. I agree that the agentPath use-case can be supported without headers
> >>> through a separate HDFSEventSink configuration. This will ensure
> >> different
> >>> agent's write to different path's(t*hereby avoiding any critical
> >>> section)*issues.
> >>>
> >>> Reason for asking was to avoid directory structure like -
> >>> */flume-data/<collector1>/YYYY/MM/DD/HH/MN*
> >>> */flume-data/<collector1>/YYYY/MM/DD/HH/MN*
> >>> ......................................
> >>> ......................................
> >>> */flume-data/<collector1>/YYYY/MM/DD/HH/MN*
> >>> and instead have
> >>>
> */flume-data/YYYY/MM/DD/HH/MN*/<collectorName>-<streamName>-<FileName>.gz
> >>> (*Adding
> >>> the collectorName avoids multiple folks writing to the same file
> issue*)
> >>>
> >>> However it's tough to obey the above mentioned contract - collector1
> has
> >>> moved forward to a new directory and collector2 is still writing to the
> >> old
> >>> minute directory. Just wanted to avoid the additional hop of moving
> data
> >>> from collector specific directories to one unified location, though i
> can
> >>> live with it.
> >>>
> >>> I don't want to do something specific here and end maintaining a
> >> different
> >>> version of FLUME :(
> >>> Let me know what you guys think, i believe as the adoption grows so
> will
> >>> use-cases which require adding/modifying headers at avroSource.
> >>>
> >>> Looking forward to hearing from you folks
> >>>
> >>>
> >>> - Inder
> >>> On Wed, Apr 11, 2012 at 11:43 PM, Mike Percy <[email protected]>
> >> wrote:
> >>>
> >>>> Well, we almost support #1, although the way to do it is pass a
> >>>> "timestamp" header at the first hop. Then you can use the BucketPath
> >>>> shorthand stuff to name the hdfs.path according to this spec (except
> for
> >>>> the agent-hostname thing).
> >>>>
> >>>> With #2 it seems reasonable to add support for an arbitrary "tag"
> header
> >>>> or something like that which one could use in the hdfs.path as well.
> >> But it
> >>>> would have to come from the first-hop agent at this point. The tag
> could
> >>>> take the place of the hostname.
> >>>>
> >>>> Something that might get Flume closer to the below vision without
> >> hacking
> >>>> the core is adding support for a plugin interface to AvroSource which
> >> can
> >>>> annotate headers. However I worry that people might take this and try
> >> to do
> >>>> all kinds of parsing and whatnot. So I think the first cut should only
> >>>> support reading & setting headers. This is basically a "routing"
> feature
> >>>> which I would argue Flume needs to be good at and flexible for.
> >>>>
> >>>> Just in case I misinterpreted the use case, I want to make sure we are
> >> not
> >>>> trying to have multiple HDFSEventSink agents append to the same HDFS
> >> file
> >>>> simultaneously, since I am pretty sure Hadoop doesn't support that.
> >>>>
> >>>> Inder, just to clarify, is this what you are doing?
> >>>>
> >>>> (N) event-generating agents (Custom Source + AvroSink) => (M < N)
> >>>> collector agents (AvroSource + AvroSink) => Load-Balancing VIP =>
> >>>> (AvroSource + HDFSEventSink) => HDFS
> >>>>
> >>>> Best,
> >>>> Mike
> >>>>
> >>>> On Apr 11, 2012, at 9:55 AM, Hari Shreedharan wrote:
> >>>>
> >>>>> Hi Inder,
> >>>>>
> >>>>> I think these use cases are quite specific to your requirements. Even
> >>>> though I did not clearly understand (2), I think that can be addressed
> >>>> through configuration, and you would not need to add any new code for
> >> that.
> >>>> I don't understand why you would want to inject a header in that case.
> >> You
> >>>> can simply have different configurations for each of the agents, with
> >>>> different sink paths. So agent A would have a sink configured to write
> >> to
> >>>> /flume-data/agenta/.… and so on.
> >>>>>
> >>>>> I don't think we have support for something like (1) as of now. It
> does
> >>>> not look like something which is very generic, and have not heard of
> >>>> someone else having such a requirement. If you want this, the only
> way I
> >>>> can see it, is to pick up AvroSource and add this support, and make it
> >>>> configurable(on/off switch in the conf).
> >>>>>
> >>>>> Thanks
> >>>>> Hari
> >>>>>
> >>>>> --
> >>>>> Hari Shreedharan
> >>>>>
> >>>>>
> >>>>> On Wednesday, April 11, 2012 at 4:26 AM, Inder Pall wrote:
> >>>>>
> >>>>>> Folks,
> >>>>>>
> >>>>>> i have two use-cases and both seem to be landing under this
> >> requirement
> >>>>>>
> >>>>>> 1. Support to publish files in HDFS in /flume-data/YYYY/MM/DD/HH/MN.
> >>>>>> Timestamp is the arrival time on this agent.
> >>>>>>>> Can be addressed by passing timestamp" in HEADERS of event. Caveat
> >> is
> >>>> i
> >>>>>>>
> >>>>>>
> >>>>>> want to pass this header at the final agent in pipeline.
> >>>>>> 2. Have multiple flume agents configured behind a VIP writing to the
> >>>> same
> >>>>>> HDFS sink path.
> >>>>>>>> One of the way's is to have the path like -
> >>>>>>>
> >>>>>>
> >>>>>> /flume-data/<flume-agent-hostname>/YYYY/MM/DD/HH/MN
> >>>>>> Again can be addressed by passing a header "hostname" at flume agent
> >> and
> >>>>>> configuring the sink path appropriately.
> >>>>>>
> >>>>>> Would appreciate any help on how to address this in a generic way in
> >>>> FLUME.
> >>>>>> Seems to be a generic use-case for anyone planning to take FLUME to
> >>>>>> production.
> >>>>>>
> >>>>>> --
> >>>>>> Thanks,
> >>>>>> - Inder
> >>>>>> Tech Platforms @Inmobi
> >>>>>> Linkedin - http://goo.gl/eR4Ub
> >>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>>
> >>>
> >>> --
> >>> Thanks,
> >>> - Inder
> >>> Tech Platforms @Inmobi
> >>> Linkedin - http://goo.gl/eR4Ub
> >>
> >>
> >
> >
> > --
> > Thanks,
> > - Inder
> >  Tech Platforms @Inmobi
> >  Linkedin - http://goo.gl/eR4Ub
>
>


-- 
Thanks,
- Inder
  Tech Platforms @Inmobi
  Linkedin - http://goo.gl/eR4Ub
