Mike, concisely put, i want YYYY/mm/DD/HH/MM (in the path) to be the timestamp of the agent running the HDFSEventSink. The current code uses the "timestamp" header, which is injected by the client lib (running on a different box), and that doesn't work for me.
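What Inder asks for here is what later Flume 1.x releases expose directly: a timestamp interceptor bound to the terminal agent's source overwrites the "timestamp" header with the event's arrival time at that agent, and the HDFS sink's path escapes then bucket by that local clock. A minimal sketch of such a collector config; the agent and component names (collector1, avro-in, ch1, hdfs-out) are illustrative assumptions:

```properties
# Hypothetical collector agent; names are examples, not from the thread.
collector1.sources = avro-in
collector1.channels = ch1
collector1.sinks = hdfs-out

collector1.sources.avro-in.type = avro
collector1.sources.avro-in.bind = 0.0.0.0
collector1.sources.avro-in.port = 4141
collector1.sources.avro-in.channels = ch1
# The timestamp interceptor stamps each event with arrival time at THIS
# agent, so bucketing reflects the HDFSEventSink box's clock rather than
# whatever the client lib injected upstream.
collector1.sources.avro-in.interceptors = ts
collector1.sources.avro-in.interceptors.ts.type = timestamp

collector1.sinks.hdfs-out.type = hdfs
collector1.sinks.hdfs-out.channel = ch1
# %Y/%m/%d/%H/%M expand from the "timestamp" header set above.
collector1.sinks.hdfs-out.hdfs.path = /flume-data/%Y/%m/%d/%H/%M
```

Later releases also add `hdfs.useLocalTimeStamp = true` on the HDFS sink, which makes the sink use its own clock for the path escapes without needing any interceptor at all.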
- inder

On Fri, Apr 13, 2012 at 12:45 PM, Mike Percy <[email protected]> wrote:
> Hi Inder,
> Can you briefly summarize what you want to do and what is missing from
> flume for you to do it?
>
> Seems like you could store in a structure like this with static configs:
> flume-data/YYYY/mm/DD/HH/MM/<streamName>.<collectorName>.<filename>.gz
>
> As you mentioned, you would have to use one HDFS sink per stream/collector
> pair to define this statically in the config files.
>
> Is the problem that you want events to be strictly contained in a log file
> named according to their internal timestamp? Is it not acceptable to go by
> event delivery time at the agent?
>
> Best,
> Mike
>
> On Apr 12, 2012, at 12:33 AM, Inder Pall wrote:
>
> > Mike and Hari,
> >
> > Appreciate your prompt and detailed responses.
> >
> > 1. For the timestamp header at the agent - OOZIE-based consumer
> > workflows wait for data in a directory structure like
> > /flume-data/YYYY/MM/DD/HH/MN/. We have a *contract*: a minute-level
> > directory can be consumed (is *immutable*) once the next minute
> > directory is available. If the "timestamp" injected by the clientLib is
> > used, it's difficult to guarantee this contract (messages coming late,
> > clocks not synchronized, etc).
> >
> > Mike, specifically the configuration i was planning for is
> > ((N) ClientLib => (N) event-generating agents (Avro Source + AvroSink))
> > => (M < N) collector agents (AvroSource + HDFSEventSink)
> >
> > 2. I agree that the agentPath use-case can be supported without headers
> > through a separate HDFSEventSink configuration. This will ensure
> > different agents write to different paths (thereby avoiding any
> > critical-section issues).
> >
> > Reason for asking was to avoid a directory structure like:
> > /flume-data/<collector1>/YYYY/MM/DD/HH/MN
> > /flume-data/<collector2>/YYYY/MM/DD/HH/MN
> > ...
> > /flume-data/<collectorN>/YYYY/MM/DD/HH/MN
> > and instead have
> > /flume-data/YYYY/MM/DD/HH/MN/<collectorName>-<streamName>-<FileName>.gz
> > (Adding the collectorName avoids the issue of multiple folks writing to
> > the same file.)
> >
> > However, it's tough to obey the above-mentioned contract - collector1
> > has moved forward to a new directory and collector2 is still writing to
> > the old minute directory. Just wanted to avoid the additional hop of
> > moving data from collector-specific directories to one unified
> > location, though i can live with it.
> >
> > I don't want to do something specific here and end up maintaining a
> > different version of FLUME :(
> > Let me know what you guys think. i believe as the adoption grows, so
> > will use-cases which require adding/modifying headers at AvroSource.
> >
> > Looking forward to hearing from you folks
> >
> > - Inder
> >
> > On Wed, Apr 11, 2012 at 11:43 PM, Mike Percy <[email protected]> wrote:
> >
> >> Well, we almost support #1, although the way to do it is to pass a
> >> "timestamp" header at the first hop. Then you can use the BucketPath
> >> shorthand stuff to name the hdfs.path according to this spec (except
> >> for the agent-hostname thing).
> >>
> >> With #2 it seems reasonable to add support for an arbitrary "tag"
> >> header or something like that which one could use in the hdfs.path as
> >> well. But it would have to come from the first-hop agent at this
> >> point. The tag could take the place of the hostname.
> >>
> >> Something that might get Flume closer to the below vision without
> >> hacking the core is adding support for a plugin interface to
> >> AvroSource which can annotate headers. However I worry that people
> >> might take this and try to do all kinds of parsing and whatnot. So I
> >> think the first cut should only support reading & setting headers.
> >> This is basically a "routing" feature which I would argue Flume needs
> >> to be good at and flexible for.
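The read-and-set-headers hook Mike sketches above is essentially what Flume 1.x's interceptor mechanism later provided. As one hedged example, a static interceptor bound to a first-hop source can attach an arbitrary "tag" header, which the HDFS sink can then reference with `%{...}` in `hdfs.path`; the agent, source, and header names below are illustrative assumptions:

```properties
# Hypothetical first-hop agent snippet; agent1/src1/tag are examples.
# The static interceptor attaches a fixed key/value header to every event.
agent1.sources.src1.interceptors = tag1
agent1.sources.src1.interceptors.tag1.type = static
agent1.sources.src1.interceptors.tag1.key = tag
agent1.sources.src1.interceptors.tag1.value = stream-a

# Downstream, the HDFS sink can route on that header, e.g.:
# collector1.sinks.hdfs-out.hdfs.path = /flume-data/%{tag}/%Y/%m/%d/%H/%M
```

This stays within the "only read & set headers" boundary Mike argues for: no payload parsing, just routing metadata.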
> >>
> >> Just in case I misinterpreted the use case, I want to make sure we
> >> are not trying to have multiple HDFSEventSink agents append to the
> >> same HDFS file simultaneously, since I am pretty sure Hadoop doesn't
> >> support that.
> >>
> >> Inder, just to clarify, is this what you are doing?
> >>
> >> (N) event-generating agents (Custom Source + AvroSink) => (M < N)
> >> collector agents (AvroSource + AvroSink) => Load-Balancing VIP =>
> >> (AvroSource + HDFSEventSink) => HDFS
> >>
> >> Best,
> >> Mike
> >>
> >> On Apr 11, 2012, at 9:55 AM, Hari Shreedharan wrote:
> >>
> >>> Hi Inder,
> >>>
> >>> I think these use cases are quite specific to your requirements.
> >>> Even though I did not clearly understand (2), I think that can be
> >>> addressed through configuration, and you would not need to add any
> >>> new code for that. I don't understand why you would want to inject a
> >>> header in that case. You can simply have different configurations
> >>> for each of the agents, with different sink paths. So agent A would
> >>> have a sink configured to write to /flume-data/agenta/... and so on.
> >>>
> >>> I don't think we have support for something like (1) as of now. It
> >>> does not look like something which is very generic, and I have not
> >>> heard of someone else having such a requirement. If you want this,
> >>> the only way I can see it is to pick up AvroSource and add this
> >>> support, and make it configurable (on/off switch in the conf).
> >>>
> >>> Thanks
> >>> Hari
> >>>
> >>> --
> >>> Hari Shreedharan
> >>>
> >>> On Wednesday, April 11, 2012 at 4:26 AM, Inder Pall wrote:
> >>>
> >>>> Folks,
> >>>>
> >>>> i have two use-cases and both seem to be landing under this
> >>>> requirement
> >>>>
> >>>> 1. Support to publish files in HDFS in /flume-data/YYYY/MM/DD/HH/MN.
> >>>> Timestamp is the arrival time at this agent.
> >>>> Can be addressed by passing "timestamp" in HEADERS of the event.
> >>>> Caveat is i want to pass this header at the final agent in the
> >>>> pipeline.
> >>>> 2. Have multiple flume agents configured behind a VIP writing to
> >>>> the same HDFS sink path.
> >>>> One of the ways is to have a path like
> >>>> /flume-data/<flume-agent-hostname>/YYYY/MM/DD/HH/MN
> >>>> Again, can be addressed by passing a header "hostname" at the flume
> >>>> agent and configuring the sink path appropriately.
> >>>>
> >>>> Would appreciate any help on how to address this in a generic way
> >>>> in FLUME. Seems to be a generic use-case for anyone planning to
> >>>> take FLUME to production.
> >>>>
> >>>> --
> >>>> Thanks,
> >>>> - Inder
> >>>> Tech Platforms @Inmobi
> >>>> Linkedin - http://goo.gl/eR4Ub
> >
> > --
> > Thanks,
> > - Inder
> > Tech Platforms @Inmobi
> > Linkedin - http://goo.gl/eR4Ub

--
Thanks,
- Inder
Tech Platforms @Inmobi
Linkedin - http://goo.gl/eR4Ub
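For the hostname use-case in (2), Flume 1.x's host interceptor later made it possible to inject the agent's own hostname as a header without custom code, and that header can go either into the directory path or only into the file name, keeping one shared minute directory as Inder prefers. A sketch under those assumptions; agent and component names are illustrative, and the filePrefix variant assumes a release where `hdfs.filePrefix` supports escape sequences:

```properties
# Hypothetical agent config: the host interceptor sets a "host" header
# containing this agent's hostname (useIP = false keeps the name, not the IP).
agent1.sources.src1.interceptors = h1
agent1.sources.src1.interceptors.h1.type = host
agent1.sources.src1.interceptors.h1.useIP = false

# Option A: per-agent directories, as in Hari's suggestion
agent1.sinks.sink1.hdfs.path = /flume-data/%{host}/%Y/%m/%d/%H/%M

# Option B: shared minute directories, collector name only in the file name,
# matching Inder's preferred layout (avoids the extra consolidation hop)
# agent1.sinks.sink1.hdfs.path = /flume-data/%Y/%m/%d/%H/%M
# agent1.sinks.sink1.hdfs.filePrefix = %{host}
```

Option B also sidesteps the concurrent-append concern Mike raises: each sink writes its own distinctly named file, so no two HDFSEventSink agents ever append to the same HDFS file.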
