I put that comment there for a few reasons that I can recall off the top of
my head (I should have done a better job documenting this when I was
writing the code):

1. The max transaction size on the channel must currently be manually
balanced with (or made to exceed) the batchSize setting on batching sources
and sinks. If the number of events added or taken in a single transaction
exceeds this maximum size, an exception will be thrown. However, if an
interceptor generates multiple events from a single event, it is no longer
sufficient to make the batchSize less than or equal to this value, and it
becomes easy to blow out your transaction size in an unpredictable way,
causing confusing errors.

2. An Event is what you might call the basic unit of "flow" in Flume. From
the perspective of management and monitoring, having the same number of
events enter and exit the system helps you know that your cluster is
healthy. OTOH, when you generate a variable number of events from a single
event in an Interceptor, it is really quite difficult to know how the data
is flowing.

3. Since the interceptor typically runs in an I/O worker thread or in the
only thread in a Source, doing any significant computation there will
likely affect the overall throughput of the system.
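To make the sizing in point 1 concrete, here is a rough sketch of the
relationship in an agent's properties file (the agent/channel/sink names a1,
c1, k1 are placeholders; the exact keys depend on your channel and sink
types):

```properties
# Memory channel: transactionCapacity bounds how many events a single
# put or take transaction may contain.
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000

# HDFS sink: batchSize must stay <= the channel's transactionCapacity,
# or the sink's take transaction can overflow and throw.
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.batchSize = 1000
```

If an interceptor could multiply events, a put of batchSize events on the
source side could expand past transactionCapacity, which is exactly the
unpredictable failure mode described above.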

In my view, Interceptors as a generally applicable component are well
suited to do header "tagging", simple transformations, and filtering, but
they're not a good place to put batching/un-batching logic. Maybe the Exec
Source should have a line-parsing plugin interface to allow people to take
text lines and generate Events from them. I know this seems similar to the
Interceptor in the context of the data flow, but I believe you are just
trying to work around a limitation of the exec source, since it appears
you're describing a serialization issue.
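For what it's worth, the header "tagging" I have in mind fits the
one-in-one-out contract naturally. This is a simplified, self-contained
sketch: the Event and HostTagInterceptor types below are stand-ins for
Flume's org.apache.flume.Event and
org.apache.flume.interceptor.Interceptor, not the real classes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TaggingSketch {
    // Simplified stand-in for org.apache.flume.Event.
    static class Event {
        final Map<String, String> headers = new HashMap<>();
        final byte[] body;
        Event(byte[] body) { this.body = body; }
    }

    // Simplified stand-in for an Interceptor: one event in, one event
    // out, so event counts are preserved end to end.
    static class HostTagInterceptor {
        private final String hostname;
        HostTagInterceptor(String hostname) { this.hostname = hostname; }

        Event intercept(Event event) {
            event.headers.put("host", hostname); // tag, don't split
            return event;
        }

        List<Event> intercept(List<Event> events) {
            List<Event> out = new ArrayList<>(events.size());
            for (Event e : events) {
                out.add(intercept(e)); // output size == input size
            }
            return out;
        }
    }

    public static void main(String[] args) {
        HostTagInterceptor it = new HostTagInterceptor("web01");
        List<Event> in = new ArrayList<>();
        in.add(new Event("5,4,3,2,1".getBytes()));
        in.add(new Event("9,7,5,5,6".getBytes()));
        List<Event> out = it.intercept(in);
        System.out.println(out.size() + " " + out.get(0).headers.get("host"));
    }
}
```

Note the list form never grows the output, which is the invariant the
Interceptor docs are asking you to preserve.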

Alternatively, one could use an HBase serializer to generate multiple
increment / decrement operations, and just log the original line in HDFS
(or use an EventSerializer).

Regards,
Mike

On Fri, Aug 10, 2012 at 5:15 PM, Patrick Wendell <[email protected]> wrote:

> to clarify - I mean I think it's within the scope of the design
> intentions. I agree that it is currently disallowed (at least in
> documentation).
>
> On Fri, Aug 10, 2012 at 5:14 PM, Patrick Wendell <[email protected]>
> wrote:
> > Hey Jeremy,
> >
> > That comment has been in the code now for some time, but I don't think
> > it is actually enforced anywhere programmatically. I think the idea was
> > just that if you are writing something which is capable of generating
> > new event data it should be in a source - though I'm also curious to
> > hear why this was put in there.
> >
> > IMHO, doing some type of event splitting seems within the scope of how
> > interceptors are used.
> >
> > - Patrick
> >
> > On Fri, Aug 10, 2012 at 11:07 AM, Jeremy Custenborder
> > <[email protected]> wrote:
> >> Hello All,
> >>
> >> I'm wondering if you could provide some guidance for me. One of the
> >> inputs I'm working with batches several entries into a single event.
> >> This is a lot simpler than my data but it provides an easy example.
> >> For example:
> >>
> >> timestamp - 5,4,3,2,1
> >> timestamp - 9,7,5,5,6
> >>
> >> If I tail the file this results in 2 events being generated. This
> >> example has the data for 10 events.
> >>
> >> Here is high level what I want to accomplish.
> >> (web server - agent 1)
> >> exec source tail -f /<some file path>
> >> collector-client to (agent 2)
> >>
> >> (collector - agent 2)
> >> collector-server
> >> Custom Interceptor (input 1 event, output n events)
> >> Multiplex to
> >> hdfs
> >> hbase
> >>
> >> An interceptor looked like the most logical spot for me to add this.
> >> Is there a better place to add this functionality? Has anyone run into
> >> a similar case?
> >>
> >> Looking at the docs for Interceptor.intercept(List<Event> events), it
> >> says "Output list of events. The size of output list MUST NOT BE
> >> GREATER than the size of the input list (i.e. transformation and
> >> removal ONLY)." which tells me not to emit more events than given.
> >> intercept(Event event) only returns a single event so I can't use it
> >> there either. Why is there a requirement to only return 1 for 1?
> >>
> >> For now I'm implementing a custom source that will handle generating
> >> multiple events from the events coming in on the web server. My
> >> preference was to do this transformation on the collector agent before
> >> I hand off to hdfs and hbase. I know another alternative would be to
> >> implement custom RPC but I would prefer not to do that. I would prefer
> >> to rely on what is currently available.
> >>
> >> Thanks!
> >> j
>
