Mike / Patrick

Thanks for the replies, and sorry for the delay and if this reply seems
out of order. I was not subscribed to the mailing list when I sent my
earlier email, and I only stumbled on your messages while reading the
list history this morning. I appreciate the answers.

Patrick - I agree with you; it made sense to put it in an interceptor.
When I looked at interceptors, I thought of them as a replacement for
the decorators in the old version of Flume.

> 1. The max transaction size on the channel must currently be manually
> balanced with (or made to exceed) the batchSize setting on batching sources
> and sinks. If the number of events added or taken in a single transaction
> exceeds this maximum size, an exception will be thrown. However, if
> generating multiple events from a single event, it is no longer sufficient
> to make the batchSize less or equal to this value, and it would be easier
> to blow out your transaction size in a potentially unpredictable way,
> causing potentially confusing errors.

This makes sense.

> 2. An Event is what you might call the basic unit of "flow" in Flume. From
> the perspective of management and monitoring, having the same number of
> events enter and exit the system helps you know that your cluster is
> healthy. OTOH, when you generate a variable number of events from a single
> event in an Interceptor, it is really quite difficult to know how the data
> is flowing.

This seems like a good method for monitoring.

> 3. Since the interceptor typically runs in an I/O worker thread or in the
> only thread in a Source, doing any significant computation there will
> likely affect the overall throughput of the system.
> In my view, Interceptors as a generally applicable component are well
> suited to do header "tagging", simple transformations, and filtering, but
> they're not a good place to put batching/un-batching logic. Maybe the Exec
> Source should have a line-parsing plugin interface to allow people to take
> text lines and generate Events from them. I know this seems similar to the
> Interceptor in the context of the data flow, but I believe you are just
> trying to work around a limitation of the exec source, since it appears
> you're describing a serialization issue.

> Alternatively, one could use an HBase serializer to generate multiple
> increment / decrement operations, and just log the original line in HDFS
> (or use an EventSerializer).

This is what I'm working towards. I want a 1-for-1 entry in hdfs but
want to increment counters in hbase, so given the following input:

timestamp - 5,4,3,2,1

hive

timestamp - 5
timestamp - 4
timestamp - 3
timestamp - 2
timestamp - 1

hbase

timestamp - 5 increment
timestamp - 4 increment
timestamp - 3 increment
timestamp - 2 increment
timestamp - 1 increment

Given this, I was just planning on emitting an event early in the
pipeline with the body I was going to use in hive, sending the same
data to both hdfs and hbase, and then using a serializer on the hbase
side to increment the counters. This would let me add data to hdfs in
the format I plan to consume it in, without managing two serializers.
My plan for the hbase serializer was literally to generate a key and an
increment per record based on the input, so only a couple lines of code.
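
To be concrete, here's roughly what I had in mind, assuming the
flume-ng HbaseEventSerializer interface. The class name and the "count"
qualifier are made up, and the event body is used as the row key
verbatim just to keep the sketch short - untested:

import java.util.Collections;
import java.util.List;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.conf.ComponentConfiguration;
import org.apache.flume.sink.hbase.HbaseEventSerializer;
import org.apache.hadoop.hbase.client.Increment;
import org.apache.hadoop.hbase.client.Row;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: one increment per event, keyed off the event body.
public class CounterIncrementSerializer implements HbaseEventSerializer {

  private byte[] columnFamily;
  private byte[] rowKey;

  @Override
  public void configure(Context context) {
    // nothing to configure in this sketch
  }

  @Override
  public void configure(ComponentConfiguration conf) {
  }

  @Override
  public void initialize(Event event, byte[] columnFamily) {
    this.columnFamily = columnFamily;
    // a real implementation would parse/normalize the key here;
    // e.g. the body "timestamp - 5" is used as the row key as-is
    this.rowKey = event.getBody();
  }

  @Override
  public List<Row> getActions() {
    // no puts; the raw line goes to hdfs on the other leg of the flow
    return Collections.<Row>emptyList();
  }

  @Override
  public List<Increment> getIncrements() {
    Increment inc = new Increment(rowKey);
    inc.addColumn(columnFamily, Bytes.toBytes("count"), 1L);
    return Collections.singletonList(inc);
  }

  @Override
  public void close() {
  }
}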

> I pondered this a bit over the last day or so and I'm kind of lukewarm on
> adding preconditions checks at this time. The reason I didn't do it
> initially is that while I wanted a particular contract for that component,
> in order to make Interceptors viable to maintain and understand with the
> current design of the Flume core, I wasn't sure if it would be sufficient
> for all future use cases. So if someone wants to do something that breaks
> that contract, then they are "on their own", doing stuff that may break in
> future implementations. If they're willing to accept that risk then they
> have the freedom to maybe do something novel and awesome, which might
> prompt us to add a different kind of extension mechanism in the future to
> support whatever that use case is.

I think there should be an approved method for this case. A different
extension point that could perform processing like this would be
helpful. As I said above, when I looked at interceptors I thought of
them as a replacement for the decorators in the old version of Flume.
We have a lot of code that takes a log entry and replaces the body with
a protocol buffer representation, and I prefer to run that code on a
tier upstream from the web server. Interceptors would work fine for the
one-in, one-out case.
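
For that one-in, one-out case, something like this is about all I'd
need (a rough sketch against the Interceptor interface; encodeBody() is
just a stand-in for our protobuf code, not a real method):

import java.util.ArrayList;
import java.util.List;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

// Sketch: rewrite each event body in place, one event in, one event out.
public class BodyRewritingInterceptor implements Interceptor {

  @Override
  public void initialize() {
  }

  @Override
  public Event intercept(Event event) {
    event.setBody(encodeBody(event.getBody()));
    return event;
  }

  @Override
  public List<Event> intercept(List<Event> events) {
    List<Event> out = new ArrayList<Event>(events.size());
    for (Event e : events) {
      out.add(intercept(e));
    }
    return out;
  }

  @Override
  public void close() {
  }

  private byte[] encodeBody(byte[] raw) {
    // placeholder for the code that replaces the log line with its
    // protocol buffer representation
    return raw;
  }

  public static class Builder implements Interceptor.Builder {
    @Override
    public void configure(Context context) {
    }

    @Override
    public Interceptor build() {
      return new BodyRewritingInterceptor();
    }
  }
}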

On Fri, Aug 10, 2012 at 1:07 PM, Jeremy Custenborder
<[email protected]> wrote:
> Hello All,
>
> I'm wondering if you could provide some guidance for me. One of the
> inputs I'm working with batches several entries to a single event.
> This is a lot simpler than my data, but it provides an easy example:
>
> timestamp - 5,4,3,2,1
> timestamp - 9,7,5,5,6
>
> If I tail the file this results in 2 events being generated. This
> example has the data for 10 events.
>
> Here is high level what I want to accomplish.
> (web server - agent 1)
> exec source tail -f /<some file path>
> collector-client to (agent 2)
>
> (collector - agent 2)
> collector-server
> Custom Interceptor (input 1 event, output n events)
> Multiplex to
> hdfs
> hbase
>
> An interceptor looked like the most logical spot for me to add this.
> Is there a better place to add this functionality? Has anyone run into
> a similar case?
>
> Looking at the docs for Interceptor.intercept(List<Event> events), it
> says "Output list of events. The size of output list MUST NOT BE
> GREATER than the size of the input list (i.e. transformation and
> removal ONLY)", which tells me not to emit more events than given.
> intercept(Event event) only returns a single event so I can't use it
> there either. Why is there a requirement to only return 1 for 1?
>
> For now I'm implementing a custom source that will handle generating
> multiple events from the events coming in on the web server. My
> preference was to do this transformation on the collector agent before
> I hand off to hdfs and hbase. I know another alternative would be to
> implement custom RPC but I would prefer not to do that. I would prefer
> to rely on what is currently available.
>
> Thanks!
> j
