I see, ok. Thanks for the clarifications.

On Wed, May 23, 2012 at 2:32 PM, Mike Percy <[email protected]> wrote:
> Hi Ahmed,
>
> There are some flaws with what you are describing. Depending on how
> accurate you want it to be, you may or may not care about them.
>
> Using a streaming system like Flume to aggregate data can be problematic
> because the streaming system doesn't see all the events at once, so at best
> you would be able to increment counters or something in an online storage
> system (such as HBase). Also, because Flume guarantees delivery using
> at-least-once semantics, you can get duplicate events, and it's not
> straightforward to de-duplicate events in a streaming system. From a Flume
> perspective, we strive for duplicates to be rare and to fix bugs that cause
> an excessive number of duplicates, but in general they are within Flume's
> contract and they will occasionally occur. To sum this up, if you are
> simply incrementing counters without de-duplicating events, your numbers
> will diverge from reality over time.
>
> If you don't need the aggregate data in real time for your application, I
> suggest using MapReduce to de-duplicate and aggregate your events in order
> to generate your reports or downstream data. In this case, you would simply
> store all of your events onto HDFS directly using Flume and then use a
> traditional batch-processing model (which can be pretty fast, on the order
> of 5-minute updates, with something like Oozie managing your workflows).
>
> If you really need real-time metrics (I recommend doing this only if
> you're sure you need it), then I would suggest using a Batch + Realtime
> architecture, in which your real-time system increments counters (for
> example, using Flume's HBase sink) and then your MapReduce batch jobs come
> along after the fact to correct historical data that has been skewed by
> duplicates (they rewrite the history to reflect reality).
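[The de-duplicate-then-aggregate step Mike describes could be sketched roughly like this. This is a toy in-memory version of what the MapReduce job would do; the `eventId` and `page` field names are assumptions for illustration, not part of Flume's event schema.]

```python
from collections import Counter

def dedupe_and_count(events):
    """De-duplicate events by a unique id, then count page views per page.

    `events` is an iterable of dicts like {"eventId": ..., "page": ...};
    both field names are hypothetical. In a real pipeline this logic would
    run as a MapReduce job over the events stored on HDFS.
    """
    seen = set()
    counts = Counter()
    for event in events:
        if event["eventId"] in seen:
            continue  # at-least-once delivery: drop the redelivered duplicate
        seen.add(event["eventId"])
        counts[event["page"]] += 1
    return counts

events = [
    {"eventId": "a1", "page": "/home"},
    {"eventId": "a1", "page": "/home"},  # duplicate delivery of a1
    {"eventId": "b2", "page": "/about"},
]
# dedupe_and_count(events) -> Counter({"/home": 1, "/about": 1})
```

[A naive counter that skipped the `seen` check would report `/home: 2` here, which is exactly the divergence-over-time problem described above.]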
> There is a presentation by Nathan Marz that describes a batch + real-time
> architecture here:
> http://www.slideshare.net/nathanmarz/the-secrets-of-building-realtime-big-data-systems
>
> Hope that helps.
>
> Mike
>
> On Wednesday, May 23, 2012 at 8:17 AM, S Ahmed wrote:
>
> > Is this something that is possible w/o altering the source?
> > Is it a good idea?
> >
> > On Fri, May 18, 2012 at 12:37 PM, S Ahmed <[email protected]> wrote:
> >
> > > If I am storing page view statistics using flume, which flushes to a
> > > file every x seconds, is there an event/hook available so that I can
> > > pre-filter the collection?
> > >
> > > Example:
> > >
> > > I want to roll up the stats for a given minute; this will reduce the
> > > number of messages substantially. (I can filter using a dictionary
> > > where the sessionId is the key)
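[The per-minute roll-up Ahmed asks about could be sketched like this. It is a toy in-memory version; in practice the logic would live in a custom Flume component or a downstream job, and the `timestamp`/`sessionId` field names are assumptions for illustration.]

```python
from collections import defaultdict

def rollup_by_minute(events):
    """Collapse per-view events into one count per (minute, sessionId).

    `events` is an iterable of dicts with an epoch-second "timestamp" and a
    "sessionId" (both hypothetical field names). Returns a dict mapping
    (minute_start_epoch, sessionId) -> view count.
    """
    counts = defaultdict(int)
    for event in events:
        # Truncate the timestamp down to the start of its minute.
        minute = event["timestamp"] - event["timestamp"] % 60
        counts[(minute, event["sessionId"])] += 1
    return dict(counts)

events = [
    {"timestamp": 1337789520, "sessionId": "s1"},
    {"timestamp": 1337789530, "sessionId": "s1"},  # same minute, same session
    {"timestamp": 1337789550, "sessionId": "s2"},
]
# Three raw events collapse to two rolled-up records:
# {(1337789520, "s1"): 2, (1337789520, "s2"): 1}
```

[Note that this pre-aggregation makes the duplicate problem from earlier in the thread worse, not better: once views are rolled up into counts, a redelivered batch can no longer be de-duplicated by event id, which is why Mike suggests landing raw events on HDFS and aggregating in batch.]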
