Hi Ahmed, 
There are some flaws with what you are describing. Depending on how accurate 
you want it to be, you may or may not care about them.

Using a streaming system like Flume to aggregate data can be problematic 
because the streaming system never sees all the events at once, so the best 
you can do is increment counters in an online storage system (such as HBase). 
Also, because Flume guarantees delivery using at-least-once semantics, you can 
get duplicate events, and de-duplicating events in a streaming system is not 
straightforward. From a Flume perspective, we strive to keep duplicates rare 
and to fix bugs that cause an excessive number of them, but in general they 
are within Flume's contract and will occasionally occur. To sum up: if you 
simply increment counters without de-duplicating events, your numbers will 
diverge from reality over time.
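
To make that concrete, here is a rough (untested) sketch of what the counter 
path might look like against HBase. The table name "pageviews", the column 
family "c", and the row key layout are placeholders I made up, not anything 
Flume or HBase gives you out of the box:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class PageViewCounter {
    private final HTable table;

    public PageViewCounter() throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // "pageviews" / "c" are made-up names for this sketch
        table = new HTable(conf, "pageviews");
    }

    // Called once per event. If Flume redelivers an event (at-least-once
    // semantics), this runs again for the same page view and the stored
    // count drifts above the true count, which is the divergence described
    // above.
    public void recordView(String pageId, long eventTimeMs) throws Exception {
        long minuteBucket = eventTimeMs / 60000L;      // roll up per minute
        byte[] row = Bytes.toBytes(pageId + ":" + minuteBucket);
        table.incrementColumnValue(row, Bytes.toBytes("c"),
                Bytes.toBytes("views"), 1L);
    }
}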

If you don't need the aggregate data in real time for your application, I 
suggest using MapReduce to de-duplicate and aggregate your events in order to 
generate your reports or downstream data. In this case, you would simply store 
all of your events directly in HDFS using Flume and then use a traditional 
batch-processing model (which can be pretty fast, on the order of 5-minute 
updates, with something like Oozie managing your workflows).
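
As a rough sketch of that batch pass (again untested, and assuming each event 
carries a unique eventId and lands on HDFS as a tab-separated 
"eventId, pageId, timestampMs" line, which is an assumption about your data, 
not something Flume does for you), a single MapReduce job can de-duplicate and 
aggregate in one step by counting distinct event ids per page per minute:

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DedupPageViews {

    // Key each event by (pageId, minute) and carry the eventId along so the
    // reducer can de-duplicate.
    public static class DedupMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] f = line.toString().split("\t");
            if (f.length < 3) {
                return;                              // skip malformed records
            }
            long minute = Long.parseLong(f[2]) / 60000L;
            ctx.write(new Text(f[1] + ":" + minute), new Text(f[0]));
        }
    }

    // Count distinct eventIds per (pageId, minute), so events that Flume
    // happened to deliver more than once are only counted once.
    public static class DedupReducer extends Reducer<Text, Text, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<Text> eventIds, Context ctx)
                throws IOException, InterruptedException {
            Set<String> seen = new HashSet<String>();
            for (Text id : eventIds) {
                seen.add(id.toString());
            }
            ctx.write(key, new LongWritable(seen.size()));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "dedup page view counts");
        job.setJarByClass(DedupPageViews.class);
        job.setMapperClass(DedupMapper.class);
        job.setReducerClass(DedupReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The output of a job like this is what you would feed into your reports or 
downstream data.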

If you really need real-time metrics (I recommend doing this only if you're 
sure you need them), then I would suggest a batch + real-time architecture, in 
which your real-time system increments counters (for example, using Flume's 
HBase sink) and your MapReduce batch jobs come along after the fact to correct 
historical data that has been skewed by duplicates (they rewrite the history 
to reflect reality).
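
The correction half could be as simple as a final step in that MapReduce 
workflow that overwrites the counter cells with the de-duplicated counts. A 
rough sketch, reusing the same made-up table and column names as above; HBase 
counters are stored as 8-byte longs, so putting the corrected value back as a 
long keeps later increments working:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class CounterFixup {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "pageviews");
        // In reality you would loop over the batch job's output; a single row
        // is rewritten here just to show the write itself.
        String rowKey = args[0];            // "pageId:minuteBucket" row key
        long correctedCount = Long.parseLong(args[1]);
        Put put = new Put(Bytes.toBytes(rowKey));
        // store as an 8-byte long so incrementColumnValue keeps working on it
        put.add(Bytes.toBytes("c"), Bytes.toBytes("views"),
                Bytes.toBytes(correctedCount));
        table.put(put);
        table.close();
    }
}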

There is a presentation by Nathan Marz that describes a batch + real-time 
architecture here: 
http://www.slideshare.net/nathanmarz/the-secrets-of-building-realtime-big-data-systems

Hope that helps.

Mike

On Wednesday, May 23, 2012 at 8:17 AM, S Ahmed wrote: 
> Is this something that is possible w/o altering the source?
> Is it a good idea?
> 
> On Fri, May 18, 2012 at 12:37 PM, S Ahmed <[email protected]> wrote:
> > If I am storing page view statistics using flume, which flushes to a file 
> > every x seconds, is there an event/hook available that I can use to 
> > pre-filter the collection?
> > 
> > Example:
> > 
> > I want to roll-up the stats for a given minute, this will reduce the # of 
> > messages substantially. (I can filter using a dictionary where the 
> > sessionId is the key) 
