Hi Ahmed,

There are some flaws in the approach you are describing. Depending on how accurate you need the numbers to be, you may or may not care about them.
Using a streaming system like Flume to aggregate data can be problematic, because the streaming system doesn't see all the events at once; at best you would be able to increment counters in an online storage system (such as HBase). Also, because Flume guarantees delivery using at-least-once semantics, you can get duplicate events, and it is not straightforward to de-duplicate events in a streaming system. From a Flume perspective, we strive to keep duplicates rare and to fix bugs that cause an excessive number of them, but in general duplicates are within Flume's contract and will occasionally occur. To sum up: if you simply increment counters without de-duplicating events, your numbers will diverge from reality over time.

If you don't need the aggregate data in real time for your application, I suggest using MapReduce to de-duplicate and aggregate your events in order to generate your reports or downstream data. In this case, you would simply store all of your events on HDFS directly using Flume and then use a traditional batch-processing model (which can be pretty fast, on the order of 5-minute updates, with something like Oozie managing your workflows).

If you really need real-time metrics (I recommend this only if you're sure you need it), then I would suggest a batch + real-time architecture, in which your real-time system increments counters (for example, using Flume's HBase sink) and your MapReduce batch jobs come along after the fact to correct historical data that has been skewed by duplicates (they rewrite the history to reflect reality). There is a presentation by Nathan Marz that describes such a batch + real-time architecture here: http://www.slideshare.net/nathanmarz/the-secrets-of-building-realtime-big-data-systems

Hope that helps.

Mike

On Wednesday, May 23, 2012 at 8:17 AM, S Ahmed wrote:

> Is this something that is possible w/o altering the source?
> Is it a good idea?
>
> On Fri, May 18, 2012 at 12:37 PM, S Ahmed <[email protected]> wrote:
> > If I am storing page view statistics using Flume, which flushes to a file
> > every x seconds, is there an event/hook available with which I can pre-filter
> > the collection?
> >
> > Example:
> >
> > I want to roll up the stats for a given minute; this will reduce the # of
> > messages substantially. (I can filter using a dictionary where the
> > sessionId is the key.)
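The batch de-duplicate-and-aggregate step Mike recommends could be sketched like this. This is a minimal plain-Python illustration of the logic, not an actual Hadoop MapReduce job, and the `event_id` and `page` field names are hypothetical (your events would need some unique identifier for de-duplication to work):

```python
from collections import Counter

def dedupe_and_count(events):
    """De-duplicate events by a unique event ID, then count views per page.

    Flume's at-least-once delivery means the same event can arrive more
    than once; a batch job that sees the whole dataset can drop the
    repeats before aggregating, which a streaming counter cannot.
    """
    seen = set()
    counts = Counter()
    for event in events:
        if event["event_id"] in seen:
            continue  # duplicate delivery from a retry; skip it
        seen.add(event["event_id"])
        counts[event["page"]] += 1
    return counts

events = [
    {"event_id": "e1", "page": "/home"},
    {"event_id": "e2", "page": "/about"},
    {"event_id": "e1", "page": "/home"},  # duplicate from a Flume retry
]
print(dedupe_and_count(events))  # /home: 1, /about: 1 (duplicate dropped)
```

In a real MapReduce job, the event ID would be the map key and the reducer would emit each event once before the aggregation pass.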
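The "batch corrects real-time" merge in the batch + real-time architecture could be sketched as follows. This is an assumption-laden illustration (the per-minute bucket keys and counts are made up): wherever the batch job has produced a de-duplicated result for a time bucket, it overwrites the approximate real-time counter; buckets the batch hasn't reached yet keep their real-time values.

```python
def corrected_counts(realtime, batch):
    """Merge the two views: batch results are authoritative for any time
    bucket they cover, because they were computed after de-duplication;
    newer buckets fall back to the (possibly inflated) real-time counters.
    """
    merged = dict(realtime)   # start from the real-time view
    merged.update(batch)      # batch values overwrite where available
    return merged

realtime = {"2012-05-23T08:00": 105, "2012-05-23T08:05": 42}  # may include dupes
batch = {"2012-05-23T08:00": 100}  # batch job has corrected this bucket
print(corrected_counts(realtime, batch))
```

This is the essence of "rewriting history to reflect reality": real-time numbers are treated as provisional until the batch layer replaces them.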
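For reference, the per-minute roll-up Ahmed asks about in the quoted message (filtering with a dictionary keyed by sessionId) might look like the sketch below. It is not a Flume hook, just the filtering logic in plain Python, and the `sessionId` and `page` field names follow the quoted description; note it inherits the duplicate-counting caveat Mike raises, since a retried event would pass the filter twice if the session/page pair were evicted between windows.

```python
from collections import defaultdict

def rollup_minute(events):
    """Pre-aggregate one minute's page views, counting at most one hit
    per (sessionId, page) pair, so far fewer messages are emitted
    downstream than raw events received."""
    seen = set()
    counts = defaultdict(int)
    for event in events:
        key = (event["sessionId"], event["page"])
        if key in seen:
            continue  # this session already viewed this page in the window
        seen.add(key)
        counts[event["page"]] += 1
    return dict(counts)

minute_of_events = [
    {"sessionId": "s1", "page": "/home"},
    {"sessionId": "s1", "page": "/home"},  # repeat view within the minute
    {"sessionId": "s2", "page": "/home"},
]
print(rollup_minute(minute_of_events))  # {'/home': 2}
```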
