I see, ok. Thanks for the clarifications.

On Wed, May 23, 2012 at 2:32 PM, Mike Percy <[email protected]> wrote:
> Hi Ahmed,
>
> There are some flaws with what you are describing. Depending on how
> accurate you want it to be, you may or may not care about them.
>
> Using a streaming system like Flume to aggregate data can be problematic
> because the streaming system doesn't see all the events at once, so at best
> you would be able to increment counters or something in an online storage
> system (such as HBase). Also, because Flume guarantees delivery using
> at-least-once semantics, you can get duplicate events, and it's not
> straightforward to de-duplicate events in a streaming system. From a Flume
> perspective, we strive for duplicates to be rare and to fix bugs that cause
> an excessive number of duplicates, but in general they are within Flume's
> contract and they will occasionally occur. To sum this up, if you are
> simply incrementing counters without de-duplicating events, your numbers
> will diverge from reality over time.
>
> If you don't need the aggregate data in real time for your application, I
> suggest using MapReduce to de-duplicate and aggregate your events in order
> to generate your reports or downstream data. In this case, you would simply
> store all of your events onto HDFS directly using Flume and then use a
> traditional batch-processing model (which can be pretty fast, on the order
> of 5-minute updates, with something like Oozie managing your workflows).
>
> If you really need real-time metrics (I recommend doing this only if
> you're sure you need it), then I would suggest using a Batch + Realtime
> architecture, in which your real-time system increments counters (for
> example, using Flume's HBase sink) and then your MapReduce batch jobs come
> along after the fact to correct historical data that has been skewed by
> duplicates (they rewrite the history to reflect reality).
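[The de-duplicate-then-aggregate step Mike describes could be sketched roughly like this. This is a toy in-memory version of what the MapReduce job would do; the `eventId` and `page` field names are assumptions for illustration, not part of Flume's event schema.]

```python
from collections import Counter

def dedupe_and_count(events):
    """De-duplicate events by a unique id, then count page views per page.

    `events` is an iterable of dicts like {"eventId": ..., "page": ...};
    both field names are hypothetical. In a real pipeline this logic would
    run as a MapReduce job over the events stored on HDFS.
    """
    seen = set()
    counts = Counter()
    for event in events:
        if event["eventId"] in seen:
            continue  # at-least-once delivery: drop the redelivered duplicate
        seen.add(event["eventId"])
        counts[event["page"]] += 1
    return counts

events = [
    {"eventId": "a1", "page": "/home"},
    {"eventId": "a1", "page": "/home"},  # duplicate delivery of a1
    {"eventId": "b2", "page": "/about"},
]
# dedupe_and_count(events) -> Counter({"/home": 1, "/about": 1})
```

[A naive counter that skipped the `seen` check would report `/home: 2` here, which is exactly the divergence-over-time problem described above.]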
> There is a presentation by Nathan Marz that describes a batch + real-time
> architecture here:
> http://www.slideshare.net/nathanmarz/the-secrets-of-building-realtime-big-data-systems
>
> Hope that helps.
>
> Mike
>
> On Wednesday, May 23, 2012 at 8:17 AM, S Ahmed wrote:
>
> > Is this something that is possible w/o altering the source?
> > Is it a good idea?
> >
> > On Fri, May 18, 2012 at 12:37 PM, S Ahmed <[email protected]> wrote:
> >
> > > If I am storing page view statistics using flume, which flushes to a
> > > file every x seconds, is there an event/hook available so that I can
> > > pre-filter the collection?
> > >
> > > Example:
> > >
> > > I want to roll up the stats for a given minute; this will reduce the
> > > number of messages substantially. (I can filter using a dictionary
> > > where the sessionId is the key)
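[The per-minute roll-up Ahmed asks about could be sketched like this. It is a toy in-memory version; in practice the logic would live in a custom Flume component or a downstream job, and the `timestamp`/`sessionId` field names are assumptions for illustration.]

```python
from collections import defaultdict

def rollup_by_minute(events):
    """Collapse per-view events into one count per (minute, sessionId).

    `events` is an iterable of dicts with an epoch-second "timestamp" and a
    "sessionId" (both hypothetical field names). Returns a dict mapping
    (minute_start_epoch, sessionId) -> view count.
    """
    counts = defaultdict(int)
    for event in events:
        # Truncate the timestamp down to the start of its minute.
        minute = event["timestamp"] - event["timestamp"] % 60
        counts[(minute, event["sessionId"])] += 1
    return dict(counts)

events = [
    {"timestamp": 1337789520, "sessionId": "s1"},
    {"timestamp": 1337789530, "sessionId": "s1"},  # same minute, same session
    {"timestamp": 1337789550, "sessionId": "s2"},
]
# Three raw events collapse to two rolled-up records:
# {(1337789520, "s1"): 2, (1337789520, "s2"): 1}
```

[Note that this pre-aggregation makes the duplicate problem from earlier in the thread worse, not better: once views are rolled up into counts, a redelivered batch can no longer be de-duplicated by event id, which is why Mike suggests landing raw events on HDFS and aggregating in batch.]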
