Let's also add to the story: say the company wants to write its processing
code once, for Samza, and not duplicate that logic in MapReduce jobs (or any
other framework).
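To make that concrete, here is the shape of the code I have in mind (a
minimal sketch; the class names, the "kafka" system/stream names, and the
UserActivityLogic helper are all made-up placeholders, not from any real
codebase). The idea is to keep the actual event processing in a plain Java
class and make the Samza StreamTask a thin shim around it, so that one class
is the only processing code anyone has to write or maintain:

    import org.apache.samza.system.IncomingMessageEnvelope;
    import org.apache.samza.system.OutgoingMessageEnvelope;
    import org.apache.samza.system.SystemStream;
    import org.apache.samza.task.MessageCollector;
    import org.apache.samza.task.StreamTask;
    import org.apache.samza.task.TaskCoordinator;

    public class UserActivityTask implements StreamTask {
      // Hypothetical output stream; substitute real system/stream names.
      private static final SystemStream OUTPUT =
          new SystemStream("kafka", "user-activity-processed");

      @Override
      public void process(IncomingMessageEnvelope envelope,
                          MessageCollector collector,
                          TaskCoordinator coordinator) {
        // All the real work happens in framework-agnostic code...
        Object result = UserActivityLogic.process(envelope.getMessage());
        // ...and the task itself is only Samza plumbing.
        collector.send(new OutgoingMessageEnvelope(OUTPUT, result));
      }
    }

    // Plain Java with no Samza (or MapReduce) imports, so it can be reused
    // from tests, a backfill tool, or anywhere else.
    final class UserActivityLogic {
      static Object process(Object event) {
        return event; // e.g. parse/enrich the pageview or impression here
      }
    }

If we later need to run the same logic over the HDFS history somehow,
UserActivityLogic is the piece we'd want to reuse as-is.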
On Fri, May 29, 2015 at 4:16 PM Benjamin Black <b...@b3k.us> wrote:

> Why not run a map reduce job on the data in hdfs? That's what it was made
> for.
>
> On May 29, 2015 2:13 PM, "Zach Cox" <zcox...@gmail.com> wrote:
>
> > Hi -
> >
> > Let's say one day a company wants to start doing all of this awesome
> > data integration/near-real-time stream processing stuff, so they start
> > sending their user activity events (e.g. pageviews, ad impressions,
> > etc) to Kafka. Then they hook up Camus to copy new events from Kafka
> > to HDFS every hour. They use the default Kafka log retention period of
> > 7 days. So after a few months, Kafka has the last 7 days of events,
> > and HDFS has all events except the newest events not yet transferred
> > by Camus.
> >
> > Then the company wants to build out a system that uses Samza to
> > process the user activity events from Kafka and output them to some
> > queryable data store. If standard Samza reprocessing [1] is used, then
> > only the last 7 days of events in Kafka get processed and put into the
> > data store. Of course, then all future events also seamlessly get
> > processed by the Samza jobs and put into the data store, which is
> > awesome.
> >
> > But let's say this company needs all of the historical events to be
> > processed by Samza and put into the data store (i.e. the events older
> > than 7 days that are in HDFS but no longer in Kafka). It's a Business
> > Critical thing and absolutely must happen. How should this company
> > achieve this?
> >
> > I'm sure there are many potential solutions to this problem, but has
> > anyone actually done this? What approach did you take?
> >
> > Any experiences or thoughts would be hugely appreciated.
> >
> > Thanks,
> > Zach
> >
> > [1] http://samza.apache.org/learn/documentation/0.9/jobs/reprocessing.html
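For anyone following along, by "standard reprocessing" I mean what [1]
describes: reset the job's checkpoints and tell Samza to start again from
the oldest offset Kafka still has. As I read the 0.9 docs, that boils down
to roughly this configuration (the "kafka" system and "user-activity"
stream names here are made up):

    # Start from the oldest offset Kafka still retains (7 days back here)...
    systems.kafka.streams.user-activity.samza.offset.default=oldest
    # ...and ignore previously checkpointed offsets on the next deployment.
    systems.kafka.streams.user-activity.samza.reset.offset=true

Which is exactly why that approach only reaches back 7 days: anything older
has already been expired out of the Kafka log and now exists only in HDFS.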