I think the application would want to replay historical events into Samza, i.e.
the application can replay any events older than X days from HDFS into Samza.
Once Samza has processed the historical events, the application can switch its
input to the Kafka queue to process the more recent and, finally, the
currently-arriving events. This way the Samza code can stay the same.
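A minimal sketch of how this could look in Samza 0.9 job configuration, assuming the historical events are first copied from HDFS into a temporary Kafka "replay" topic (topic names and the task class here are hypothetical, not from the thread):

```properties
# Phase 1: replay job -- same task class, pointed at the replay topic
# that was filled with the historical events from HDFS.
job.name=user-activity-replay
task.class=com.example.UserActivityTask
task.inputs=kafka.user-activity-replay

# Phase 2: once replay has caught up, redeploy the same task class
# against the live topic (uncomment and resubmit):
# job.name=user-activity
# task.inputs=kafka.user-activity-events
```

Because only `task.inputs` (and the job name) change between the two phases, the StreamTask implementation itself stays untouched, which is the point of the suggestion above.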
Best regards,
Tom

From: Zach Cox <zcox...@gmail.com>
 To: dev@samza.apache.org 
 Sent: Friday, May 29, 2015 5:20 PM
 Subject: Re: Reprocessing old events no longer in Kafka
   
Let's also add to the story: say the company wants to only write code for
Samza, and not duplicate the same code in MapReduce jobs (or any other
framework).



On Fri, May 29, 2015 at 4:16 PM Benjamin Black <b...@b3k.us> wrote:

> Why not run a MapReduce job on the data in HDFS? That's what it was made for.
> On May 29, 2015 2:13 PM, "Zach Cox" <zcox...@gmail.com> wrote:
>
> > Hi -
> >
> > Let's say one day a company wants to start doing all of this awesome data
> > integration/near-real-time stream processing stuff, so they start sending
> > their user activity events (e.g. pageviews, ad impressions, etc.) to Kafka.
> > Then they hook up Camus to copy new events from Kafka to HDFS every hour.
> > They use the default Kafka log retention period of 7 days. So after a few
> > months, Kafka has the last 7 days of events, and HDFS has all events
> > except the newest events not yet transferred by Camus.
> >
> > Then the company wants to build out a system that uses Samza to process
> > the user activity events from Kafka and output them to some queryable data
> > store. If standard Samza reprocessing [1] is used, then only the last 7
> > days of events in Kafka get processed and put into the data store. Of
> > course, then all future events also seamlessly get processed by the Samza
> > jobs and put into the data store, which is awesome.
> >
> > But let's say this company needs all of the historical events to be
> > processed by Samza and put into the data store (i.e. the events older than
> > 7 days that are in HDFS but no longer in Kafka). It's a Business Critical
> > thing and absolutely must happen. How should this company achieve this?
> >
> > I'm sure there are many potential solutions to this problem, but has
> > anyone actually done this? What approach did you take?
> >
> > Any experiences or thoughts would be hugely appreciated.
> >
> > Thanks,
> > Zach
> >
> > [1] http://samza.apache.org/learn/documentation/0.9/jobs/reprocessing.html
>