Hi Felix - that sounds like a very interesting approach! I was trying to
think through how to dump the events HDFS => Kafka but got stuck on the
transition point; I hadn't even thought about using the offsets from Camus.
I will definitely investigate further.
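
In case it's useful to anyone following along, here is a minimal sketch of
how I understand the ZK hand-off part, assuming the Kafka 0.8 high-level
consumer layout (/consumers/<group>/offsets/<topic>/<partition>). The
object name and the offsets map are just placeholders - camus2kafka
derives the real values from the Camus execution history on HDFS:

  import org.apache.zookeeper.{CreateMode, ZooDefs, ZooKeeper}

  object CamusOffsetHandoff {

    // Write one offset per partition under the high-level consumer
    // layout, so any consumer in `group` resumes from those offsets.
    def writeOffsets(zk: ZooKeeper, group: String, topic: String,
                     offsets: Map[Int, Long]): Unit = {
      offsets.foreach { case (partition, offset) =>
        val path = s"/consumers/$group/offsets/$topic/$partition"
        val data = offset.toString.getBytes("UTF-8")
        if (zk.exists(path, false) == null) createRecursive(zk, path, data)
        else zk.setData(path, data, -1) // -1 = ignore znode version
      }
    }

    // Create parent znodes as needed, then the leaf holding the offset.
    private def createRecursive(zk: ZooKeeper, path: String,
                                data: Array[Byte]): Unit = {
      val parent = path.substring(0, path.lastIndexOf('/'))
      if (parent.nonEmpty && zk.exists(parent, false) == null)
        createRecursive(zk, parent, Array.empty[Byte])
      zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT)
    }
  }

Stitching onto Samza instead would presumably mean handing these offsets to
the job's checkpoint mechanism rather than ZK, which is the part I still
need to dig into.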
Thanks,
Zach

On Fri, May 29, 2015 at 4:25 PM Felix GV <fville...@linkedin.com.invalid>
wrote:

> We developed camus2kafka at my previous job for this purpose of re-pushing
> events to Kafka: https://github.com/mate1/camus2kafka
>
> Using camus2kafka, it was then possible to "stitch" the consumers (after
> they're done re-processing the "replay topic") back onto the regular topic
> at the precise offsets where Camus had stopped ingesting, although that
> part was coded to integrate with the ZK-based high-level Kafka consumers,
> not with Samza consumers.
>
> The part of the code that extracts the relevant offsets from Camus and
> writes them to ZK is here:
> https://github.com/mate1/camus2kafka/blob/master/src/main/scala/com/mate1/camus2kafka/utils/KafkaUtils.scala#L168
>
> So I presume it should be relatively easy to re-purpose that part to hand
> off the offsets to Samza instead.
>
> --
>
> Felix GV
> Data Infrastructure Engineer
> Distributed Data Systems
> LinkedIn
>
> f...@linkedin.com
> linkedin.com/in/felixgv
>
> ________________________________________
> From: Zach Cox [zcox...@gmail.com]
> Sent: Friday, May 29, 2015 2:13 PM
> To: dev@samza.apache.org
> Subject: Reprocessing old events no longer in Kafka
>
> Hi -
>
> Let's say one day a company wants to start doing all of this awesome data
> integration/near-real-time stream processing stuff, so they start sending
> their user activity events (e.g. pageviews, ad impressions, etc.) to
> Kafka. Then they hook up Camus to copy new events from Kafka to HDFS every
> hour. They use the default Kafka log retention period of 7 days. So after
> a few months, Kafka has the last 7 days of events, and HDFS has all events
> except the newest events not yet transferred by Camus.
>
> Then the company wants to build out a system that uses Samza to process
> the user activity events from Kafka and output them to some queryable data
> store. If standard Samza reprocessing [1] is used, then only the last 7
> days of events in Kafka get processed and put into the data store. Of
> course, all future events also seamlessly get processed by the Samza jobs
> and put into the data store, which is awesome.
>
> But let's say this company needs all of the historical events to be
> processed by Samza and put into the data store (i.e. the events older than
> 7 days that are in HDFS but no longer in Kafka). It's a Business Critical
> thing and absolutely must happen. How should this company achieve this?
>
> I'm sure there are many potential solutions to this problem, but has
> anyone actually done this? What approach did you take?
>
> Any experiences or thoughts would be hugely appreciated.
>
> Thanks,
> Zach
>
> [1] http://samza.apache.org/learn/documentation/0.9/jobs/reprocessing.html
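
P.S. For anyone who finds this thread later: my reading of the "standard
Samza reprocessing" referenced in [1] is that it boils down to two
per-stream config overrides, roughly like the following (the stream name
is just an example):

  systems.kafka.streams.user-activity.samza.reset.offset=true
  systems.kafka.streams.user-activity.samza.offset.default=oldest

On its next deployment the job rewinds to the oldest offset still retained
in Kafka, which is exactly why this approach can only reach back 7 days in
the scenario above (and if I read the doc correctly, reset.offset should be
removed again afterwards, or the job will rewind on every restart).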