We developed camus2kafka at my previous job for precisely this purpose of re-pushing events from HDFS back into Kafka: https://github.com/mate1/camus2kafka
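Conceptually it's simple: read the Camus output files back out of HDFS and re-publish every record into a dedicated "replay topic". Here's a simplified sketch of that idea -- illustrative only, not the actual camus2kafka code. It assumes a hypothetical flat directory of BytesWritable sequence files and a local broker; the real layout and record format depend entirely on how Camus was configured:

    import java.util.Properties

    import kafka.producer.{KeyedMessage, Producer, ProducerConfig}
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.hadoop.io.{BytesWritable, SequenceFile}

    object ReplayFromHdfs extends App {
      // Hypothetical locations -- adjust to the actual Camus output layout.
      val camusOutput = new Path("/camus/topics/user-activity")
      val replayTopic = "user-activity-replay"

      // The 0.8-era Scala producer, matching the Kafka versions of that time.
      val props = new Properties()
      props.put("metadata.broker.list", "localhost:9092")
      props.put("serializer.class", "kafka.serializer.DefaultEncoder")
      props.put("request.required.acks", "1")
      val producer = new Producer[Array[Byte], Array[Byte]](new ProducerConfig(props))

      val fs = FileSystem.get(new Configuration())

      // Re-publish every record from every file Camus wrote for this topic,
      // in file-name order so the replay is roughly chronological.
      for (status <- fs.listStatus(camusOutput).sortBy(_.getPath.getName)) {
        val reader = new SequenceFile.Reader(fs, status.getPath, fs.getConf)
        val key = new BytesWritable()
        val value = new BytesWritable()
        try {
          while (reader.next(key, value)) {
            producer.send(new KeyedMessage(replayTopic, key.copyBytes(), value.copyBytes()))
          }
        } finally {
          reader.close()
        }
      }

      producer.close()
    }

The important bit is that the replay goes into a separate topic rather than the original one, so the live topic's offsets stay meaningful for the stitching step described below.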
Using camus2kafka, it was then possible to "stitch" the consumers (after they're done re-processing the "replay topic") back onto the regular topic at the precise offsets where Camus had stopped ingesting. That part was coded to integrate with the ZK-based high-level Kafka consumers, though, not with Samza consumers.

The part of the code that extracts the relevant offsets from Camus and writes them to ZK is here:

https://github.com/mate1/camus2kafka/blob/master/src/main/scala/com/mate1/camus2kafka/utils/KafkaUtils.scala#L168

So I presume it should be relatively easy to re-purpose that part to hand off the offsets to Samza instead.

--
Felix GV
Data Infrastructure Engineer
Distributed Data Systems
LinkedIn

f...@linkedin.com
linkedin.com/in/felixgv

________________________________________
From: Zach Cox [zcox...@gmail.com]
Sent: Friday, May 29, 2015 2:13 PM
To: dev@samza.apache.org
Subject: Reprocessing old events no longer in Kafka

Hi -

Let's say one day a company wants to start doing all of this awesome data integration/near-real-time stream processing stuff, so they start sending their user activity events (e.g. pageviews, ad impressions, etc.) to Kafka. Then they hook up Camus to copy new events from Kafka to HDFS every hour. They use the default Kafka log retention period of 7 days. So after a few months, Kafka has the last 7 days of events, and HDFS has all events except the newest ones not yet transferred by Camus.

Then the company wants to build out a system that uses Samza to process the user activity events from Kafka and output them to some queryable data store. If standard Samza reprocessing [1] is used, then only the last 7 days of events in Kafka get processed and put into the data store. Of course, all future events then also seamlessly get processed by the Samza jobs and put into the data store, which is awesome.

But let's say this company needs all of the historical events to be processed by Samza and put into the data store, i.e. the events older than 7 days that are in HDFS but no longer in Kafka. It's a Business Critical thing and absolutely must happen. How should this company achieve this?

I'm sure there are many potential solutions to this problem, but has anyone actually done this? What approach did you take? Any experiences or thoughts would be hugely appreciated.

Thanks,
Zach

[1] http://samza.apache.org/learn/documentation/0.9/jobs/reprocessing.html
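For concreteness, the "standard Samza reprocessing" described in [1] is mostly a configuration affair: you redeploy the job, typically under a new name so it starts with no checkpoints, and tell Samza to read each input stream from the oldest offset Kafka still retains. Roughly, with hypothetical system and stream names:

    # Ignore any checkpointed position and start from the oldest offset
    # still retained in Kafka. System/stream names are hypothetical.
    systems.kafka.streams.user-activity.samza.reset.offset=true
    systems.kafka.streams.user-activity.samza.offset.default=oldest

That is exactly why this approach can only reach the last 7 days of events: "oldest" can never be older than the Kafka retention window.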