We developed camus2kafka at my previous job for exactly this purpose of
re-pushing events from HDFS back into Kafka: https://github.com/mate1/camus2kafka

Using camus2kafka, it was then possible to "stitch" the consumers (once they
were done re-processing the "replay topic") back onto the regular topic at the
precise offsets where Camus had stopped ingesting. That part was coded to
integrate with the ZK-based high-level Kafka consumers, though, not with Samza
consumers.
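
For anyone who hasn't poked at the old high-level consumer's internals, here
is a minimal sketch of what that stitching amounts to (this is not the actual
camus2kafka code; the OffsetStitcher name and scaffolding are made up): with
the consumer group shut down, you write the target offsets under the group's
ZK offset path, and the group picks them up on restart.

    import org.I0Itec.zkclient.ZkClient
    import org.I0Itec.zkclient.serialize.ZkSerializer

    object OffsetStitcher {
      // The high-level consumer stores offsets as stringified longs,
      // so a plain UTF-8 string serializer is enough.
      private val stringSerializer = new ZkSerializer {
        def serialize(data: Object): Array[Byte] =
          data.toString.getBytes("UTF-8")
        def deserialize(bytes: Array[Byte]): Object =
          if (bytes == null) null else new String(bytes, "UTF-8")
      }

      // Writes one offset per partition to the group's ZK offset path:
      // /consumers/<group>/offsets/<topic>/<partition>
      def writeOffsets(zkConnect: String, group: String, topic: String,
                       offsets: Map[Int, Long]): Unit = {
        val zk = new ZkClient(zkConnect, 30000, 30000, stringSerializer)
        try {
          offsets.foreach { case (partition, offset) =>
            val path = s"/consumers/$group/offsets/$topic/$partition"
            if (!zk.exists(path)) zk.createPersistent(path, true)
            zk.writeData(path, offset.toString)
          }
        } finally {
          zk.close()
        }
      }
    }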

The part of the code that extracts the relevant offsets from Camus and writes 
them to ZK is here: 
https://github.com/mate1/camus2kafka/blob/master/src/main/scala/com/mate1/camus2kafka/utils/KafkaUtils.scala#L168

So I presume it should be relatively easy to re-purpose that part to hand off 
the offsets to Samza instead.
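
A rough sketch of what that Samza hand-off could look like, assuming you go
through Samza's CheckpointTool (mentioned in the reprocessing docs) instead of
ZK: dump the Camus-derived offsets into the properties format the tool
consumes. The SamzaOffsetHandoff name is made up, and the "Partition <n>"
taskname convention assumes the default partition-based task grouping, so
double-check both against your Samza version.

    import java.io.PrintWriter

    object SamzaOffsetHandoff {
      // Formats Camus-derived offsets as a CheckpointTool properties file.
      def writeCheckpointProps(file: String, system: String, stream: String,
                               offsets: Map[Int, Long]): Unit = {
        val out = new PrintWriter(file, "UTF-8")
        try {
          offsets.foreach { case (partition, offset) =>
            // One line per partition, e.g.:
            // tasknames.Partition 0.systems.kafka.streams.my-stream.partitions.0=1234
            out.println(
              s"tasknames.Partition $partition.systems.$system.streams.$stream.partitions.$partition=$offset")
          }
        } finally {
          out.close()
        }
      }
    }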

--

Felix GV
Data Infrastructure Engineer
Distributed Data Systems
LinkedIn

f...@linkedin.com
linkedin.com/in/felixgv

________________________________________
From: Zach Cox [zcox...@gmail.com]
Sent: Friday, May 29, 2015 2:13 PM
To: dev@samza.apache.org
Subject: Reprocessing old events no longer in Kafka

Hi -

Let's say one day a company wants to start doing all of this awesome data
integration/near-real-time stream processing stuff, so they start sending
their user activity events (e.g. pageviews, ad impressions, etc) to Kafka.
Then they hook up Camus to copy new events from Kafka to HDFS every hour.
They use the default Kafka log retention period of 7 days. So after a few
months, Kafka has the last 7 days of events, and HDFS has all events except
the newest events not yet transferred by Camus.

Then the company wants to build out a system that uses Samza to process the
user activity events from Kafka and output them to some queryable data store.
If standard Samza reprocessing [1] is used, then only the last 7 days of
events in Kafka get processed and put into the data store. Of course, then
all future events also seamlessly get processed by the Samza jobs and put
into the data store, which is awesome.
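
(By "standard reprocessing" I mean something along the lines of resetting the
job's checkpoints so it starts from the oldest offsets still in Kafka, e.g.:

    systems.kafka.streams.my-stream.samza.reset.offset=true
    systems.kafka.streams.my-stream.samza.offset.default=oldest

with "kafka" and "my-stream" standing in for the actual system and stream
names, per the docs in [1].)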

But let's say this company needs all of the historical events to be
processed by Samza and put into the data store (i.e. the events older than
7 days that are in HDFS but no longer in Kafka). It's a Business Critical
thing and absolutely must happen. How should this company achieve this?

I'm sure there are many potential solutions to this problem, but has anyone
actually done this? What approach did you take?

Any experiences or thoughts would be hugely appreciated.

Thanks,
Zach

[1] http://samza.apache.org/learn/documentation/0.9/jobs/reprocessing.html
