Hi Felix - that sounds like a very interesting approach! I was trying to
think through how to dump the events hdfs => kafka but got stuck on the
transition point, didn't even think about using the offsets from camus. I
will definitely investigate further.

Thanks,
Zach


On Fri, May 29, 2015 at 4:25 PM Felix GV <fville...@linkedin.com.invalid>
wrote:

> We developed camus2kafka at my previous job for this purpose of re-pushing
> events to Kafka: https://github.com/mate1/camus2kafka
>
> Using camus2kafka, it was then possible to "stitch" the consumers (after
> they're done re-processing the "replay topic") back onto the regular topic
> at the precise offsets where Camus had stopped ingesting, although that
> part was coded to integrate with the ZK-based high-level Kafka consumers,
> not with Samza consumers.
>
> The part of the code that extracts the relevant offsets from Camus and
> writes them to ZK is here:
> https://github.com/mate1/camus2kafka/blob/master/src/main/scala/com/mate1/camus2kafka/utils/KafkaUtils.scala#L168
>
> So I presume it should be relatively easy to re-purpose that part to hand
> off the offsets to Samza instead.
>
> --
>
> Felix GV
> Data Infrastructure Engineer
> Distributed Data Systems
> LinkedIn
>
> f...@linkedin.com
> linkedin.com/in/felixgv
>
> ________________________________________
> From: Zach Cox [zcox...@gmail.com]
> Sent: Friday, May 29, 2015 2:13 PM
> To: dev@samza.apache.org
> Subject: Reprocessing old events no longer in Kafka
>
> Hi -
>
> Let's say one day a company wants to start doing all of this awesome data
> integration/near-real-time stream processing stuff, so they start sending
> their user activity events (e.g. pageviews, ad impressions, etc) to Kafka.
> Then they hook up Camus to copy new events from Kafka to HDFS every hour.
> They use the default Kafka log retention period of 7 days. So after a few
> months, Kafka has the last 7 days of events, and HDFS has all events except
> the newest events not yet transferred by Camus.
>
> Then the company wants to build out a system that uses Samza to process the
> user activity events from Kafka and output it to some queryable data store.
> If standard Samza reprocessing [1] is used, then only the last 7 days of
> events in Kafka get processed and put into the data store. Of course, then
> all future events also seamlessly get processed by the Samza jobs and put
> into the data store, which is awesome.
>
> But let's say this company needs all of the historical events to be
> processed by Samza and put into the data store (i.e. the events older than
> 7 days that are in HDFS but no longer in Kafka). It's a Business Critical
> thing and absolutely must happen. How should this company achieve this?
>
> I'm sure there are many potential solutions to this problem, but has anyone
> actually done this? What approach did you take?
>
> Any experiences or thoughts would be hugely appreciated.
>
> Thanks,
> Zach
>
> [1] http://samza.apache.org/learn/documentation/0.9/jobs/reprocessing.html
>

Reply via email to