Hi Navina,

Do you mean bootstrapping from HDFS as in [1]? That is an interesting idea
I hadn't thought of. Maybe that could be combined with the offsets stored
by Camus to determine the right place to transition to the real-time Kafka
stream?
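Concretely, I'm picturing something along the lines of the bootstrap config
in [1], pointed at a topic that a periodic job fills from HDFS. Rough sketch
only; the system and topic names here are just placeholders:

  # "activity-history" is a hypothetical topic filled from HDFS by a
  # periodic job; "activity" is the live event topic.
  task.inputs=kafka.activity-history,kafka.activity

  # Fully consume the historical topic, starting from its oldest available
  # offset, before processing messages from any other input stream.
  systems.kafka.streams.activity-history.samza.bootstrap=true
  systems.kafka.streams.activity-history.samza.reset.offset=true
  systems.kafka.streams.activity-history.samza.offset.default=oldest

The offsets recorded by Camus would presumably still be needed to figure out
where in the live topic to start, so events around the handoff aren't
skipped or double-processed.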
Thanks,
Zach

[1]
http://samza.apache.org/learn/documentation/0.9/container/streams.html#bootstrapping

On Fri, May 29, 2015 at 4:53 PM Navina Ramesh <nram...@linkedin.com.invalid>
wrote:

> Hi Zach,
>
> I agree. It is not a good idea to keep the entire set of historical data
> in Kafka.
> The retention period in Kafka does make it trickier to synchronize with
> your Hadoop data pump. I am not very familiar with the Camus2Kafka
> project. But that sounds like a workable solution.
>
> The ideal solution would be to consume/bootstrap directly from HDFS :)
>
> Cheers!
> Navina
>
> On 5/29/15, 2:44 PM, "Zach Cox" <zcox...@gmail.com> wrote:
>
> >Hi Navina,
> >
> >A similar approach I considered was using an infinite/very large
> >retention period on those event Kafka topics, so they would always
> >contain all historical events. Then standard Samza reprocessing goes
> >through all old events.
> >
> >I'm hesitant to pursue that, though, as those topic partitions then grow
> >unbounded over time, which seems problematic.
> >
> >Thanks,
> >Zach
> >
> >On Fri, May 29, 2015 at 4:34 PM Navina Ramesh
> ><nram...@linkedin.com.invalid>
> >wrote:
> >
> >> That said, since we don't yet support consuming from HDFS, one
> >> workaround would be to periodically read from HDFS and pump the data
> >> to a Kafka topic (say topic A) using a Hadoop / YARN based job. Then,
> >> in your Samza job, you can bootstrap from topic A and then continue
> >> processing the latest messages from the other Kafka topic.
> >>
> >> Thanks!
> >> Navina
> >>
> >> On 5/29/15, 2:26 PM, "Navina Ramesh" <nram...@linkedin.com> wrote:
> >>
> >> >Hi Zach,
> >> >
> >> >It sounds like you are asking for a SystemConsumer for HDFS. Does
> >> >SAMZA-263 match your requirements?
> >> >
> >> >Thanks!
> >> >Navina
> >> >
> >> >On 5/29/15, 2:23 PM, "Zach Cox" <zcox...@gmail.com> wrote:
> >> >
> >> >>(continuing from previous email) In addition to not wanting to
> >> >>duplicate code, say that some of the Samza jobs need to build up
> >> >>state, and it's important to build up this state from all of those
> >> >>old events no longer in Kafka. If that state was only built from the
> >> >>last 7 days of events, some things would be missing and the data
> >> >>would be incomplete.
> >> >>
> >> >>On Fri, May 29, 2015 at 4:20 PM Zach Cox <zcox...@gmail.com> wrote:
> >> >>
> >> >>> Let's also add to the story: say the company wants to only write
> >> >>> code for Samza, and not duplicate the same code in MapReduce jobs
> >> >>> (or any other framework).
> >> >>>
> >> >>> On Fri, May 29, 2015 at 4:16 PM Benjamin Black <b...@b3k.us> wrote:
> >> >>>
> >> >>>> Why not run a MapReduce job on the data in HDFS? That's what it
> >> >>>> was made for.
> >> >>>> On May 29, 2015 2:13 PM, "Zach Cox" <zcox...@gmail.com> wrote:
> >> >>>>
> >> >>>> > Hi -
> >> >>>> >
> >> >>>> > Let's say one day a company wants to start doing all of this
> >> >>>> > awesome data integration/near-real-time stream processing
> >> >>>> > stuff, so they start sending their user activity events (e.g.
> >> >>>> > pageviews, ad impressions, etc.) to Kafka. Then they hook up
> >> >>>> > Camus to copy new events from Kafka to HDFS every hour. They
> >> >>>> > use the default Kafka log retention period of 7 days. So after
> >> >>>> > a few months, Kafka has the last 7 days of events, and HDFS has
> >> >>>> > all events except the newest events not yet transferred by
> >> >>>> > Camus.
> >> >>>> >
> >> >>>> > Then the company wants to build out a system that uses Samza to
> >> >>>> > process the user activity events from Kafka and output them to
> >> >>>> > some queryable data store. If standard Samza reprocessing [1]
> >> >>>> > is used, then only the last 7 days of events in Kafka get
> >> >>>> > processed and put into the data store. Of course, all future
> >> >>>> > events then also seamlessly get processed by the Samza jobs and
> >> >>>> > put into the data store, which is awesome.
> >> >>>> >
> >> >>>> > But let's say this company needs all of the historical events
> >> >>>> > to be processed by Samza and put into the data store (i.e. the
> >> >>>> > events older than 7 days that are in HDFS but no longer in
> >> >>>> > Kafka). It's a Business Critical thing and absolutely must
> >> >>>> > happen. How should this company achieve this?
> >> >>>> >
> >> >>>> > I'm sure there are many potential solutions to this problem,
> >> >>>> > but has anyone actually done this? What approach did you take?
> >> >>>> >
> >> >>>> > Any experiences or thoughts would be hugely appreciated.
> >> >>>> >
> >> >>>> > Thanks,
> >> >>>> > Zach
> >> >>>> >
> >> >>>> > [1]
> >> >>>> > http://samza.apache.org/learn/documentation/0.9/jobs/reprocessing.html
> >> >>>> >
> >> >>>>
> >> >>>
> >> >>
> >> >
> >>
> >
>