Hi Navina,

A similar approach I considered was using an infinite (or very large)
retention period on those event Kafka topics, so they would always contain
all historical events. Standard Samza reprocessing would then go through
all of the old events.

I'm hesitant to pursue that though, as those topic partitions then grow
unbounded over time, which seems problematic.

Thanks,
Zach

On Fri, May 29, 2015 at 4:34 PM Navina Ramesh <nram...@linkedin.com.invalid>
wrote:

> That said, since we don't yet support consuming from HDFS, one workaround
> would be to periodically read from HDFS and pump the data to a Kafka topic
> (say topic A) using a Hadoop/YARN-based job. Then, in your Samza job, you
> can bootstrap from topic A and then continue processing the latest
> messages from the other Kafka topic.
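>
> Roughly, the bootstrap part of that maps to something like the following
> in the job config (a sketch; "topic-A" is just an example name, and these
> are the bootstrap-stream settings described in the Samza configuration
> docs):
>
>   # Treat topic A as a bootstrap stream: read it up to its current head
>   # before processing messages from the other input streams.
>   systems.kafka.streams.topic-A.samza.bootstrap=true
>   # Start from the beginning of topic A rather than the latest offset.
>   systems.kafka.streams.topic-A.samza.offset.default=oldest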
>
> Thanks!
> Navina
>
> On 5/29/15, 2:26 PM, "Navina Ramesh" <nram...@linkedin.com> wrote:
>
> >Hi Zach,
> >
> >It sounds like you are asking for a SystemConsumer for HDFS. Does
> >SAMZA-263 match your requirements?
> >
> >Thanks!
> >Navina
> >
> >On 5/29/15, 2:23 PM, "Zach Cox" <zcox...@gmail.com> wrote:
> >
> >>(Continuing from the previous email.) In addition to not wanting to
> >>duplicate code, say that some of the Samza jobs need to build up state,
> >>and it's important to build up this state from all of those old events
> >>no longer in Kafka. If that state were only built from the last 7 days
> >>of events, some things would be missing and the data would be incomplete.
> >>
> >>On Fri, May 29, 2015 at 4:20 PM Zach Cox <zcox...@gmail.com> wrote:
> >>
> >>> Let's also add to the story: say the company wants to only write code
> >>> for Samza, and not duplicate the same code in MapReduce jobs (or any
> >>> other framework).
> >>>
> >>> On Fri, May 29, 2015 at 4:16 PM Benjamin Black <b...@b3k.us> wrote:
> >>>
> >>>> Why not run a MapReduce job on the data in HDFS? That's what it was
> >>>> made for.
> >>>> On May 29, 2015 2:13 PM, "Zach Cox" <zcox...@gmail.com> wrote:
> >>>>
> >>>> > Hi -
> >>>> >
> >>>> > Let's say one day a company wants to start doing all of this awesome
> >>>> > data integration/near-real-time stream processing stuff, so they
> >>>> > start sending their user activity events (e.g. pageviews, ad
> >>>> > impressions, etc.) to Kafka. Then they hook up Camus to copy new
> >>>> > events from Kafka to HDFS every hour. They use the default Kafka log
> >>>> > retention period of 7 days. So after a few months, Kafka has the
> >>>> > last 7 days of events, and HDFS has all events except the newest
> >>>> > events not yet transferred by Camus.
> >>>> >
> >>>> > Then the company wants to build out a system that uses Samza to
> >>>> > process the user activity events from Kafka and output them to some
> >>>> > queryable data store. If standard Samza reprocessing [1] is used,
> >>>> > then only the last 7 days of events in Kafka get processed and put
> >>>> > into the data store. Of course, all future events then also
> >>>> > seamlessly get processed by the Samza jobs and put into the data
> >>>> > store, which is awesome.
> >>>> >
> >>>> > But let's say this company needs all of the historical events to be
> >>>> > processed by Samza and put into the data store (i.e. the events
> >>>> > older than 7 days that are in HDFS but no longer in Kafka). It's a
> >>>> > Business Critical thing and absolutely must happen. How should this
> >>>> > company achieve this?
> >>>> >
> >>>> > I'm sure there are many potential solutions to this problem, but
> >>>> > has anyone actually done this? What approach did you take?
> >>>> >
> >>>> > Any experiences or thoughts would be hugely appreciated.
> >>>> >
> >>>> > Thanks,
> >>>> > Zach
> >>>> >
> >>>> > [1]
> >>>> > http://samza.apache.org/learn/documentation/0.9/jobs/reprocessing.html
> >>>> >
> >>>>
> >>>
> >
>
>
