Hi Zach,

I agree. It is not a good idea to keep the entire set of historical data
in Kafka. The retention period in Kafka does make it trickier to
synchronize with your Hadoop data pump. I am not very familiar with the
Camus2Kafka project, but it sounds like a workable solution.

The ideal solution would be to consume/bootstrap directly from HDFS :)
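In the meantime, the workaround I described below (pump the HDFS data into
a Kafka topic, then bootstrap from it) would look roughly like this in a
Samza 0.9 job config. The topic and system names here are made up for
illustration:

```properties
# Consume both the backfill topic (fed from HDFS) and the live topic.
task.inputs=kafka.events-backfill,kafka.events-live

systems.kafka.samza.factory=org.apache.samza.system.kafka.KafkaSystemFactory

# Mark the backfill topic as a bootstrap stream: Samza reads it up to its
# current head before processing messages from any other input stream.
systems.kafka.streams.events-backfill.samza.bootstrap=true

# Start the backfill topic from the beginning so no historical messages
# are skipped.
systems.kafka.streams.events-backfill.samza.offset.default=oldest
systems.kafka.streams.events-backfill.samza.reset.offset=true
```

Once the bootstrap stream is caught up, the job processes the live topic
as usual, so new events flow through the same code path.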

Cheers!
Navina 

On 5/29/15, 2:44 PM, "Zach Cox" <zcox...@gmail.com> wrote:

>Hi Navina,
>
>A similar approach I considered was using an infinite/very large retention
>period on those event kafka topics, so they would always contain all
>historical events. Then standard Samza reprocessing goes through all old
>events.
>
>I'm hesitant to pursue that though, as those topic partitions then grow
>unbounded over time, which seems problematic.
>
>Thanks,
>Zach
>
>On Fri, May 29, 2015 at 4:34 PM Navina Ramesh
><nram...@linkedin.com.invalid>
>wrote:
>
>> That said, since we don't yet support consuming from hdfs, one
>>workaround
>> would be to periodically read from hdfs and pump the data to a kafka
>>topic
>> (say topic A) using a hadoop / yarn based job. Then, in your Samza job,
>> you can bootstrap from topic A and then, continue processing the latest
>> messages from the other Kafka topic.
>>
>> Thanks!
>> Navina
>>
>> On 5/29/15, 2:26 PM, "Navina Ramesh" <nram...@linkedin.com> wrote:
>>
>> >Hi Zach,
>> >
>> >It sounds like you are asking for a SystemConsumer for hdfs. Does
>> >SAMZA-263 match your requirements?
>> >
>> >Thanks!
>> >Navina
>> >
>> >On 5/29/15, 2:23 PM, "Zach Cox" <zcox...@gmail.com> wrote:
>> >
>> >>(continuing from previous email) in addition to not wanting to
>>duplicate
>> >>code, say that some of the Samza jobs need to build up state, and it's
>> >>important to build up this state from all of those old events no
>>longer
>> >>in
>> >>Kafka. If that state was only built from the last 7 days of events,
>>some
>> >>things would be missing and the data would be incomplete.
>> >>
>> >>On Fri, May 29, 2015 at 4:20 PM Zach Cox <zcox...@gmail.com> wrote:
>> >>
>> >>> Let's also add to the story: say the company wants to only write
>>code
>> >>>for
>> >>> Samza, and not duplicate the same code in MapReduce jobs (or any
>>other
>> >>> framework).
>> >>>
>> >>> On Fri, May 29, 2015 at 4:16 PM Benjamin Black <b...@b3k.us> wrote:
>> >>>
>> >>>> Why not run a map reduce job on the data in hdfs? That's what it was
>> >>>>made for.
>> >>>> On May 29, 2015 2:13 PM, "Zach Cox" <zcox...@gmail.com> wrote:
>> >>>>
>> >>>> > Hi -
>> >>>> >
>> >>>> > Let's say one day a company wants to start doing all of this
>>awesome
>> >>>> data
>> >>>> > integration/near-real-time stream processing stuff, so they start
>> >>>> sending
>> >>>> > their user activity events (e.g. pageviews, ad impressions, etc)
>>to
>> >>>> Kafka.
>> >>>> > Then they hook up Camus to copy new events from Kafka to HDFS
>>every
>> >>>> hour.
>> >>>> > They use the default Kafka log retention period of 7 days. So
>>after
>> >>>>a
>> >>>> few
>> >>>> > months, Kafka has the last 7 days of events, and HDFS has all
>>events
>> >>>> except
>> >>>> > the newest events not yet transferred by Camus.
>> >>>> >
>> >>>> > Then the company wants to build out a system that uses Samza to
>> >>>>process
>> >>>> the
>> >>>> > user activity events from Kafka and output it to some queryable
>>data
>> >>>> store.
>> >>>> > If standard Samza reprocessing [1] is used, then only the last 7
>> >>>>days of
>> >>>> > events in Kafka get processed and put into the data store. Of
>> >>>>course,
>> >>>> then
>> >>>> > all future events also seamlessly get processed by the Samza jobs
>> >>>>and
>> >>>> put
>> >>>> > into the data store, which is awesome.
>> >>>> >
>> >>>> > But let's say this company needs all of the historical events to
>>be
>> >>>> > processed by Samza and put into the data store (i.e. the events
>> >>>>older
>> >>>> than
>> >>>> > 7 days that are in HDFS but no longer in Kafka). It's a Business
>> >>>> Critical
>> >>>> > thing and absolutely must happen. How should this company achieve
>> >>>>this?
>> >>>> >
>> >>>> > I'm sure there are many potential solutions to this problem, but
>>has
>> >>>> anyone
>> >>>> > actually done this? What approach did you take?
>> >>>> >
>> >>>> > Any experiences or thoughts would be hugely appreciated.
>> >>>> >
>> >>>> > Thanks,
>> >>>> > Zach
>> >>>> >
>> >>>> > [1]
>> >>>>
>> http://samza.apache.org/learn/documentation/0.9/jobs/reprocessing.html
>> >>>> >
>> >>>>
>> >>>
>> >
>>
>>
