Hi Zach,
Regarding the JIRA: it is assigned to Jakob Homan, so he would be the
right person to comment on it.

Thanks!
Navina

On 5/29/15, 2:33 PM, "Zach Cox" <zcox...@gmail.com> wrote:

>Hi Navina,
>
>I did see that JIRA, and it would definitely be useful. I was thinking
>of maybe trying to build a composite stream that would first read old
>events from HDFS and then switch over to Kafka.
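>
>Roughly, I'm imagining something like this. Just a sketch: the
>HdfsThenKafkaConsumer class and its delegates are made-up names, not
>an existing Samza API (only the SystemConsumer interface is real, and
>I'm paraphrasing its signatures from memory):
>
>  import java.util.List;
>  import java.util.Map;
>  import java.util.Set;
>  import org.apache.samza.system.IncomingMessageEnvelope;
>  import org.apache.samza.system.SystemConsumer;
>  import org.apache.samza.system.SystemStreamPartition;
>
>  // Composite consumer: drain historical events from an HDFS-backed
>  // delegate first, then hand off to Kafka for live events.
>  public class HdfsThenKafkaConsumer implements SystemConsumer {
>    private final SystemConsumer hdfsConsumer;   // reads old events
>    private final SystemConsumer kafkaConsumer;  // reads live events
>    private boolean hdfsExhausted = false;
>
>    public HdfsThenKafkaConsumer(SystemConsumer hdfs, SystemConsumer kafka) {
>      this.hdfsConsumer = hdfs;
>      this.kafkaConsumer = kafka;
>    }
>
>    public void start() { hdfsConsumer.start(); kafkaConsumer.start(); }
>    public void stop()  { hdfsConsumer.stop();  kafkaConsumer.stop();  }
>
>    public void register(SystemStreamPartition ssp, String offset) {
>      hdfsConsumer.register(ssp, offset);
>      kafkaConsumer.register(ssp, offset);
>    }
>
>    public Map<SystemStreamPartition, List<IncomingMessageEnvelope>> poll(
>        Set<SystemStreamPartition> ssps, long timeout)
>        throws InterruptedException {
>      if (!hdfsExhausted) {
>        Map<SystemStreamPartition, List<IncomingMessageEnvelope>> batch =
>            hdfsConsumer.poll(ssps, timeout);
>        if (!batch.isEmpty()) {
>          return batch;
>        }
>        // Simplified: a real version would track end-of-file per
>        // partition and line up HDFS's last event with Kafka's offsets.
>        hdfsExhausted = true;
>      }
>      return kafkaConsumer.poll(ssps, timeout);
>    }
>  }
>
>The tricky part would be the handoff: making sure the last event read
>from HDFS lines up with the first Kafka offset the job resumes from,
>so nothing is skipped or double-processed.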
>
>Do you know if there has been any movement on treating HDFS as a Samza
>stream?
>
>Thanks,
>Zach
>
>On Fri, May 29, 2015 at 4:27 PM Navina Ramesh
><nram...@linkedin.com.invalid>
>wrote:
>
>> Hi Zach,
>>
>> It sounds like you are asking for a SystemConsumer for HDFS. Does
>> SAMZA-263 match your requirements?
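>>
>> For reference, a consumer like that would plug into Samza as its own
>> system via a SystemFactory. The wiring would look roughly like this
>> (the factory class and stream names here are hypothetical):
>>
>>   # Register an "hdfs" system whose factory builds the SystemConsumer
>>   systems.hdfs.samza.factory=samza.system.hdfs.HdfsSystemFactory
>>   # Jobs can then consume hdfs streams as task inputs
>>   task.inputs=hdfs.user-activity-archive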
>>
>> Thanks!
>> Navina
>>
>> On 5/29/15, 2:23 PM, "Zach Cox" <zcox...@gmail.com> wrote:
>>
>> >(Continuing from my previous email.) In addition to not wanting to
>> >duplicate code, say that some of the Samza jobs need to build up
>> >state, and it's important to build that state from all of the old
>> >events that are no longer in Kafka. If the state were built from only
>> >the last 7 days of events, everything older would be missing and the
>> >data would be incomplete.
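>> >
>> >(For what it's worth, Samza's bootstrap-stream support covers the
>> >case where the state-rebuilding input still lives in a stream. For
>> >example, with a made-up stream name,
>> >
>> >  systems.kafka.streams.user-activity.samza.bootstrap=true
>> >
>> >replays that stream fully before any other input is processed. But
>> >that only reaches whatever Kafka still retains, which is exactly the
>> >problem here.)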
>> >
>> >On Fri, May 29, 2015 at 4:20 PM Zach Cox <zcox...@gmail.com> wrote:
>> >
>> >> Let's also add to the story: say the company wants to write code
>> >> only for Samza, and not duplicate the same code in MapReduce jobs
>> >> (or any other framework).
>> >>
>> >> On Fri, May 29, 2015 at 4:16 PM Benjamin Black <b...@b3k.us> wrote:
>> >>
>> >>> Why not run a MapReduce job on the data in HDFS? That's what it
>> >>> was made for.
>> >>> On May 29, 2015 2:13 PM, "Zach Cox" <zcox...@gmail.com> wrote:
>> >>>
>> >>> > Hi -
>> >>> >
>> >>> > Let's say one day a company wants to start doing all of this
>> >>> > awesome data integration/near-real-time stream processing stuff,
>> >>> > so they start sending their user activity events (pageviews, ad
>> >>> > impressions, etc.) to Kafka. Then they hook up Camus to copy new
>> >>> > events from Kafka to HDFS every hour. They use the default Kafka
>> >>> > log retention period of 7 days. So after a few months, Kafka has
>> >>> > the last 7 days of events, and HDFS has all events except the
>> >>> > newest ones not yet transferred by Camus.
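>> >>> >
>> >>> > (For reference, that 7-day default is the broker-side setting in
>> >>> > Kafka's server.properties:
>> >>> >
>> >>> >   # default log retention: 168 hours = 7 days
>> >>> >   log.retention.hours=168
>> >>> >
>> >>> > so anything older gets deleted from Kafka.)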
>> >>> >
>> >>> > Then the company wants to build out a system that uses Samza to
>> >>> > process the user activity events from Kafka and output them to
>> >>> > some queryable data store. If standard Samza reprocessing [1] is
>> >>> > used, then only the last 7 days of events in Kafka get processed
>> >>> > and put into the data store. Of course, all future events then
>> >>> > also seamlessly get processed by the Samza jobs and put into the
>> >>> > data store, which is awesome.
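>> >>> >
>> >>> > (Per [1], that reprocessing amounts to rewinding the job's
>> >>> > starting offsets on redeploy, roughly like this, with a made-up
>> >>> > stream name:
>> >>> >
>> >>> >   # ignore the old checkpoint and restart from the oldest
>> >>> >   # offset Kafka still retains
>> >>> >   systems.kafka.streams.user-activity.samza.reset.offset=true
>> >>> >   systems.kafka.streams.user-activity.samza.offset.default=oldest
>> >>> >
>> >>> > which can only reach back as far as the retention window.)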
>> >>> >
>> >>> > But let's say this company needs all of the historical events to
>> >>> > be processed by Samza and put into the data store (i.e. the
>> >>> > events older than 7 days that are in HDFS but no longer in
>> >>> > Kafka). It's a Business Critical thing and absolutely must
>> >>> > happen. How should this company achieve this?
>> >>> >
>> >>> > I'm sure there are many potential solutions to this problem, but
>> >>> > has anyone actually done this? What approach did you take?
>> >>> >
>> >>> > Any experiences or thoughts would be hugely appreciated.
>> >>> >
>> >>> > Thanks,
>> >>> > Zach
>> >>> >
>> >>> > [1]
>> >>> > http://samza.apache.org/learn/documentation/0.9/jobs/reprocessing.html
>> >>> >
>> >>>
>> >>
>>
>>
