Hi Felix - that's an excellent point. Let's assume in this case that the Samza processing is not perfectly idempotent, so duplicating 7 days of events is not OK, but there is tolerance for duplicating several seconds or minutes of events, or for handling events out of order for a few seconds, etc.
Thanks,
Zach

On Fri, May 29, 2015 at 4:44 PM Zach Cox <zcox...@gmail.com> wrote:

> Hi Navina,
>
> A similar approach I considered was using an infinite/very large retention
> period on those event Kafka topics, so they would always contain all
> historical events. Then standard Samza reprocessing goes through all old
> events.
>
> I'm hesitant to pursue that though, as those topic partitions then grow
> unbounded over time, which seems problematic.
>
> Thanks,
> Zach
>
> On Fri, May 29, 2015 at 4:34 PM Navina Ramesh <nram...@linkedin.com.invalid> wrote:
>
>> That said, since we don't yet support consuming from hdfs, one workaround
>> would be to periodically read from hdfs and pump the data to a Kafka topic
>> (say topic A) using a hadoop/yarn based job. Then, in your Samza job, you
>> can bootstrap from topic A and then continue processing the latest
>> messages from the other Kafka topic.
>>
>> Thanks!
>> Navina
>>
>> On 5/29/15, 2:26 PM, "Navina Ramesh" <nram...@linkedin.com> wrote:
>>
>>> Hi Zach,
>>>
>>> It sounds like you are asking for a SystemConsumer for hdfs. Does
>>> SAMZA-263 match your requirements?
>>>
>>> Thanks!
>>> Navina
>>>
>>> On 5/29/15, 2:23 PM, "Zach Cox" <zcox...@gmail.com> wrote:
>>>
>>>> (continuing from previous email) In addition to not wanting to duplicate
>>>> code, say that some of the Samza jobs need to build up state, and it's
>>>> important to build up this state from all of those old events no longer
>>>> in Kafka. If that state was only built from the last 7 days of events,
>>>> some things would be missing and the data would be incomplete.
>>>>
>>>> On Fri, May 29, 2015 at 4:20 PM Zach Cox <zcox...@gmail.com> wrote:
>>>>
>>>>> Let's also add to the story: say the company wants to only write code
>>>>> for Samza, and not duplicate the same code in MapReduce jobs (or any
>>>>> other framework).
>>>>>
>>>>> On Fri, May 29, 2015 at 4:16 PM Benjamin Black <b...@b3k.us> wrote:
>>>>>
>>>>>> Why not run a MapReduce job on the data in HDFS? That's what it was
>>>>>> made for.
>>>>>>
>>>>>> On May 29, 2015 2:13 PM, "Zach Cox" <zcox...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi -
>>>>>>>
>>>>>>> Let's say one day a company wants to start doing all of this awesome
>>>>>>> data integration/near-real-time stream processing stuff, so they
>>>>>>> start sending their user activity events (e.g. pageviews, ad
>>>>>>> impressions, etc.) to Kafka. Then they hook up Camus to copy new
>>>>>>> events from Kafka to HDFS every hour. They use the default Kafka log
>>>>>>> retention period of 7 days. So after a few months, Kafka has the last
>>>>>>> 7 days of events, and HDFS has all events except the newest events
>>>>>>> not yet transferred by Camus.
>>>>>>>
>>>>>>> Then the company wants to build out a system that uses Samza to
>>>>>>> process the user activity events from Kafka and output them to some
>>>>>>> queryable data store. If standard Samza reprocessing [1] is used,
>>>>>>> then only the last 7 days of events in Kafka get processed and put
>>>>>>> into the data store. Of course, all future events also seamlessly get
>>>>>>> processed by the Samza jobs and put into the data store, which is
>>>>>>> awesome.
>>>>>>>
>>>>>>> But let's say this company needs all of the historical events to be
>>>>>>> processed by Samza and put into the data store (i.e. the events older
>>>>>>> than 7 days that are in HDFS but no longer in Kafka). It's a Business
>>>>>>> Critical thing and absolutely must happen. How should this company
>>>>>>> achieve this?
>>>>>>>
>>>>>>> I'm sure there are many potential solutions to this problem, but has
>>>>>>> anyone actually done this? What approach did you take?
>>>>>>>
>>>>>>> Any experiences or thoughts would be hugely appreciated.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Zach
>>>>>>>
>>>>>>> [1]
>>>>>>> http://samza.apache.org/learn/documentation/0.9/jobs/reprocessing.html
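
For reference, here's a rough sketch of the job configuration for the bootstrap workaround Navina describes above, assuming hypothetical topic names historical-events (the topic the hadoop/yarn job pumps the HDFS data into) and user-activity (the live event topic); the property names are taken from the 0.9 configuration docs, so please double-check them against your Samza version:

  # Consume both the backfill topic and the live event topic
  task.inputs=kafka.historical-events,kafka.user-activity
  # Fully consume historical-events before processing any other input
  systems.kafka.streams.historical-events.samza.bootstrap=true
  # Always (re)start the backfill topic from its oldest available offset
  systems.kafka.streams.historical-events.samza.reset.offset=true
  systems.kafka.streams.historical-events.samza.offset.default=oldest

Once the job has caught up on historical-events, it continues consuming user-activity as usual.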
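And for completeness, the infinite-retention alternative I mentioned above would just be a per-topic override of Kafka's time-based retention, roughly like this (topic name again just an example, and I haven't verified the exact flags/semantics on our broker version):

  # Keep all events on the topic indefinitely instead of the 7-day default
  bin/kafka-topics.sh --zookeeper localhost:2181 --alter \
    --topic user-activity --config retention.ms=-1

But as I said above, that leaves the topic partitions growing without bound, which is why I'm hesitant to go that route.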