Hi Navina,

Do you mean bootstrapping from HDFS as in [1]? That is an interesting idea
I hadn't thought of. Maybe that could be combined with the offsets stored
by Camus to determine the right place to transition to the real-time Kafka
stream?
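Concretely, I'm picturing something along the lines of the bootstrap config
in [1], pointed at a topic that a periodic job fills from HDFS. Rough sketch
only; the system and topic names here are just placeholders:

  # "activity-history" is a hypothetical topic filled from HDFS by a
  # periodic job; "activity" is the live event topic.
  task.inputs=kafka.activity-history,kafka.activity

  # Fully consume the historical topic, starting from its oldest available
  # offset, before processing messages from any other input stream.
  systems.kafka.streams.activity-history.samza.bootstrap=true
  systems.kafka.streams.activity-history.samza.reset.offset=true
  systems.kafka.streams.activity-history.samza.offset.default=oldest

The offsets recorded by Camus would presumably still be needed to figure out
where in the live topic to start, so events around the handoff aren't
skipped or double-processed.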
Thanks,
Zach

[1]
http://samza.apache.org/learn/documentation/0.9/container/streams.html#bootstrapping

On Fri, May 29, 2015 at 4:53 PM Navina Ramesh <nram...@linkedin.com.invalid>
wrote:

> Hi Zach,
>
> I agree. It is not a good idea to keep the entire set of historical data
> in Kafka.
> The retention period in Kafka does make it trickier to synchronize with
> your Hadoop data pump. I am not very familiar with the Camus2Kafka
> project. But that sounds like a workable solution.
>
> The ideal solution would be to consume/bootstrap directly from HDFS :)
>
> Cheers!
> Navina
>
> On 5/29/15, 2:44 PM, "Zach Cox" <zcox...@gmail.com> wrote:
>
> >Hi Navina,
> >
> >A similar approach I considered was using an infinite/very large
> >retention period on those event Kafka topics, so they would always
> >contain all historical events. Then standard Samza reprocessing goes
> >through all old events.
> >
> >I'm hesitant to pursue that, though, as those topic partitions then grow
> >unbounded over time, which seems problematic.
> >
> >Thanks,
> >Zach
> >
> >On Fri, May 29, 2015 at 4:34 PM Navina Ramesh
> ><nram...@linkedin.com.invalid>
> >wrote:
> >
> >> That said, since we don't yet support consuming from HDFS, one
> >> workaround would be to periodically read from HDFS and pump the data
> >> to a Kafka topic (say topic A) using a Hadoop / YARN based job. Then,
> >> in your Samza job, you can bootstrap from topic A and then continue
> >> processing the latest messages from the other Kafka topic.
> >>
> >> Thanks!
> >> Navina
> >>
> >> On 5/29/15, 2:26 PM, "Navina Ramesh" <nram...@linkedin.com> wrote:
> >>
> >> >Hi Zach,
> >> >
> >> >It sounds like you are asking for a SystemConsumer for HDFS. Does
> >> >SAMZA-263 match your requirements?
> >> >
> >> >Thanks!
> >> >Navina
> >> >
> >> >On 5/29/15, 2:23 PM, "Zach Cox" <zcox...@gmail.com> wrote:
> >> >
> >> >>(continuing from previous email) In addition to not wanting to
> >> >>duplicate code, say that some of the Samza jobs need to build up
> >> >>state, and it's important to build up this state from all of those
> >> >>old events no longer in Kafka. If that state was only built from the
> >> >>last 7 days of events, some things would be missing and the data
> >> >>would be incomplete.
> >> >>
> >> >>On Fri, May 29, 2015 at 4:20 PM Zach Cox <zcox...@gmail.com> wrote:
> >> >>
> >> >>> Let's also add to the story: say the company wants to only write
> >> >>> code for Samza, and not duplicate the same code in MapReduce jobs
> >> >>> (or any other framework).
> >> >>>
> >> >>> On Fri, May 29, 2015 at 4:16 PM Benjamin Black <b...@b3k.us> wrote:
> >> >>>
> >> >>>> Why not run a MapReduce job on the data in HDFS? That's what it
> >> >>>> was made for.
> >> >>>> On May 29, 2015 2:13 PM, "Zach Cox" <zcox...@gmail.com> wrote:
> >> >>>>
> >> >>>> > Hi -
> >> >>>> >
> >> >>>> > Let's say one day a company wants to start doing all of this
> >> >>>> > awesome data integration/near-real-time stream processing
> >> >>>> > stuff, so they start sending their user activity events (e.g.
> >> >>>> > pageviews, ad impressions, etc.) to Kafka. Then they hook up
> >> >>>> > Camus to copy new events from Kafka to HDFS every hour. They
> >> >>>> > use the default Kafka log retention period of 7 days. So after
> >> >>>> > a few months, Kafka has the last 7 days of events, and HDFS has
> >> >>>> > all events except the newest events not yet transferred by
> >> >>>> > Camus.
> >> >>>> >
> >> >>>> > Then the company wants to build out a system that uses Samza to
> >> >>>> > process the user activity events from Kafka and output them to
> >> >>>> > some queryable data store. If standard Samza reprocessing [1]
> >> >>>> > is used, then only the last 7 days of events in Kafka get
> >> >>>> > processed and put into the data store. Of course, all future
> >> >>>> > events then also seamlessly get processed by the Samza jobs and
> >> >>>> > put into the data store, which is awesome.
> >> >>>> >
> >> >>>> > But let's say this company needs all of the historical events
> >> >>>> > to be processed by Samza and put into the data store (i.e. the
> >> >>>> > events older than 7 days that are in HDFS but no longer in
> >> >>>> > Kafka). It's a Business Critical thing and absolutely must
> >> >>>> > happen. How should this company achieve this?
> >> >>>> >
> >> >>>> > I'm sure there are many potential solutions to this problem,
> >> >>>> > but has anyone actually done this? What approach did you take?
> >> >>>> >
> >> >>>> > Any experiences or thoughts would be hugely appreciated.
> >> >>>> >
> >> >>>> > Thanks,
> >> >>>> > Zach
> >> >>>> >
> >> >>>> > [1]
> >> >>>> > http://samza.apache.org/learn/documentation/0.9/jobs/reprocessing.html
> >> >>>> >
> >> >>>>
> >> >>>
> >> >>
> >> >
> >>
> >
>