Hi Zach,

Yes. Bootstrapping from hdfs directly is possible, provided you have an hdfs consumer defined (SAMZA-263 is a pre-requisite). :) If you have a way of getting hdfs data into Kafka, then you can use it as a bootstrap stream. However, a bootstrap stream is only temporarily prioritized, and it always consumes all messages in the stream.
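For reference, marking the replayed topic as a bootstrap stream is just job config, roughly along these lines (per the bootstrapping section of the 0.9 streams docs; the "user-activity-historical" topic name is a placeholder):

systems.kafka.streams.user-activity-historical.samza.bootstrap=true
systems.kafka.streams.user-activity-historical.samza.reset.offset=true
systems.kafka.streams.user-activity-historical.samza.offset.default=oldest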
If you want to use the offsets stored by Camus to determine the right place to transition to real-time Kafka, then you might consider implementing a custom MessageChooser [1]. This will give you more fine-grained control over message selection from your input streams.
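For illustration only, here is a minimal sketch of such a chooser. It assumes the historical data has been replayed from HDFS into a separate Kafka topic, that both topics are partitioned the same way, and that the per-partition cut-over offsets recorded by Camus have already been loaded into a map; the class name, stream name, and offset bookkeeping are all placeholders, not an existing implementation:

import java.util.HashSet;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.SystemStreamPartition;
import org.apache.samza.system.chooser.MessageChooser;

/**
 * Sketch: drain the replayed "historical" topic up to the per-partition
 * offsets recorded by Camus, and only then hand out messages from the
 * live topic for that partition.
 */
public class HistoricalFirstChooser implements MessageChooser {
  // Placeholder topic name; in practice this would come from config.
  private static final String HISTORICAL_STREAM = "user-activity-historical";

  // partition -> last offset of the replayed data (derived from Camus' bookkeeping)
  private final Map<Integer, Long> cutoverOffsets;
  private final Set<Integer> historicalDone = new HashSet<Integer>();
  private final Map<SystemStreamPartition, IncomingMessageEnvelope> buffered =
      new LinkedHashMap<SystemStreamPartition, IncomingMessageEnvelope>();

  public HistoricalFirstChooser(Map<Integer, Long> cutoverOffsets) {
    this.cutoverOffsets = cutoverOffsets;
  }

  @Override
  public void update(IncomingMessageEnvelope envelope) {
    // The container gives the chooser at most one outstanding envelope per partition.
    buffered.put(envelope.getSystemStreamPartition(), envelope);
  }

  @Override
  public IncomingMessageEnvelope choose() {
    // First, prefer anything from the replayed historical topic.
    Iterator<Map.Entry<SystemStreamPartition, IncomingMessageEnvelope>> it =
        buffered.entrySet().iterator();
    while (it.hasNext()) {
      Map.Entry<SystemStreamPartition, IncomingMessageEnvelope> entry = it.next();
      SystemStreamPartition ssp = entry.getKey();
      if (!HISTORICAL_STREAM.equals(ssp.getStream())) {
        continue;
      }
      int partition = ssp.getPartition().getPartitionId();
      Long cutover = cutoverOffsets.get(partition);
      if (cutover != null && Long.parseLong(entry.getValue().getOffset()) >= cutover) {
        historicalDone.add(partition);  // this partition's history is now drained
      }
      it.remove();
      return entry.getValue();
    }
    // Then serve the live topic, but only for partitions whose history is drained.
    it = buffered.entrySet().iterator();
    while (it.hasNext()) {
      Map.Entry<SystemStreamPartition, IncomingMessageEnvelope> entry = it.next();
      int partition = entry.getKey().getPartition().getPartitionId();
      if (historicalDone.contains(partition)) {
        it.remove();
        return entry.getValue();
      }
    }
    return null;  // hold live messages back until the historical topic catches up
  }

  @Override
  public void register(SystemStreamPartition ssp, String offset) { }

  @Override
  public void start() { }

  @Override
  public void stop() { }
}

You would wire it in through a MessageChooserFactory and the task.chooser.class config. How the offsets Camus recorded map onto offsets in the replayed topic is glossed over here and would need to be worked out for your setup.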
Cheers!
Navina

[1] https://github.com/apache/samza/blob/master/samza-api/src/main/java/org/apache/samza/system/chooser/MessageChooser.java#L26

On 5/29/15, 3:07 PM, "Zach Cox" <zcox...@gmail.com> wrote:

>Hi Navina,
>
>Do you mean bootstrapping from hdfs as in [1]? That is an interesting idea
>I hadn't thought of. Maybe that could be combined with the offsets stored
>by Camus to determine the right place to transition to the real-time kafka
>stream?
>
>Thanks,
>Zach
>
>[1]
>http://samza.apache.org/learn/documentation/0.9/container/streams.html#bootstrapping
>
>
>On Fri, May 29, 2015 at 4:53 PM Navina Ramesh
><nram...@linkedin.com.invalid> wrote:
>
>> Hi Zach,
>>
>> I agree. It is not a good idea to keep the entire set of historical data
>> in Kafka. The retention period in Kafka does make it trickier to
>> synchronize with your hadoop data pump. I am not very familiar with the
>> Camus2Kafka project, but that sounds like a workable solution.
>>
>> The ideal solution would be to consume/bootstrap directly from HDFS :)
>>
>> Cheers!
>> Navina
>>
>> On 5/29/15, 2:44 PM, "Zach Cox" <zcox...@gmail.com> wrote:
>>
>> >Hi Navina,
>> >
>> >A similar approach I considered was using an infinite/very large
>> >retention period on those event kafka topics, so they would always
>> >contain all historical events. Then standard Samza reprocessing goes
>> >through all old events.
>> >
>> >I'm hesitant to pursue that though, as those topic partitions then grow
>> >unbounded over time, which seems problematic.
>> >
>> >Thanks,
>> >Zach
>> >
>> >On Fri, May 29, 2015 at 4:34 PM Navina Ramesh
>> ><nram...@linkedin.com.invalid> wrote:
>> >
>> >> That said, since we don't yet support consuming from hdfs, one
>> >> workaround would be to periodically read from hdfs and pump the data
>> >> to a kafka topic (say topic A) using a hadoop / yarn based job. Then,
>> >> in your Samza job, you can bootstrap from topic A and then continue
>> >> processing the latest messages from the other Kafka topic.
>> >>
>> >> Thanks!
>> >> Navina
>> >>
>> >> On 5/29/15, 2:26 PM, "Navina Ramesh" <nram...@linkedin.com> wrote:
>> >>
>> >> >Hi Zach,
>> >> >
>> >> >It sounds like you are asking for a SystemConsumer for hdfs. Does
>> >> >SAMZA-263 match your requirements?
>> >> >
>> >> >Thanks!
>> >> >Navina
>> >> >
>> >> >On 5/29/15, 2:23 PM, "Zach Cox" <zcox...@gmail.com> wrote:
>> >> >
>> >> >>(continuing from previous email) In addition to not wanting to
>> >> >>duplicate code, say that some of the Samza jobs need to build up
>> >> >>state, and it's important to build up this state from all of those
>> >> >>old events no longer in Kafka. If that state was only built from the
>> >> >>last 7 days of events, some things would be missing and the data
>> >> >>would be incomplete.
>> >> >>
>> >> >>On Fri, May 29, 2015 at 4:20 PM Zach Cox <zcox...@gmail.com> wrote:
>> >> >>
>> >> >>> Let's also add to the story: say the company wants to only write
>> >> >>> code for Samza, and not duplicate the same code in MapReduce jobs
>> >> >>> (or any other framework).
>> >> >>>
>> >> >>> On Fri, May 29, 2015 at 4:16 PM Benjamin Black <b...@b3k.us> wrote:
>> >> >>>
>> >> >>>> Why not run a map reduce job on the data in hdfs? That's what it
>> >> >>>> was made for.
>> >> >>>> On May 29, 2015 2:13 PM, "Zach Cox" <zcox...@gmail.com> wrote:
>> >> >>>>
>> >> >>>> > Hi -
>> >> >>>> >
>> >> >>>> > Let's say one day a company wants to start doing all of this
>> >> >>>> > awesome data integration/near-real-time stream processing stuff,
>> >> >>>> > so they start sending their user activity events (e.g. pageviews,
>> >> >>>> > ad impressions, etc) to Kafka. Then they hook up Camus to copy
>> >> >>>> > new events from Kafka to HDFS every hour. They use the default
>> >> >>>> > Kafka log retention period of 7 days. So after a few months,
>> >> >>>> > Kafka has the last 7 days of events, and HDFS has all events
>> >> >>>> > except the newest events not yet transferred by Camus.
>> >> >>>> >
>> >> >>>> > Then the company wants to build out a system that uses Samza to
>> >> >>>> > process the user activity events from Kafka and output them to
>> >> >>>> > some queryable data store. If standard Samza reprocessing [1] is
>> >> >>>> > used, then only the last 7 days of events in Kafka get processed
>> >> >>>> > and put into the data store. Of course, then all future events
>> >> >>>> > also seamlessly get processed by the Samza jobs and put into the
>> >> >>>> > data store, which is awesome.
>> >> >>>> >
>> >> >>>> > But let's say this company needs all of the historical events to
>> >> >>>> > be processed by Samza and put into the data store (i.e. the
>> >> >>>> > events older than 7 days that are in HDFS but no longer in
>> >> >>>> > Kafka). It's a Business Critical thing and absolutely must
>> >> >>>> > happen. How should this company achieve this?
>> >> >>>> >
>> >> >>>> > I'm sure there are many potential solutions to this problem, but
>> >> >>>> > has anyone actually done this? What approach did you take?
>> >> >>>> >
>> >> >>>> > Any experiences or thoughts would be hugely appreciated.
>> >> >>>> >
>> >> >>>> > Thanks,
>> >> >>>> > Zach
>> >> >>>> >
>> >> >>>> > [1]
>> >> >>>> > http://samza.apache.org/learn/documentation/0.9/jobs/reprocessing.html