Hi Zach,

Regarding the JIRA, it is assigned to Jakob Homan. He will be the right
person to comment on that.
Thanks!
Navina

On 5/29/15, 2:33 PM, "Zach Cox" <zcox...@gmail.com> wrote:

> Hi Navina,
>
> I did see that JIRA and it would definitely be useful. I was thinking of
> maybe trying to build a composite stream that would first read old events
> from HDFS and then switch over to Kafka.
>
> Do you know if there has been any movement on treating HDFS as a Samza
> stream?
>
> Thanks,
> Zach
>
> On Fri, May 29, 2015 at 4:27 PM Navina Ramesh <nram...@linkedin.com.invalid> wrote:
>
>> Hi Zach,
>>
>> It sounds like you are asking for a SystemConsumer for HDFS. Does
>> SAMZA-263 match your requirements?
>>
>> Thanks!
>> Navina
>>
>> On 5/29/15, 2:23 PM, "Zach Cox" <zcox...@gmail.com> wrote:
>>
>>> (Continuing from previous email.) In addition to not wanting to
>>> duplicate code, say that some of the Samza jobs need to build up state,
>>> and it's important to build up this state from all of those old events
>>> no longer in Kafka. If that state were only built from the last 7 days
>>> of events, some things would be missing and the data would be
>>> incomplete.
>>>
>>> On Fri, May 29, 2015 at 4:20 PM Zach Cox <zcox...@gmail.com> wrote:
>>>
>>>> Let's also add to the story: say the company wants to write code only
>>>> for Samza, and not duplicate the same code in MapReduce jobs (or any
>>>> other framework).
>>>>
>>>> On Fri, May 29, 2015 at 4:16 PM Benjamin Black <b...@b3k.us> wrote:
>>>>
>>>>> Why not run a MapReduce job on the data in HDFS? That is what it was
>>>>> made for.
>>>>>
>>>>> On May 29, 2015 2:13 PM, "Zach Cox" <zcox...@gmail.com> wrote:
>>>>>
>>>>>> Hi -
>>>>>>
>>>>>> Let's say one day a company wants to start doing all of this awesome
>>>>>> data integration/near-real-time stream processing stuff, so they
>>>>>> start sending their user activity events (e.g. pageviews, ad
>>>>>> impressions, etc.) to Kafka. Then they hook up Camus to copy new
>>>>>> events from Kafka to HDFS every hour. They use the default Kafka log
>>>>>> retention period of 7 days. So after a few months, Kafka has the
>>>>>> last 7 days of events, and HDFS has all events except the newest
>>>>>> events not yet transferred by Camus.
>>>>>>
>>>>>> Then the company wants to build out a system that uses Samza to
>>>>>> process the user activity events from Kafka and output them to some
>>>>>> queryable data store. If standard Samza reprocessing [1] is used,
>>>>>> then only the last 7 days of events in Kafka get processed and put
>>>>>> into the data store. Of course, all future events then also
>>>>>> seamlessly get processed by the Samza jobs and put into the data
>>>>>> store, which is awesome.
>>>>>>
>>>>>> But let's say this company needs all of the historical events to be
>>>>>> processed by Samza and put into the data store (i.e. the events
>>>>>> older than 7 days that are in HDFS but no longer in Kafka). It's a
>>>>>> Business Critical thing and absolutely must happen. How should this
>>>>>> company achieve this?
>>>>>>
>>>>>> I'm sure there are many potential solutions to this problem, but has
>>>>>> anyone actually done this? What approach did you take?
>>>>>>
>>>>>> Any experiences or thoughts would be hugely appreciated.
>>>>>>
>>>>>> Thanks,
>>>>>> Zach
>>>>>>
>>>>>> [1] http://samza.apache.org/learn/documentation/0.9/jobs/reprocessing.html
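
The composite-stream idea above would map onto Samza's existing bootstrap
stream mechanism once an HDFS consumer exists: a job marks one input stream
as a bootstrap stream, and Samza consumes that stream to its current head
before delivering messages from any other input. Below is a minimal config
sketch, assuming a hypothetical HDFS SystemFactory of the kind proposed in
SAMZA-263; the org.apache.samza.system.hdfs.HdfsSystemFactory class and the
stream names are illustrative, not a shipping API.

  # Read archived events from HDFS and live events from Kafka.
  task.inputs=hdfs.user-activity,kafka.user-activity

  # Kafka system (ships with Samza).
  systems.kafka.samza.factory=org.apache.samza.system.kafka.KafkaSystemFactory
  systems.kafka.consumer.zookeeper.connect=localhost:2181

  # Hypothetical HDFS system per SAMZA-263; this factory does not exist yet.
  systems.hdfs.samza.factory=org.apache.samza.system.hdfs.HdfsSystemFactory

  # Bootstrap: Samza drains the HDFS stream to its head before it delivers
  # any messages from the Kafka stream, starting from the oldest offset.
  systems.hdfs.streams.user-activity.samza.bootstrap=true
  systems.hdfs.streams.user-activity.samza.offset.default=oldest

One caveat with any such cut-over: since Camus copies hourly and Kafka
retains 7 days, the HDFS archive and the Kafka topic overlap, so events near
the boundary would be seen twice. The job (or the downstream store) would
need to deduplicate or process events idempotently.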