@Ken: I was going to suggest batch processing with Samza, which is pretty much what you just said. Thanks for your valuable input. :)
@Michael: I think the pattern I suggested will not work out for your data scale. Following a batch processing model with Samza can fulfill the requirements of your use-case.
Cheers!
Navina

On Mon, Sep 21, 2015 at 10:28 PM, Jordan Shaw <jor...@pubnub.com> wrote:
> Michael,
> Why not just have a pool of workers outside of Samza that push the raw crawler input, or a subset of it, into a Kafka topic, and then have Samza do the compute/stream work? Basically Samza is not the right tool for what you're suggesting, but it could be used for the downstream work, in my opinion.
> -Jordan
>
> On Mon, Sep 21, 2015 at 9:08 AM, Ken Krugler <kkrugler_li...@transpac.com> wrote:
> >
> > Hi Michael (& Navina),
> >
> > I don't think you need to create a separate background process, at least for the case of web crawling.
> >
> > The challenge is to efficiently use one Samza process to simultaneously fetch many URLs.
> >
> > That does increase the complexity of that process's code, as you wind up having to manage either multi-threaded or async fetch state.
> >
> > But that's the same as for Hadoop-based crawlers, where you have a limited number of parallel reduce tasks doing the fetching - see Nutch and Bixo for examples, e.g. FetchBuffer.
> >
> > And it's the same for storm-crawler, another project I've been involved with in the past.
> >
> > -- Ken
> >
> > > From: Michael Sklyar
> > > Sent: September 21, 2015 5:19:52am PDT
> > > To: dev@samza.apache.org
> > > Subject: Re: Asynchronous approach and samza
> > >
> > > Thanks Navina,
> > > it is much clearer now.
> > >
> > > Unfortunately, in our case, we cannot bootstrap the data in advance (we can't pre-fetch all existing URLs' titles and headers). It sounds to me that, if we want to use Samza, we will need a background process that is synchronized with the main event loop of the task (and handles back-pressure so that no more than X requests can be made simultaneously).
> > >
> > > Regards,
> > > Michael
> > >
> > > On Mon, Sep 21, 2015 at 12:24 PM, Navina Ramesh <nram...@linkedin.com.invalid> wrote:
> > >
> > >> Hi Michael,
> > >> {quote}
> > >> Do you mean that in such a case Samza should be combined with another stream processing framework (such as Storm)?
> > >> {quote}
> > >> No. I didn't mean combining it with any other framework.
> > >>
> > >> {quote}
> > >> "the job bootstraps the data from the source" - do you mean that you have a background process for this purpose or just listen to an additional stream of change log from some other framework?
> > >> {quote}
> > >> I didn't mean a background process. I meant just listening to a change log stream from a data source.
> > >>
> > >> At LinkedIn, we use databus. The jobs configure databus (for a given data source) as one of the input streams for the job. Databus is a source-agnostic distributed change data capture system. You can find more information here <https://github.com/linkedin/databus>. The advantage is that the databus client is capable of "bootstrapping" from the source automatically and then switching to simply capturing changes from the data source. In this scenario, Samza doesn't do anything special, except that it will keep consuming from the databus stream while bootstrapping. Once the bootstrap is complete, the job can start processing events from the other input streams as well.
> > >>
> > >> I hope my explanation clarifies your question. :)
> > >>
> > >> Thanks!
> > >> Navina
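A minimal sketch of the local-buffering pattern described above: one input acts as the change-capture/bootstrap stream and is written into a local store, and the main input is enriched from that store instead of making a blocking remote call. The system, stream, and store names ("databus"/"UrlMetadata", "crawl-urls", "url-metadata") and the String payloads are assumed placeholders, not anything from this thread.

    import org.apache.samza.config.Config;
    import org.apache.samza.storage.kv.KeyValueStore;
    import org.apache.samza.system.IncomingMessageEnvelope;
    import org.apache.samza.system.OutgoingMessageEnvelope;
    import org.apache.samza.system.SystemStream;
    import org.apache.samza.task.InitableTask;
    import org.apache.samza.task.MessageCollector;
    import org.apache.samza.task.StreamTask;
    import org.apache.samza.task.TaskContext;
    import org.apache.samza.task.TaskCoordinator;

    // Sketch only: stream, store, and field names are illustrative.
    public class UrlEnrichmentTask implements StreamTask, InitableTask {
      private KeyValueStore<String, String> urlMetadata; // local buffer fed by the change-capture stream

      @Override
      @SuppressWarnings("unchecked")
      public void init(Config config, TaskContext context) {
        urlMetadata = (KeyValueStore<String, String>) context.getStore("url-metadata");
      }

      @Override
      public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
                          TaskCoordinator coordinator) {
        String stream = envelope.getSystemStreamPartition().getStream();
        if ("UrlMetadata".equals(stream)) {
          // Change-capture / bootstrap input: just keep the local copy up to date.
          urlMetadata.put((String) envelope.getKey(), (String) envelope.getMessage());
        } else {
          // Main input ("crawl-urls"): enrich from the local store, no blocking remote call.
          String url = (String) envelope.getMessage();
          String metadata = urlMetadata.get(url); // may be null if the source has no record for this URL
          collector.send(new OutgoingMessageEnvelope(
              new SystemStream("kafka", "enriched-urls"), url, url + "|" + metadata));
        }
      }
    }

Marking the change-capture input as a bootstrap stream (systems.<system>.streams.<stream>.samza.bootstrap=true in the job config) makes Samza consume it up to its head before delivering messages from the other inputs, which is what lets the local store serve as a warm cache.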
> > >>
> > >> On Mon, Sep 21, 2015 at 1:56 AM, Michael Sklyar <mikesk...@gmail.com> wrote:
> > >>
> > >>> Thank you for your replies,
> > >>>
> > >>> I understand that making an external blocking request in a single event thread will result in extremely low throughput. However, this can be solved by multi-threading and/or an asynchronous approach. It is clear that in any case using external services can never achieve the throughput of simple transformations. However, most stream processing jobs need, from time to time, to query some external storage, web service, etc.
> > >>>
> > >>> Do you mean that in such a case Samza should be combined with another stream processing framework (such as Storm)?
> > >>>
> > >>> Navina, "the job bootstraps the data from the source" - do you mean that you have a background process for this purpose or just listen to an additional stream of change log from some other framework?
> > >>>
> > >>> Thanks,
> > >>> Michael
> > >>>
> > >>> On Mon, Sep 21, 2015 at 6:52 AM, Navina Ramesh <nram...@linkedin.com.invalid> wrote:
> > >>>
> > >>>> Hi Michael,
> > >>>> I agree with what Yan said. While nothing stops you from doing it, it is not encouraged as it affects throughput and realtime processing.
> > >>>>
> > >>>> {quote}
> > >>>> It seems that Samza's design suits "data transformation" scenarios very well; what is not clear is how well it can support external services.
> > >>>> {quote}
> > >>>> We have some similar use-cases at LinkedIn where the Samza jobs need to query external data sources. We use a pattern where the job bootstraps the data from the source using a change-capture system like databus and buffers it locally, before processing from the input streams. Depending on the scale of your data, this model may or may not work for you. However, there is no built-in support for this in Samza.
> > >>>>
> > >>>> Thanks!
> > >>>> Navina
> > >>>>
> > >>>> On Sun, Sep 20, 2015 at 7:55 PM, Yan Fang <yanfang...@gmail.com> wrote:
> > >>>>
> > >>>>> Hi Michael,
> > >>>>>
> > >>>>> Samza is designed for high-throughput, realtime processing. If you call an HTTP request/external service, you may not get the same performance as without it. However, technically speaking, there is nothing blocking you from doing this (well, it is discouraged anyway :). Samza by default does not provide this feature, so you may need to be a little cautious when implementing it.
> > >>>>>
> > >>>>> Thanks,
> > >>>>>
> > >>>>> Fang, Yan
> > >>>>> yanfang...@gmail.com
> > >>>>>
> > >>>>> On Sun, Sep 20, 2015 at 4:28 PM, Michael Sklyar <mikesk...@gmail.com> wrote:
> > >>>>>
> > >>>>>> Hi,
> > >>>>>>
> > >>>>>> What would be the best approach for doing "blocking" operations in Samza?
> > >>>>>>
> > >>>>>> For example, we have a Kafka stream of URLs for which we need to gather external data via HTTP (such as the Alexa rank, the page title and headers). Other scenarios include database access and decision making via a rule engine.
> > >>>>>>
> > >>>>>> Samza processes messages in a single thread, and HTTP requests might take hundreds of milliseconds. With the single-threaded design the throughput would be very limited, which could be solved with an asynchronous approach. However, the Samza documentation explicitly states: "*You are strongly discouraged from using threads in your job’s code*".
> > >>>>>>
> > >>>>>> It seems that Samza's design suits "data transformation" scenarios very well; what is not clear is how well it can support external services.
> > >>>>>>
> > >>>>>> Thanks,
> > >>>>>> Michael Sklyar
> >
> >
> > --------------------------
> > Ken Krugler
> > +1 530-210-6378
> > http://www.scaleunlimited.com
> > custom big data solutions & training
> > Hadoop, Cascading, Cassandra & Solr
>
>
> --
> Jordan Shaw
> Full Stack Software Engineer
> PubNub Inc
> 1045 17th St
> San Francisco, CA 94107

--
Navina R.
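A minimal sketch of the bounded-concurrency pattern discussed in this thread: a worker pool makes the slow HTTP calls, a semaphore caps how many are in flight (the back-pressure Michael asks about), and finished results are handed back to the single-threaded event loop for emission. The pool size, config key, stream names, and the fetchTitle() helper are assumptions for illustration, not part of Samza or this thread.

    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Semaphore;
    import org.apache.samza.config.Config;
    import org.apache.samza.system.IncomingMessageEnvelope;
    import org.apache.samza.system.OutgoingMessageEnvelope;
    import org.apache.samza.system.SystemStream;
    import org.apache.samza.task.InitableTask;
    import org.apache.samza.task.MessageCollector;
    import org.apache.samza.task.StreamTask;
    import org.apache.samza.task.TaskContext;
    import org.apache.samza.task.TaskCoordinator;
    import org.apache.samza.task.WindowableTask;

    // Sketch only: pool size, config key, stream names, and fetchTitle() are placeholders.
    public class UrlFetchTask implements StreamTask, WindowableTask, InitableTask {
      private ExecutorService pool;
      private Semaphore inFlight; // back-pressure: caps concurrent HTTP requests
      private final ConcurrentLinkedQueue<String[]> completed = new ConcurrentLinkedQueue<String[]>();

      @Override
      public void init(Config config, TaskContext context) {
        int maxInFlight = config.getInt("task.max.concurrent.fetches", 50); // hypothetical key
        pool = Executors.newFixedThreadPool(maxInFlight);
        inFlight = new Semaphore(maxInFlight);
      }

      @Override
      public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
                          TaskCoordinator coordinator) {
        drain(collector);                  // emit whatever finished since the last call
        final String url = (String) envelope.getMessage();
        inFlight.acquireUninterruptibly(); // blocks the event loop once the limit is reached
        pool.submit(new Runnable() {
          public void run() {
            try {
              completed.add(new String[] { url, fetchTitle(url) }); // slow HTTP call, off the main loop
            } finally {
              inFlight.release();
            }
          }
        });
      }

      @Override
      public void window(MessageCollector collector, TaskCoordinator coordinator) {
        drain(collector);                  // flush periodically even if no new input arrives
      }

      // Only ever called from the task's single event-loop thread, so using the collector here is safe.
      private void drain(MessageCollector collector) {
        String[] result;
        while ((result = completed.poll()) != null) {
          collector.send(new OutgoingMessageEnvelope(
              new SystemStream("kafka", "url-titles"), result[0], result[1]));
        }
      }

      private String fetchTitle(String url) {
        return "";                         // placeholder for the real HTTP request
      }
    }

The worker threads never touch the MessageCollector; they only enqueue results, which process() and window() (enabled via task.window.ms) drain from the event-loop thread, so the single-threaded contract of the task stays intact.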