@Ken: I was going to suggest batch processing with Samza, which is pretty
much what you just said. Thanks for your valuable input. :)

@Michael: I think the pattern I suggested will not work out at your data
scale. Following a batch processing model with Samza can fulfill the
requirements of your use case.

Cheers!
Navina

On Mon, Sep 21, 2015 at 10:28 PM, Jordan Shaw <jor...@pubnub.com> wrote:

> Michael,
> Why not just have a pool of workers outside of Samza that push the raw
> crawler input, or a subset of it, into a Kafka topic, and then have Samza do
> the compute/stream work? Basically, Samza is not the right tool for what
> you're suggesting, but it could be used for the downstream work, in my opinion.
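>
> Something like this rough sketch of a worker (the topic name, broker address
> and URL source are just placeholders) - it fetches a page and produces the
> raw result into a Kafka topic for Samza to consume downstream:
>
> import java.net.URL;
> import java.util.Properties;
> import java.util.Scanner;
> import org.apache.kafka.clients.producer.KafkaProducer;
> import org.apache.kafka.clients.producer.ProducerRecord;
>
> public class CrawlWorker {
>   public static void main(String[] args) throws Exception {
>     Properties props = new Properties();
>     props.put("bootstrap.servers", "localhost:9092");
>     props.put("key.serializer",
>         "org.apache.kafka.common.serialization.StringSerializer");
>     props.put("value.serializer",
>         "org.apache.kafka.common.serialization.StringSerializer");
>     KafkaProducer<String, String> producer = new KafkaProducer<String, String>(props);
>
>     // args = URLs handed to this worker, e.g. by a simple work queue
>     for (String url : args) {
>       String body = new Scanner(new URL(url).openStream(), "UTF-8")
>           .useDelimiter("\\A").next();
>       // raw page keyed by URL; the Samza job consumes this topic downstream
>       producer.send(new ProducerRecord<String, String>("raw-crawl-input", url, body));
>     }
>     producer.close();
>   }
> }
>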
> -Jordan
>
> On Mon, Sep 21, 2015 at 9:08 AM, Ken Krugler <kkrugler_li...@transpac.com>
> wrote:
>
> > Hi Michael (& Navina),
> >
> > I don't think you need to create a separate background process, at least
> > for the case of web crawling.
> >
> > The challenge is to efficiently use one Samza process to simultaneously
> > fetch many URLs.
> >
> > That does increase the complexity of that process's code, as you wind up
> > having to manage either multi-threaded or asynchronous fetch state.
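> >
> > Here's a rough, purely illustrative sketch of that pattern in a StreamTask
> > (the stream names, the pool size and the fetch() helper are invented): URLs
> > are handed to a bounded thread pool, and completed fetches are emitted from
> > process(), since the collector shouldn't be used from the worker threads.
> >
> > import java.util.Queue;
> > import java.util.concurrent.*;
> > import org.apache.samza.config.Config;
> > import org.apache.samza.system.*;
> > import org.apache.samza.task.*;
> >
> > public class ParallelFetchTask implements StreamTask, InitableTask {
> >   private ExecutorService pool;
> >   private final Queue<Future<String>> inFlight = new ConcurrentLinkedQueue<Future<String>>();
> >
> >   public void init(Config config, TaskContext context) {
> >     pool = Executors.newFixedThreadPool(50);  // at most 50 fetches running at once
> >   }
> >
> >   public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
> >       TaskCoordinator coordinator) throws Exception {
> >     final String url = (String) envelope.getMessage();
> >     inFlight.add(pool.submit(new Callable<String>() {
> >       public String call() throws Exception {
> >         return fetch(url);  // blocking HTTP GET, runs on a pool thread
> >       }
> >     }));
> >     // Drain finished fetches here, on the task thread, and emit them.
> >     Future<String> head;
> >     while ((head = inFlight.peek()) != null && head.isDone()) {
> >       inFlight.poll();
> >       collector.send(new OutgoingMessageEnvelope(
> >           new SystemStream("kafka", "fetched-pages"), head.get()));
> >     }
> >   }
> >
> >   private String fetch(String url) throws Exception {
> >     return "";  // placeholder - the real HTTP client call goes here
> >   }
> > }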
> >
> > But that's the same as for Hadoop-based crawlers, where you have a limited
> > number of parallel reduce tasks that are doing the fetching - see Nutch and
> > Bixo for examples, e.g. FetchBuffer.
> >
> > And it's the same for storm-crawler, another project I've been involved
> > with in the past.
> >
> > -- Ken
> >
> > > From: Michael Sklyar
> > > Sent: September 21, 2015 5:19:52am PDT
> > > To: dev@samza.apache.org
> > > Subject: Re: Asynchronous approach and samza
> > >
> > > Thanks Navina,
> > > it is much clearer now.
> > >
> > > Unfortunately, in our case, we cannot bootstrap the data in advance (we
> > > can't pre-fetch the titles and headers of all existing URLs).
> > > It sounds to me like, if we want to use Samza, we will need a background
> > > process that is synchronized with the main event loop of the task
> > > (and handles back-pressure, so that no more than X requests are made
> > > simultaneously).
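> > >
> > > Roughly, I imagine the back-pressure part looking something like this
> > > (names invented) - a semaphore caps the outstanding requests at X and
> > > blocks the event loop once the limit is hit:
> > >
> > > import java.util.concurrent.ExecutorService;
> > > import java.util.concurrent.Executors;
> > > import java.util.concurrent.Semaphore;
> > >
> > > class BoundedRequester {
> > >   private static final int MAX_IN_FLIGHT = 20;  // the "X" above
> > >   private final Semaphore permits = new Semaphore(MAX_IN_FLIGHT);
> > >   private final ExecutorService background = Executors.newCachedThreadPool();
> > >
> > >   void submit(final String url) throws InterruptedException {
> > >     permits.acquire();  // called from the event loop; blocks when X requests are in flight
> > >     background.submit(new Runnable() {
> > >       public void run() {
> > >         try {
> > >           // blocking HTTP fetch of `url` goes here
> > >         } finally {
> > >           permits.release();  // free the slot when the request finishes
> > >         }
> > >       }
> > >     });
> > >   }
> > > }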
> > >
> > >
> > > Regards,
> > > Michael
> > >
> > > On Mon, Sep 21, 2015 at 12:24 PM, Navina Ramesh <
> > > nram...@linkedin.com.invalid> wrote:
> > >
> > >> Hi Michael,
> > >> {quote}
> > >> Do you mean that in such a case Samza should be combined with another
> > >> Stream processing framework (such as Storm)?
> > >> {quote}
> > >> No. I didn't mean combining it with any other framework.
> > >>
> > >> {quote}
> > >> "the job bootstraps the data from the source" - do you mean that
> > >> you have a background process for this purpose or just listen to an
> > >> additional stream of change log from some other framework?
> > >> {quote}
> > >> I didn't mean a background process. I meant just listening to a stream of
> > >> change log from a data source.
> > >>
> > >> At LinkedIn, we use databus. The jobs will configure databus (for a given
> > >> data source) as one of the input streams for the job. Databus is a
> > >> source-agnostic distributed change data capture system. You can find more
> > >> information here <https://github.com/linkedin/databus>. The advantage is
> > >> that the databus client is capable of "bootstrapping" from the source
> > >> automatically and then switching to simply capturing changes from the data
> > >> source. In this scenario, Samza doesn't do anything special, except that it
> > >> will continue consuming from the databus stream while bootstrapping. Once
> > >> the bootstrap is complete, the job can start processing events from other
> > >> input streams as well.
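> > >>
> > >> As a minimal, purely illustrative sketch (the stream and store names are
> > >> invented, and I'm assuming the change-capture stream is flagged as a
> > >> bootstrap stream in the job config so it is read to its head first), the
> > >> task just caches the change stream locally and looks it up when processing
> > >> the other inputs:
> > >>
> > >> import org.apache.samza.config.Config;
> > >> import org.apache.samza.storage.kv.KeyValueStore;
> > >> import org.apache.samza.system.*;
> > >> import org.apache.samza.task.*;
> > >>
> > >> public class BootstrapJoinTask implements StreamTask, InitableTask {
> > >>   private KeyValueStore<String, String> cache;
> > >>
> > >>   @SuppressWarnings("unchecked")
> > >>   public void init(Config config, TaskContext context) {
> > >>     cache = (KeyValueStore<String, String>) context.getStore("profile-cache");
> > >>   }
> > >>
> > >>   public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
> > >>       TaskCoordinator coordinator) {
> > >>     String stream = envelope.getSystemStreamPartition().getStream();
> > >>     if ("profile-changes".equals(stream)) {
> > >>       // change-capture (databus-like) input: keep the latest value locally
> > >>       cache.put((String) envelope.getKey(), (String) envelope.getMessage());
> > >>     } else {
> > >>       // regular event input: enrich with whatever was bootstrapped/cached
> > >>       String enriched = envelope.getMessage() + "|" + cache.get((String) envelope.getKey());
> > >>       collector.send(new OutgoingMessageEnvelope(
> > >>           new SystemStream("kafka", "enriched-events"), enriched));
> > >>     }
> > >>   }
> > >> }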
> > >>
> > >> I hope my explanation clarifies your question. :)
> > >>
> > >> Thanks!
> > >> Navina
> > >>
> > >>
> > >> On Mon, Sep 21, 2015 at 1:56 AM, Michael Sklyar <mikesk...@gmail.com>
> > >> wrote:
> > >>
> > >>> Thank you for your replies,
> > >>>
> > >>> I understand that making an external blocking request in a single event
> > >>> thread will result in extremely low throughput. However, this can be
> > >>> solved by multi-threading and/or an asynchronous approach. It is clear that
> > >>> in any case using external services can never achieve the throughput of
> > >>> simple transformations. However, most stream processing jobs need, from
> > >>> time to time, to query some external storage, web service, etc.
> > >>>
> > >>> Do you mean that in such a case Samza should be combined with another
> > >>> Stream processing framework (such as Storm)?
> > >>>
> > >>> Navina, "the job bootstraps the data from the source" - do you mean
> > that
> > >>> you have a background process for this purpose or just listen to an
> > >>> additional stream of change log from some other framework?
> > >>>
> > >>> Thanks,
> > >>> Michael
> > >>>
> > >>> On Mon, Sep 21, 2015 at 6:52 AM, Navina Ramesh
> > >>> <nram...@linkedin.com.invalid
> > >>>> wrote:
> > >>>
> > >>>> Hi Michael,
> > >>>> I agree with what Yan said. While nothing stops you from doing it, it is
> > >>>> not encouraged, as it affects throughput and realtime processing.
> > >>>>
> > >>>> {quote}
> > >>>> It seems that Samza's design suits "data transformation" scenarios very
> > >>>> well; what is not clear is how well it can support external services?
> > >>>> {quote}
> > >>>> We have some similar use cases at LinkedIn where the Samza jobs need to
> > >>>> query external data sources. We do use a pattern where the job bootstraps
> > >>>> the data from the source using a change-capture system like databus and
> > >>>> buffers it locally, before processing from the input streams. Depending on
> > >>>> the scale of your data, this model may or may not work for you. However,
> > >>>> there is no built-in support for this in Samza.
> > >>>>
> > >>>> Thanks!
> > >>>> Navina
> > >>>>
> > >>>> On Sun, Sep 20, 2015 at 7:55 PM, Yan Fang <yanfang...@gmail.com>
> > >> wrote:
> > >>>>
> > >>>>> Hi Michael,
> > >>>>>
> > >>>>> Samza is designed for high-throughput and realtime processing. If you are
> > >>>>> using an HTTP request/external service, you may not get the same
> > >>>>> performance as when not using it. However, technically speaking, there is
> > >>>>> nothing blocking you from doing this (well, it is discouraged anyway :).
> > >>>>> Samza by default does not provide this feature, so you may want to be a
> > >>>>> little cautious when implementing this.
> > >>>>>
> > >>>>> Thanks,
> > >>>>>
> > >>>>> Fang, Yan
> > >>>>> yanfang...@gmail.com
> > >>>>>
> > >>>>> On Sun, Sep 20, 2015 at 4:28 PM, Michael Sklyar <
> mikesk...@gmail.com
> > >>>
> > >>>>> wrote:
> > >>>>>
> > >>>>>> Hi,
> > >>>>>>
> > >>>>>> What would be the best approach for doing "blocking" operations in
> > >>>> Samza?
> > >>>>>>
> > >>>>>> For example, we have a Kafka stream of URLs for which we need to gather
> > >>>>>> external data via HTTP (such as the Alexa rank, the page title and
> > >>>>>> headers, ...). Other scenarios include database access and decision making
> > >>>>>> via a rule engine.
> > >>>>>>
> > >>>>>> Samza processes messages in a single thread, and HTTP requests might take
> > >>>>>> hundreds of milliseconds. With the single-threaded design the throughput
> > >>>>>> would be very limited, which can be solved with an asynchronous approach.
> > >>>>>> However, the Samza documentation explicitly states:
> > >>>>>> "*You are strongly discouraged from using threads in your job's code*".
> > >>>>>>
> > >>>>>> It seems that Samza's design suits "data transformation" scenarios very
> > >>>>>> well; what is not clear is how well it can support external services?
> > >>>>>>
> > >>>>>> Thanks,
> > >>>>>> Michael Sklyar
> >
> >
> >
> >
> >
> > --------------------------
> > Ken Krugler
> > +1 530-210-6378
> > http://www.scaleunlimited.com
> > custom big data solutions & training
> > Hadoop, Cascading, Cassandra & Solr
> >
> >
> >
> >
> >
> >
>
>
> --
> Jordan Shaw
> Full Stack Software Engineer
> PubNub Inc
> 1045 17th St
> San Francisco, CA 94107
>



-- 
Navina R.
