Hi Michael (& Navina),

I don't think you need to create a separate background process, at least for 
the case of web crawling.

The challenge is to efficiently use one Samza process to simultaneously fetch 
many URLs.

That does increase the complexity of that process's code, as you wind up 
having to manage either multi-threaded or async fetch state.

But that's the same as for Hadoop-based crawlers, where you have a limited 
number of parallel reduce tasks that are doing the fetching - see Nutch and 
Bixo for examples, e.g. FetchBuffer.

And it's the same for storm-crawler, another project I've been involved with in 
the past.
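
To make the "multi-threaded fetch state" point a bit more concrete, here is a minimal sketch of the idea. It is written in Python only to keep it short, and it does not use any Samza API. `BoundedFetcher`, `fake_fetch`, `demo` and the example.com URLs are hypothetical names made up for illustration; a real task would plug in an actual HTTP client. The semaphore is what gives you back-pressure: the single event-loop thread blocks in `submit()` once the maximum number of requests is in flight.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class BoundedFetcher:
    """Runs fetches on worker threads, never more than max_in_flight at once."""

    def __init__(self, fetch_fn, max_in_flight):
        self._fetch_fn = fetch_fn  # hypothetical: any callable taking a URL
        self._slots = threading.Semaphore(max_in_flight)
        self._pool = ThreadPoolExecutor(max_workers=max_in_flight)
        self._lock = threading.Lock()
        self._results = []

    def submit(self, url):
        # Blocks the calling (event loop) thread while every slot is busy.
        # This is the back-pressure.
        self._slots.acquire()
        try:
            future = self._pool.submit(self._fetch_fn, url)
        except BaseException:
            self._slots.release()
            raise
        future.add_done_callback(lambda f, u=url: self._on_done(u, f))

    def _on_done(self, url, future):
        # Record (url, result, error) and free the slot, even on failure.
        try:
            error = future.exception()
            result = None if error else future.result()
            with self._lock:
                self._results.append((url, result, error))
        finally:
            self._slots.release()

    def drain(self):
        # Hand completed fetches back to the main loop and clear the buffer.
        with self._lock:
            done, self._results = self._results, []
        return done

    def close(self):
        # Wait for all outstanding fetches to finish.
        self._pool.shutdown(wait=True)


def demo():
    # Stand-in for real HTTP fetching; it also tracks peak concurrency.
    in_flight = 0
    peak = 0
    gauge = threading.Lock()

    def fake_fetch(url):
        nonlocal in_flight, peak
        with gauge:
            in_flight += 1
            peak = max(peak, in_flight)
        time.sleep(0.01)  # pretend network latency
        with gauge:
            in_flight -= 1
        return "title of " + url

    fetcher = BoundedFetcher(fake_fetch, max_in_flight=4)
    for i in range(20):
        fetcher.submit("http://example.com/%d" % i)
    fetcher.close()
    return peak, fetcher.drain()
```

The main loop would call `submit()` for each incoming URL and `drain()` whenever it wants to emit whatever has completed so far. An async version would keep the same shape, with the semaphore and result buffer living inside an event loop instead of a thread pool.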

-- Ken

> From: Michael Sklyar
> Sent: September 21, 2015 5:19:52am PDT
> To: dev@samza.apache.org
> Subject: Re: Asynchronous approach and samza
> 
> Thanks Navina,
> it is much more clear now.
> 
> Unfortunately, in our case, we cannot bootstrap the data in advance (we
> can't pre-fetch all existing URLs' titles and headers).
> It sounds to me that, if we want to use Samza, we will need a background
> process that will be synchronized with the main event loop of the task
> (plus handle back-pressure, so that no more than X requests can be made
> simultaneously).
> 
> 
> Regards,
> Michael
> 
> On Mon, Sep 21, 2015 at 12:24 PM, Navina Ramesh <
> nram...@linkedin.com.invalid> wrote:
> 
>> Hi Michael,
>> {quote}
>> Do you mean that in such a case Samza should be combined with another
>> Stream processing framework (such as Storm)?
>> {quote}
>> No. I didn't mean combining it with any other framework.
>> 
>> {quote}
>> "the job bootstraps the data from the source" - do you mean that
>> you have a background process for this purpose or just listen to an
>> additional stream of change log from some other framework?
>> {quote}
>> I didn't mean a background process. I meant just listening from a stream of
>> change log from a data source.
>> 
>> At LinkedIn, we use databus. The jobs will configure databus (for a given
>> data source) as one of the input streams for the job. Databus is a
>> source-agnostic distributed change data capture system. You can find more
>> information here <https://github.com/linkedin/databus>. The advantage is
>> that the databus client is capable of "bootstrapping" from the source
>> automatically and then, switching to simply capture changes from the data
>> source. In this scenario, Samza doesn't do anything special, except that it
>> will continue consuming from the databus stream while bootstrapping. Once
>> bootstrap is complete, the job can start processing events from other input
>> streams as well.
>> 
>> I hope my explanation clarifies your question. :)
>> 
>> Thanks!
>> Navina
>> 
>> 
>> On Mon, Sep 21, 2015 at 1:56 AM, Michael Sklyar <mikesk...@gmail.com>
>> wrote:
>> 
>>> Thank you for your replies,
>>> 
>>> I understand that making an external blocking request in a single event
>>> thread will result in extremely low throughput. However, this can be
>>> solved by multi-threading and/or an asynchronous approach. It is clear
>>> that in any case using external services can never achieve the throughput
>>> of simple transformations. However, most stream processing jobs need, from
>>> time to time, to query some external storage, web service, etc.
>>> 
>>> Do you mean that in such a case Samza should be combined with another
>>> Stream processing framework (such as Storm)?
>>> 
>>> Navina, "the job bootstraps the data from the source" - do you mean that
>>> you have a background process for this purpose or just listen to an
>>> additional stream of change log from some other framework?
>>> 
>>> Thanks,
>>> Michael
>>> 
>>> On Mon, Sep 21, 2015 at 6:52 AM, Navina Ramesh
>>> <nram...@linkedin.com.invalid> wrote:
>>> 
>>>> Hi Michael,
>>>> I agree with what Yan said. While nothing stops you from doing it, it
>>>> is not encouraged, as it affects throughput and realtime processing.
>>>> 
>>>> {quote}
>>>> It seems that Samza design suits very well "data transformation"
>>>> scenarios, what is not clear is how well can it support external services?
>>>> {quote}
>>>> We have some similar use-cases at LinkedIn where the Samza jobs need to
>>>> query external data sources. We use a pattern where the job
>>>> bootstraps the data from the source using a change-capture system like
>>>> databus and buffers it locally before processing from input streams.
>>>> Depending on the scale of your data, this model may or may not work for
>>>> you. However, there is no built-in support for this in Samza.
>>>> 
>>>> Thanks!
>>>> Navina
>>>> 
>>>> On Sun, Sep 20, 2015 at 7:55 PM, Yan Fang <yanfang...@gmail.com> wrote:
>>>> 
>>>>> Hi Michael,
>>>>> 
>>>>> Samza is designed for high-throughput and realtime processing. If you
>>>>> are using HTTP requests or an external service, you may not get the
>>>>> same performance as without them. However, technically speaking, there
>>>>> is nothing stopping you from doing this (well, discouraged anyway :).
>>>>> Samza by default does not provide this feature, so you may want to be a
>>>>> little cautious when implementing this.
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> Fang, Yan
>>>>> yanfang...@gmail.com
>>>>> 
>>>>> On Sun, Sep 20, 2015 at 4:28 PM, Michael Sklyar <mikesk...@gmail.com>
>>>>> wrote:
>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> What would be the best approach for doing "blocking" operations in
>>>>>> Samza?
>>>>>> 
>>>>>> For example, we have a Kafka stream of URLs for which we need to
>>>>>> gather external data via HTTP (such as Alexa rank, the page title and
>>>>>> headers...). Other scenarios include database access and decision
>>>>>> making via a rule engine.
>>>>>> 
>>>>>> Samza processes messages in a single thread, and HTTP requests might
>>>>>> take hundreds of milliseconds. With the single-threaded design the
>>>>>> throughput would be very limited, which can be solved with an
>>>>>> asynchronous approach. However, the Samza documentation explicitly
>>>>>> states "*You are strongly discouraged from using threads in your job’s
>>>>>> code*".
>>>>>> 
>>>>>> It seems that Samza design suits very well "data transformation"
>>>>>> scenarios, what is not clear is how well can it support external
>>>>>> services?
>>>>>> 
>>>>>> Thanks,
>>>>>> Michael Sklyar





--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr




