Thank you for your replies. I understand that making a blocking external request in a single event thread will result in extremely low throughput. However, this can be solved with multithreading and/or an asynchronous approach. It is clear that a job using external services can never achieve the throughput of simple transformations. Still, most stream processing jobs need, from time to time, to query some external storage, web service, etc.
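
To make the multithreaded/asynchronous idea concrete, here is a minimal sketch in plain Java. It uses only java.util.concurrent and no Samza API, and every name in it (BoundedAsyncLookup, lookupAll, slowLookup) is made up for illustration. The slow external call is simulated with a sleep instead of a real HTTP request. It only shows blocking lookups running on a worker pool with a cap on in-flight requests; it says nothing about how such in-flight work would interact with Samza's checkpointing, which would need separate care.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.function.Function;

class BoundedAsyncLookup {

    // Runs a blocking lookup for every input on a worker pool, with at most
    // maxInFlight lookups outstanding, and returns the results in input order.
    static <I, O> List<O> lookupAll(List<I> inputs,
                                    Function<I, O> blockingLookup,
                                    int poolSize,
                                    int maxInFlight) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        Semaphore inFlight = new Semaphore(maxInFlight);
        List<CompletableFuture<O>> futures = new ArrayList<>();
        try {
            for (I input : inputs) {
                // Back-pressure: the submitting thread waits here once the cap is reached.
                inFlight.acquireUninterruptibly();
                futures.add(CompletableFuture
                        .supplyAsync(() -> blockingLookup.apply(input), pool)
                        .whenComplete((result, error) -> inFlight.release()));
            }
            List<O> results = new ArrayList<>();
            for (CompletableFuture<O> future : futures) {
                // join() rethrows a failed lookup as CompletionException.
                results.add(future.join());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    // Stand-in for a slow external call (hypothetical; no real request is made).
    static String slowLookup(String url) {
        try {
            Thread.sleep(100);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "title-of-" + url;
    }

    public static void main(String[] args) {
        List<String> urls = List.of("a.example", "b.example", "c.example", "d.example");
        long start = System.nanoTime();
        List<String> titles = lookupAll(urls, BoundedAsyncLookup::slowLookup, 4, 4);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(titles + " in about " + elapsedMs + " ms");
    }
}
```

With four workers, the four simulated 100 ms lookups overlap instead of running back to back, which is the throughput gain I have in mind. The semaphore keeps the number of outstanding requests bounded, so a slow external service stalls the submitter rather than piling up unbounded work.
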

Do you mean that in such a case Samza should be combined with another stream processing framework (such as Storm)?

Navina, "the job bootstraps the data from the source" - do you mean that you have a background process for this purpose, or do you just listen to an additional change-log stream from some other framework?

Thanks,
Michael

On Mon, Sep 21, 2015 at 6:52 AM, Navina Ramesh <nram...@linkedin.com.invalid> wrote:

> Hi Michael,
> I agree with what Yan said. While nothing stops you from doing it, it is
> not encouraged as it affects throughput and realtime processing.
>
> {quote}
> It seems that Samza design suits very well "data transformation" scenarios,
> what is not clear is how well can it support external services?
> {quote}
> We have some similar use-cases at LinkedIn where the Samza jobs need to
> query external data sources. We do use a pattern where the job
> bootstraps the data from the source using a change-capture system like
> databus and buffers it locally, before processing from input streams.
> Depending on the scale of your data, this model may or may not work for
> you. However, there is no in-built support for this in Samza.
>
> Thanks!
> Navina
>
> On Sun, Sep 20, 2015 at 7:55 PM, Yan Fang <yanfang...@gmail.com> wrote:
>
> > Hi Michael,
> >
> > Samza is designed for high-throughput and realtime processing. If you are
> > using an HTTP request/external service, you may not get the same
> > performance as without it. However, technically speaking, there is
> > nothing stopping you from doing this (well, discouraged anyway :). Samza by
> > default does not provide this feature, so you may want to be a little
> > cautious when implementing this.
> >
> > Thanks,
> >
> > Fang, Yan
> > yanfang...@gmail.com
> >
> > On Sun, Sep 20, 2015 at 4:28 PM, Michael Sklyar <mikesk...@gmail.com>
> > wrote:
> >
> > > Hi,
> > >
> > > What would be the best approach for doing "blocking" operations in
> > > Samza?
> > >
> > > For example, we have a kafka stream of urls for which we need to gather
> > > external data via HTTP (such as alexa rank, get the page title and
> > > headers..). Other scenarios include database access and decision making
> > > via a rule engine.
> > >
> > > Samza processes messages in a single thread, and HTTP requests might take
> > > hundreds of milliseconds. With the single threaded design the throughput
> > > would be very limited, which can be solved with an asynchronous
> > > approach. However Samza documentation explicitly states
> > > "*You are strongly discouraged from using threads in your job’s code*".
> > >
> > > It seems that Samza design suits very well "data transformation"
> > > scenarios, what is not clear is how well can it support external
> > > services?
> > >
> > > Thanks,
> > > Michael Sklyar
> > >
>
> --
> Navina R.