Re: Advice on parallelizing network calls in DoFn

Romain Manni-Bucau Fri, 09 Mar 2018 23:07:26 -0800

Yes, callback based. Beam today is synchronous and until bundles+combines
are reactive friendly, beam will be synchronous whatever other parts do.
Becoming reactive will enable to manage the threading issues properly and
to have better scalability on the overall execution when remote IO are
involved.


However it requires to break source, sdf design to use completionstage - or
equivalent - to chain the processing properly and in an unified fashion.

Le 9 mars 2018 23:48, "Reuven Lax" <re...@google.com> a écrit :

If you're talking about reactive programming, at a certain level beam is
already reactive. Are you referring to a specific way of writing the code?


On Fri, Mar 9, 2018 at 1:59 PM Reuven Lax <re...@google.com> wrote:

> What do you mean by reactive?
>
> On Fri, Mar 9, 2018, 6:58 PM Romain Manni-Bucau <rmannibu...@gmail.com>
> wrote:
>
>> @Kenn: why not preferring to make beam reactive? Would alow to scale way
>> more without having to hardly synchronize multithreading. Elegant and
>> efficient :). Beam 3?
>>
>>
>> Le 9 mars 2018 22:49, "Kenneth Knowles" <k...@google.com> a écrit :
>>
>>> I will start with the "exciting futuristic" answer, which is that we
>>> envision the new DoFn to be able to provide an automatic ExecutorService
>>> parameters that you can use as you wish.
>>>
>>>     new DoFn<>() {
>>>       @ProcessElement
>>>       public void process(ProcessContext ctx, ExecutorService
>>> executorService) {
>>>           ... launch some futures, put them in instance vars ...
>>>       }
>>>
>>>       @FinishBundle
>>>       public void finish(...) {
>>>          ... block on futures, output results if appropriate ...
>>>       }
>>>     }
>>>
>>> This way, the Java SDK harness can use its overarching knowledge of what
>>> is going on in a computation to, for example, share a thread pool between
>>> different bits. This was one reason to delete IntraBundleParallelization -
>>> it didn't allow the runner and user code to properly manage how many things
>>> were going on concurrently. And mostly the runner should own parallelizing
>>> to max out cores and what user code needs is asynchrony hooks that can
>>> interact with that. However, this feature is not thoroughly considered. TBD
>>> how much the harness itself manages blocking on outstanding requests versus
>>> it being your responsibility in FinishBundle, etc.
>>>
>>> I haven't explored rolling your own here, if you are willing to do the
>>> knob tuning to get the threading acceptable for your particular use case.
>>> Perhaps someone else can weigh in.
>>>
>>> Kenn
>>>
>>> On Fri, Mar 9, 2018 at 1:38 PM Josh Ferge <josh.fe...@bounceexchange.com>
>>> wrote:
>>>
>>>> Hello all:
>>>>
>>>> Our team has a pipeline that make external network calls. These
>>>> pipelines are currently super slow, and the hypothesis is that they are
>>>> slow because we are not threading for our network calls. The github issue
>>>> below provides some discussion around this:
>>>>
>>>> https://github.com/apache/beam/pull/957
>>>>
>>>> In beam 1.0, there was IntraBundleParallelization, which helped with
>>>> this. However, this was removed because it didn't comply with a few BEAM
>>>> paradigms.
>>>>
>>>> Questions going forward:
>>>>
>>>> What is advised for jobs that make blocking network calls? It seems
>>>> bundling the elements into groups of size X prior to passing to the DoFn,
>>>> and managing the threading within the function might work. thoughts?
>>>> Are these types of jobs even suitable for beam?
>>>> Are there any plans to develop features that help with this?
>>>>
>>>> Thanks
>>>>
>>>

Re: Advice on parallelizing network calls in DoFn

Reply via email to