Yes, for the DoFn for instance: instead of ProcessContext.element() giving
you a plain <T>, you get a CompletionStage<T>, and output accepts one as
well.

This way you register an execution chain. Mixed with streams, you get a big
data Java 8/9/10 API which enables any connectivity in a well-performing
manner ;).
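
For illustration only, here is a rough sketch of that idea - Request, Response,
AsyncContext and callServiceAsync are hypothetical names, nothing like this
exists in Beam today:

    new DoFn<Request, Response>() {
      @ProcessElement
      public void process(AsyncContext ctx) {
        // element() hands you a CompletionStage<Request> instead of a plain Request
        CompletionStage<Response> result =
            ctx.element().thenCompose(req -> callServiceAsync(req));
        // output() accepts the stage; the runner would only consider the element
        // done once the stage completes
        ctx.output(result);
      }
    }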

On 10 March 2018 at 13:56, "Reuven Lax" <re...@google.com> wrote:

> So you mean the user should have a way of registering asynchronous
> activity with a callback (the callback must be registered with Beam,
> because Beam needs to know not to mark the element as done until all
> associated callbacks have completed). I think that's basically what Kenn
> was suggesting, unless I'm missing something.
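>
> To make that concrete, a purely illustrative shape - AsyncCallbacks and
> asyncClient are made-up names, no such hooks exist in Beam today:
>
>     @ProcessElement
>     public void process(ProcessContext ctx, AsyncCallbacks callbacks) {
>       // hypothetical: Beam keeps the element open until done.complete() is called
>       AsyncCallbacks.Handle done = callbacks.register();
>       asyncClient.call(ctx.element(), response -> {
>         // the harness would also need to make emitting from another thread safe
>         ctx.output(response);
>         done.complete();
>       });
>     }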
>
>
> On Fri, Mar 9, 2018 at 11:07 PM Romain Manni-Bucau <rmannibu...@gmail.com>
> wrote:
>
>> Yes, callback based. Beam today is synchronous, and until bundles + combines
>> are reactive friendly, Beam will stay synchronous whatever the other parts do.
>> Becoming reactive would make it possible to manage the threading issues
>> properly and to get better scalability for the overall execution when remote
>> IO is involved.
>>
>> However, it requires breaking the Source and SDF designs to use
>> CompletionStage - or an equivalent - to chain the processing properly and in
>> a unified fashion.
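>>
>> In plain Java 8 terms, the chaining meant here is just this (fetchAsync,
>> parse and emit are stand-ins for the IO, transform and output steps):
>>
>>     CompletionStage<byte[]> read = fetchAsync(source);       // non-blocking read
>>     CompletionStage<Record> parsed = read.thenApply(bytes -> parse(bytes));
>>     parsed.thenAccept(record -> emit(record));                // consumed when ready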
>>
>> On 9 March 2018 at 23:48, "Reuven Lax" <re...@google.com> wrote:
>>
>> If you're talking about reactive programming, at a certain level Beam is
>> already reactive. Are you referring to a specific way of writing the code?
>>
>>
>> On Fri, Mar 9, 2018 at 1:59 PM Reuven Lax <re...@google.com> wrote:
>>
>>> What do you mean by reactive?
>>>
>>> On Fri, Mar 9, 2018, 6:58 PM Romain Manni-Bucau <rmannibu...@gmail.com>
>>> wrote:
>>>
>>>> @Kenn: why not prefer to make Beam reactive? It would allow scaling
>>>> much further without having to heavily synchronize multithreaded code.
>>>> Elegant and efficient :). Beam 3?
>>>>
>>>>
>>>> On 9 March 2018 at 22:49, "Kenneth Knowles" <k...@google.com> wrote:
>>>>
>>>>> I will start with the "exciting futuristic" answer, which is that we
>>>>> envision the new DoFn being able to provide an ExecutorService
>>>>> parameter automatically that you can use as you wish.
>>>>>
>>>>>     new DoFn<>() {
>>>>>       @ProcessElement
>>>>>       public void process(
>>>>>           ProcessContext ctx, ExecutorService executorService) {
>>>>>           ... launch some futures, put them in instance vars ...
>>>>>       }
>>>>>
>>>>>       @FinishBundle
>>>>>       public void finish(...) {
>>>>>          ... block on futures, output results if appropriate ...
>>>>>       }
>>>>>     }
>>>>>
>>>>> This way, the Java SDK harness can use its overarching knowledge of
>>>>> what is going on in a computation to, for example, share a thread pool
>>>>> between different bits. This was one reason to delete
>>>>> IntraBundleParallelization - it didn't allow the runner and user code to
>>>>> properly manage how many things were going on concurrently. Mostly the
>>>>> runner should own parallelizing to max out the cores; what user code needs
>>>>> are asynchrony hooks that can interact with that. However, this feature is
>>>>> not thoroughly considered yet. TBD how much the harness itself manages
>>>>> blocking on outstanding requests versus that being your responsibility in
>>>>> FinishBundle, etc.
>>>>>
>>>>> I haven't explored rolling your own here, if you are willing to do the
>>>>> knob tuning to get the threading acceptable for your particular use case.
>>>>> Perhaps someone else can weigh in.
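>>>>>
>>>>> Purely as a sketch of that "roll your own" route - Request, Response and
>>>>> callService are placeholders, imports are omitted, and whether blocking in
>>>>> FinishBundle is acceptable depends on your runner:
>>>>>
>>>>>     class AsyncCallFn extends DoFn<Request, Response> {
>>>>>       private transient ExecutorService pool;
>>>>>       private transient List<Pending> pending;
>>>>>
>>>>>       // remembers where each in-flight result belongs
>>>>>       private static class Pending {
>>>>>         final Future<Response> result;
>>>>>         final Instant timestamp;
>>>>>         final BoundedWindow window;
>>>>>         Pending(Future<Response> r, Instant t, BoundedWindow w) {
>>>>>           result = r; timestamp = t; window = w;
>>>>>         }
>>>>>       }
>>>>>
>>>>>       @Setup
>>>>>       public void setup() {
>>>>>         pool = Executors.newFixedThreadPool(16); // the knob to tune
>>>>>       }
>>>>>
>>>>>       @StartBundle
>>>>>       public void startBundle() {
>>>>>         pending = new ArrayList<>();
>>>>>       }
>>>>>
>>>>>       @ProcessElement
>>>>>       public void process(ProcessContext ctx, BoundedWindow window) {
>>>>>         Request req = ctx.element();
>>>>>         // submit the blocking network call without waiting for it
>>>>>         pending.add(new Pending(
>>>>>             pool.submit(() -> callService(req)), ctx.timestamp(), window));
>>>>>       }
>>>>>
>>>>>       @FinishBundle
>>>>>       public void finish(FinishBundleContext ctx) throws Exception {
>>>>>         // block until every outstanding call is done before the bundle commits
>>>>>         for (Pending p : pending) {
>>>>>           ctx.output(p.result.get(), p.timestamp, p.window);
>>>>>         }
>>>>>       }
>>>>>
>>>>>       @Teardown
>>>>>       public void teardown() {
>>>>>         pool.shutdownNow();
>>>>>       }
>>>>>     }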
>>>>>
>>>>> Kenn
>>>>>
>>>>> On Fri, Mar 9, 2018 at 1:38 PM Josh Ferge <
>>>>> josh.fe...@bounceexchange.com> wrote:
>>>>>
>>>>>> Hello all:
>>>>>>
>>>>>> Our team has pipelines that make external network calls. These
>>>>>> pipelines are currently super slow, and the hypothesis is that they are
>>>>>> slow because we are not threading our network calls. The GitHub PR
>>>>>> below provides some discussion around this:
>>>>>>
>>>>>> https://github.com/apache/beam/pull/957
>>>>>>
>>>>>> In Beam 1.0, there was IntraBundleParallelization, which helped with
>>>>>> this. However, it was removed because it didn't comply with a few Beam
>>>>>> paradigms.
>>>>>>
>>>>>> Questions going forward:
>>>>>>
>>>>>> What is advised for jobs that make blocking network calls? It seems
>>>>>> bundling the elements into groups of size X prior to passing them to the DoFn,
>>>>>> and managing the threading within the function, might work. Thoughts?
>>>>>> Are these types of jobs even suitable for Beam?
>>>>>> Are there any plans to develop features that help with this?
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>
>>
