Re: Advice on parallelizing network calls in DoFn

Romain Manni-Bucau Sat, 10 Mar 2018 07:24:08 -0800

2018-03-10 16:19 GMT+01:00 Reuven Lax <[email protected]>:

> This is another version (maybe a better, Java 8 idiomatic one?) of what
> Kenn suggested.
>
> Note that with NewDoFn this need not be incompatible (so might not require
> waiting till Beam 3.0). We can recognize new parameters to processElement
> and populate add needed.
>


This is right however in my head it was a single way movemenent to enforce
the design to be reactive and not fake a reactive API with a sync and not
reactive impl which is what would be done today with both support I fear.


>
> On Sat, Mar 10, 2018, 12:13 PM Romain Manni-Bucau <[email protected]>
> wrote:
>
>> Yes, for the dofn for instance, instead of having
>> processcontext.element()=<T> you get a CompletionStage<T> and output gets
>> it as well.
>>
>> This way you register an execution chain. Mixed with streams you get a
>> big data java 8/9/10 API which enabkes any connectivity in a wel performing
>> manner ;).
>>
>> Le 10 mars 2018 13:56, "Reuven Lax" <[email protected]> a écrit :
>>
>>> So you mean the user should have a way of registering asynchronous
>>> activity with a callback (the callback must be registered with Beam,
>>> because Beam needs to know not to mark the element as done until all
>>> associated callbacks have completed). I think that's basically what Kenn
>>> was suggesting, unless I'm missing something.
>>>
>>>
>>> On Fri, Mar 9, 2018 at 11:07 PM Romain Manni-Bucau <
>>> [email protected]> wrote:
>>>
>>>> Yes, callback based. Beam today is synchronous and until
>>>> bundles+combines are reactive friendly, beam will be synchronous whatever
>>>> other parts do. Becoming reactive will enable to manage the threading
>>>> issues properly and to have better scalability on the overall execution
>>>> when remote IO are involved.
>>>>
>>>> However it requires to break source, sdf design to use completionstage
>>>> - or equivalent - to chain the processing properly and in an unified
>>>> fashion.
>>>>
>>>> Le 9 mars 2018 23:48, "Reuven Lax" <[email protected]> a écrit :
>>>>
>>>> If you're talking about reactive programming, at a certain level beam
>>>> is already reactive. Are you referring to a specific way of writing the
>>>> code?
>>>>
>>>>
>>>> On Fri, Mar 9, 2018 at 1:59 PM Reuven Lax <[email protected]> wrote:
>>>>
>>>>> What do you mean by reactive?
>>>>>
>>>>> On Fri, Mar 9, 2018, 6:58 PM Romain Manni-Bucau <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> @Kenn: why not preferring to make beam reactive? Would alow to scale
>>>>>> way more without having to hardly synchronize multithreading. Elegant and
>>>>>> efficient :). Beam 3?
>>>>>>
>>>>>>
>>>>>> Le 9 mars 2018 22:49, "Kenneth Knowles" <[email protected]> a écrit :
>>>>>>
>>>>>>> I will start with the "exciting futuristic" answer, which is that we
>>>>>>> envision the new DoFn to be able to provide an automatic ExecutorService
>>>>>>> parameters that you can use as you wish.
>>>>>>>
>>>>>>>     new DoFn<>() {
>>>>>>>       @ProcessElement
>>>>>>>       public void process(ProcessContext ctx, ExecutorService
>>>>>>> executorService) {
>>>>>>>           ... launch some futures, put them in instance vars ...
>>>>>>>       }
>>>>>>>
>>>>>>>       @FinishBundle
>>>>>>>       public void finish(...) {
>>>>>>>          ... block on futures, output results if appropriate ...
>>>>>>>       }
>>>>>>>     }
>>>>>>>
>>>>>>> This way, the Java SDK harness can use its overarching knowledge of
>>>>>>> what is going on in a computation to, for example, share a thread pool
>>>>>>> between different bits. This was one reason to delete
>>>>>>> IntraBundleParallelization - it didn't allow the runner and user code to
>>>>>>> properly manage how many things were going on concurrently. And mostly 
>>>>>>> the
>>>>>>> runner should own parallelizing to max out cores and what user code 
>>>>>>> needs
>>>>>>> is asynchrony hooks that can interact with that. However, this feature 
>>>>>>> is
>>>>>>> not thoroughly considered. TBD how much the harness itself manages 
>>>>>>> blocking
>>>>>>> on outstanding requests versus it being your responsibility in
>>>>>>> FinishBundle, etc.
>>>>>>>
>>>>>>> I haven't explored rolling your own here, if you are willing to do
>>>>>>> the knob tuning to get the threading acceptable for your particular use
>>>>>>> case. Perhaps someone else can weigh in.
>>>>>>>
>>>>>>> Kenn
>>>>>>>
>>>>>>> On Fri, Mar 9, 2018 at 1:38 PM Josh Ferge <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Hello all:
>>>>>>>>
>>>>>>>> Our team has a pipeline that make external network calls. These
>>>>>>>> pipelines are currently super slow, and the hypothesis is that they are
>>>>>>>> slow because we are not threading for our network calls. The github 
>>>>>>>> issue
>>>>>>>> below provides some discussion around this:
>>>>>>>>
>>>>>>>> https://github.com/apache/beam/pull/957
>>>>>>>>
>>>>>>>> In beam 1.0, there was IntraBundleParallelization, which helped
>>>>>>>> with this. However, this was removed because it didn't comply with a 
>>>>>>>> few
>>>>>>>> BEAM paradigms.
>>>>>>>>
>>>>>>>> Questions going forward:
>>>>>>>>
>>>>>>>> What is advised for jobs that make blocking network calls? It seems
>>>>>>>> bundling the elements into groups of size X prior to passing to the 
>>>>>>>> DoFn,
>>>>>>>> and managing the threading within the function might work. thoughts?
>>>>>>>> Are these types of jobs even suitable for beam?
>>>>>>>> Are there any plans to develop features that help with this?
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>
>>>>

Re: Advice on parallelizing network calls in DoFn

Reply via email to