Assuming a Pipeline.run(), the corresponding sequence would be: WorkerStartFn(); WorkerEndFn();

So, a single instance of the fn for the full pipeline execution.
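For reference, here is a minimal sketch of the lifecycle hooks Beam's Java SDK actually exposes (the annotations are real Beam API; the class name and comments are illustrative). Note that @Setup/@Teardown are tied to a DoFn instance, which a runner may reuse across bundles, rather than to a pipeline execution as the WorkerStartFn()/WorkerEndFn() pair above would be:

    import org.apache.beam.sdk.transforms.DoFn;

    // Illustrative DoFn showing where each Beam lifecycle hook fires.
    public class LifecycleFn extends DoFn<String, String> {

      @Setup
      public void setup() {
        // Once per DoFn instance, before any bundle; instances may be
        // reused across bundles, so this is not once per execution.
      }

      @StartBundle
      public void startBundle() {
        // Before each bundle; bundle boundaries are runner-controlled.
      }

      @ProcessElement
      public void process(ProcessContext ctx) {
        ctx.output(ctx.element());
      }

      @FinishBundle
      public void finishBundle() {
        // After each bundle; if this fails, the bundle is retried.
      }

      @Teardown
      public void teardown() {
        // Best effort: an instance can be discarded without this running.
      }
    }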
On Feb 17, 2018 at 17:42, "Reuven Lax" <relax@google.com> wrote:

> "and a transform is by design bound to an execution"
>
> What do you mean by execution?
>
> On Sat, Feb 17, 2018 at 12:50 AM, Romain Manni-Bucau <rmannibucau@gmail.com> wrote:

>> On Feb 16, 2018 at 22:41, "Reuven Lax" <relax@google.com> wrote:

>> Kenn is correct. Allowing Fn reuse across bundles was a major, major performance improvement. Profiling on the old Dataflow SDKs consistently showed Java serialization being the number one performance bottleneck for streaming pipelines, and Beam fixed this.

>> Sorry, but this doesn't help me much to understand; let me try to explain. I read it as "we were slow somehow around serialization, so a quick fix was caching".

>> It is not to be picky, but I had plenty of super-fast remote-EJB-over-RMI setups. Java serialization is slower than alternative serializations, right, but that doesn't justify caching most of the time.

>> My main interrogation is: isn't Beam slow by design here, in the way it modelled the DoFn/transform, and therefore serializing way more than it requires? You never care to serialize the full transform, and in 95% of cases you can do a writeReplace which is light and fast compared to the default.

>> If so, the cache is an implementation workaround and not a fix.

>> Hope my view is clearer on it.

>> Romain - can you state precisely what you want? I do think there is still a gap - IMO there's a place for a longer-lived per-fn container; evidence for this is that people still often need to use statics to store things. However I'm not sure if this is what you're looking for.

>> Yes. I build a framework on top of Beam and must be able to provide a clear and reliable lifecycle. The bare minimum for any user is start-exec-stop, and a transform is by design bound to an execution (stream or batch).

>> Bundles are not an option because, as explained, they are bound not to the execution but to an uncontrolled subpart of it. You can see them as a Beam internal until the runners unify their definition; in any case a bundle is closer to a chunk notion than to a lifecycle one.

>> So setup and teardown must be symmetric.

>> Note that a DoFn instance owns a config, so it is bound to an execution.

>> This all leads to the need for a reliable teardown.

>> Caching can be neat but requires its own API, like the passivation one of EJBs.

>> Reuven
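To make the writeReplace point raised above concrete, here is a sketch of the plain Java serialization-proxy idiom it refers to: instead of serializing the full object graph, you serialize a small config-only proxy and rebuild the heavy object on deserialization. HeavyFn, ConfigProxy, and the fields are hypothetical names, not Beam API; writeReplace/readResolve are standard java.io serialization hooks:

    import java.io.Serializable;

    public class HeavyFn implements Serializable {
      private final String config;           // small, serializable state
      private transient Object heavyRuntime; // large, rebuildable state

      public HeavyFn(String config) {
        this.config = config;
      }

      // Called by Java serialization instead of writing HeavyFn itself.
      private Object writeReplace() {
        return new ConfigProxy(config);
      }

      private static class ConfigProxy implements Serializable {
        private final String config;

        ConfigProxy(String config) {
          this.config = config;
        }

        // Called on deserialization; rebuilds the real object cheaply.
        private Object readResolve() {
          return new HeavyFn(config);
        }
      }
    }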
>> On Fri, Feb 16, 2018 at 1:33 PM, Kenneth Knowles <klk@google.com> wrote:

>>> On Fri, Feb 16, 2018 at 1:00 PM, Romain Manni-Bucau <rmannibucau@gmail.com> wrote:

>>>> The serialization of the fn being once per bundle, the perf impact is only huge if there is a bug somewhere else; even Java serialization is negligible on a big config compared to any small pipeline (seconds vs minutes).

>>> Profiling is clear that this is a huge performance impact. One of the most important backwards-incompatible changes we made for Beam 2.0.0 was to allow Fn reuse across bundles.

>>> When we used a DoFn only for one bundle, there was no @Teardown because it had ~no use: you did everything in @FinishBundle. So for whatever use case you are working on, if your pipeline performs well enough doing it per bundle, you can put it in @FinishBundle. Of course it still might not get called, since guaranteeing that is a logical impossibility; you just know that a given element will be retried if @FinishBundle fails.

>>> If you have cleanup logic that absolutely must get executed, then you need to build a composite PTransform around it so it will be retried until cleanup succeeds. In Beam's sinks you can find many examples.

>>> Kenn
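To illustrate the composite-PTransform pattern Kenn describes, here is a sketch assuming a bounded input; WriteThenCleanup, WriteFn, and cleanup() are hypothetical names, and Beam's real sinks implement more elaborate variants of the same idea:

    import org.apache.beam.sdk.transforms.Count;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.PTransform;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.PCollection;

    // Composite transform: do the main work, then run cleanup downstream.
    public class WriteThenCleanup
        extends PTransform<PCollection<String>, PCollection<Long>> {

      @Override
      public PCollection<Long> expand(PCollection<String> input) {
        PCollection<String> written =
            input.apply("Write", ParDo.of(new WriteFn()));

        // Collapse the output to a single element so cleanup runs once,
        // after the write stage has produced all of its output.
        return written
            .apply("Signal", Count.globally())
            .apply("Cleanup", ParDo.of(new DoFn<Long, Long>() {
              @ProcessElement
              public void process(ProcessContext ctx) {
                cleanup(); // a failure here fails the bundle, so the
                           // runner retries until cleanup succeeds
                ctx.output(ctx.element());
              }
            }));
      }

      private static class WriteFn extends DoFn<String, String> {
        @ProcessElement
        public void process(ProcessContext ctx) {
          // ... perform the side effect for ctx.element() ...
          ctx.output(ctx.element());
        }
      }

      private static void cleanup() {
        // ... release the external resource here ...
      }
    }

Because the cleanup call sits in @ProcessElement, the runner's normal retry semantics provide the "retried until cleanup succeeds" behavior; the price is that cleanup() must be idempotent, since retries can run it more than once.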
