Kenn is correct. Allowing Fn reuse across bundles was a major, major
performance improvement. Profiling on the old Dataflow SDKs consistently
showed Java serialization being the number one performance bottleneck for
streaming pipelines, and Beam fixed this.
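
To make the difference concrete, here is a minimal plain-Java model of the two approaches. It is only a sketch and not Beam or Dataflow runner code: `ToyFn`, `BundleModel`, and the 1024-entry config are hypothetical names and sizes picked for illustration. It counts how often a fn is rebuilt from its serialized bytes when each bundle gets a fresh instance versus when one instance is reused.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

/** Hypothetical stand-in for a user fn; this is not a Beam class. */
final class ToyFn implements Serializable {
    private static final long serialVersionUID = 1L;

    // Counts how many instances were rebuilt from bytes in this JVM.
    static int deserialized = 0;

    // Stands in for a large user configuration carried by the fn.
    private final int[] config = new int[1024];

    int process(int element) {
        return element + config.length;
    }

    // Java serialization hook, invoked once per deserialized instance.
    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        deserialized++;
    }
}

/** Hypothetical model of a worker running bundles; not runner code. */
final class BundleModel {
    static byte[] serialize(ToyFn fn) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(fn);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return bytes.toByteArray();
    }

    static ToyFn deserialize(byte[] payload) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(payload))) {
            return (ToyFn) in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Fresh fn per bundle: returns how many deserializations the run caused. */
    static int runFreshPerBundle(byte[] payload, int bundles) {
        int before = ToyFn.deserialized;
        for (int b = 0; b < bundles; b++) {
            ToyFn fn = deserialize(payload); // paid again for every bundle
            fn.process(b);
        }
        return ToyFn.deserialized - before;
    }

    /** One fn reused across bundles: returns how many deserializations the run caused. */
    static int runReused(byte[] payload, int bundles) {
        int before = ToyFn.deserialized;
        ToyFn fn = deserialize(payload); // paid once
        for (int b = 0; b < bundles; b++) {
            fn.process(b);
        }
        return ToyFn.deserialized - before;
    }
}
```

In this model the per-bundle variant deserializes once per bundle while the reused variant deserializes once in total. With the small bundles typical of streaming, that per-bundle cost is paid very often.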

Romain - can you state precisely what you want? I do think there is still a
gap - IMO there's a place for a longer-lived per-fn container; evidence for
this is that people still often need to use statics to store things.
However I'm not sure if this is what you're looking for.
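
For reference, the statics workaround I mean usually looks something like the sketch below. It is plain Java with no Beam dependency, and `SharedClient`, `acquire`, and `release` are hypothetical names, not an existing API: a per-JVM instance is kept in a static field and reference-counted so that it outlives any single fn instance.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical expensive resource (for example a connection pool); not a Beam class. */
final class SharedClient {
    // Counts how many times the expensive constructor actually ran.
    static final AtomicInteger CREATED = new AtomicInteger();

    private static SharedClient instance; // guarded by SharedClient.class
    private static int users;             // guarded by SharedClient.class

    private volatile boolean closed;

    private SharedClient() {
        CREATED.incrementAndGet();
    }

    /** Intended for a fn's setup hook: returns the shared instance, creating it on first use. */
    static synchronized SharedClient acquire() {
        if (instance == null) {
            instance = new SharedClient();
        }
        users++;
        return instance;
    }

    /** Intended for a fn's teardown hook: closes the instance when the last user releases it. */
    static synchronized void release() {
        if (users == 0) {
            return; // nothing acquired, nothing to release
        }
        users--;
        if (users == 0) {
            instance.closed = true;
            instance = null;
        }
    }

    boolean isClosed() {
        return closed;
    }
}
```

The weakness is visible in the sketch itself: if the teardown hook is never invoked for some fn instance, the count never reaches zero and the resource is never closed, which is why a longer-lived container with a defined lifecycle seems worth discussing.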

Reuven

On Fri, Feb 16, 2018 at 1:33 PM, Kenneth Knowles <k...@google.com> wrote:

> On Fri, Feb 16, 2018 at 1:00 PM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>>
>> Since the fn is serialized only once per bundle, the perf impact is only
>> huge if there is a bug somewhere else. Even Java serialization of a big
>> config is negligible compared to any small pipeline (seconds vs
>> minutes).
>>
>
> Profiling is clear that this is a huge performance impact. One of the most
> important backwards-incompatible changes we made for Beam 2.0.0 was to
> allow Fn reuse across bundles.
>
> When we used a DoFn for only one bundle, there was no @Teardown because it
> had ~no use: you did everything in @FinishBundle. So for whatever use case
> you are working on, if your pipeline performs well enough doing it per
> bundle, you can put it in @FinishBundle. Of course, it still might not get
> called, because guaranteeing the call is a logical impossibility. You just
> know that, for a given element, the element will be retried if
> @FinishBundle fails.
>
> If you have cleanup logic that absolutely must get executed, then you need
> to build a composite PTransform around it so it will be retried until
> cleanup succeeds. In Beam's sinks you can find many examples.
>
> Kenn
>
>
