Given that I'm the original author of both the @Setup and @Teardown methods and the PR under discussion, I thought I'd drop in to give in a bit of history and my thoughts on the issue.
Originally (Dataflow 1.x), the spec required a Runner to deserialize a new instance of a DoFn for every Bundle. For runners with small bundles (such as the Dataflow Runner in streaming mode), this can be a significant cost, and enabling DoFn reuse enables significant performance increases, as Reuven mentioned. However, for IOs and other transforms which have to perform expensive execution-time initialization, DoFn reuse isn't enough by itself to enable effective resource reuse - instead, we introduced the @Setup and @Teardown methods, which exist to manage long-lived resources for a DoFn, such as external connections. However, these methods are bound to the lifetime of the DoFn, not to the lifetime of any specific elements. As a result, there are two families of methods on a DoFn: element-related and instance-related methods. @StartBundle, @ProcessElement, and @FinishBundle are all the prior - they are guaranteed to be called exactly once per input element, as observed by the Pipeline (exactly one _logical_ invocation, one or more physical invocations). @Setup and @Teardown are unrelated to this lifecycle, but because we need to be able to successfully call @Setup before performing any processing method, we can trivially guarantee that Setup will have been called before processing any elements. For any behavior that is required for the correct functioning of a pipeline, such as flushing buffered side effects (writes to an external system, etc), use of @Teardown is *never appropriate* - because it is unrelated to the processing of elements, if that buffer is lost for any reason, the elements that produced it will never be reprocessed by the Pipeline (it is permanently lost, which is not the case for anything performed in @FinishBundle). For long-lived resources, those will be bound to the life of the DoFn. In general the runner is required to tear down an instance if it is going to continue execution but discard that instance of the DoFn. However, that's not the only way a runner can discard of a DoFn - for example, it can reclaim the container the DoFn executes on, or kill the JVM and restart it, or just never discard it and use it forever, and be well-behaved. The additional documentation exists to make it obvious that performing anything based on elements within Teardown is extremely unsafe and is a good way to produce inconsistent results or lose data. I'll note as well that this is behavior that has always been the case - StartBundle, ProcessElement, and FinishBundle will always be called logically once per element, but Teardown was always called [0, 1] times per element, and this is an update only to the documentation and not to the actual behavior of any runner. On Fri, Feb 16, 2018 at 8:44 AM Reuven Lax <re...@google.com> wrote: > +1 I think @FinishBundle is the right thing to look at here. > > On Fri, Feb 16, 2018, 8:41 AM Jean-Baptiste Onofré <j...@nanthrax.net> > wrote: > >> Hi Romain >> >> Is it not @FinishBundle your solution ? >> >> Regards >> JB >> Le 16 févr. 2018, à 17:06, Romain Manni-Bucau <rmannibu...@gmail.com> a >> écrit: >>> >>> I see Reuven, so it is actually a broken contract for end users more >>> than a bug. Concretely a user must have a way to execute code once the >>> teardown is no more used and a teardown is populated by the user in the >>> context of an execution. >>> It means that if the environment wants to pool (cache) the instances it >>> must provide a postBorrowFromCache and preReturnToCache to let the user >>> handle that - we'll get back to EJB and passivation ;). >>> >>> Personally I think it is fine to cache the instances for the duration of >>> an execution but not accross execution. Concretely if you check out the API >>> it should just not be possible for a runner since the lifecycle is not >>> covered and the fact teardown can not be called today is an implementation >>> bug/leak surfacing the API. >>> >>> So I see 2 options: >>> >>> 1. make it mandatory and get rid of the caching - which shouldnt help >>> much in current state in terms of perf >>> 2. keep teardown a final release object (which is not that useful cause >>> of the end of the sentence) and add a clean cache lifecycle management >>> >>> tempted to say 1 is saner short terms, in particular cause beam is 2.x >>> and users already use it this way. >>> >>> >>> >>> Romain Manni-Bucau >>> @rmannibucau <https://twitter.com/rmannibucau> | Blog >>> <https://rmannibucau.metawerx.net/> | Old Blog >>> <http://rmannibucau.wordpress.com> | Github >>> <https://github.com/rmannibucau> | LinkedIn >>> <https://www.linkedin.com/in/rmannibucau> | Book >>> <https://www.packtpub.com/application-development/java-ee-8-high-performance> >>> >>> 2018-02-16 16:53 GMT+01:00 Reuven Lax <re...@google.com>: >>> >>>> So the concern is that @TearDown might not be called? >>>> >>>> Let's understand the reason for @TearDown. The runner is free to cache >>>> the DoFn object across many invocations, and indeed in streaming this is >>>> often a critical optimization. However if the runner does decide to destroy >>>> the DoFn object (e.g. because it's being evicted from cache), often users >>>> need a callback to tear down associated resources (file handles, RPC >>>> connections, etc.). >>>> >>>> Now @TearDown isn't guaranteed to be called for a simple reason: the >>>> runner might never tear down the DoFn object! The runner might well decide >>>> to cache the object forever, in which case there is never .a time to call >>>> @TearDown. There is no violation of semantics here. >>>> >>>> Also, the point about not calling teardown if the JVM crashes might >>>> well sound implicit with no need to mention it. However empirically users >>>> do misunderstand even this, so it's worth mentioning. >>>> >>>> Reuven >>>> >>>> On Fri, Feb 16, 2018 at 2:11 AM, Romain Manni-Bucau < >>>> rmannibu...@gmail.com> wrote: >>>> >>>>> Hi guys, >>>>> >>>>> I'm a bit concerned of this PR >>>>> https://github.com/apache/beam/pull/4637 >>>>> >>>>> I understand the intent but I'd like to share how I see it and why it >>>>> is an issue for me: >>>>> >>>>> 1. you can't help if the JVM crash in any case. Tomcat had a try to >>>>> preallocate some memory for instance to free it in case of OOME and try to >>>>> recover but it never prooved useful and got dropped recently. This is a >>>>> good example you can't do anything if there is a cataclism and therefore >>>>> any framework or lib will not be blamed for it >>>>> 2. if you expose an API, its behavior must be well defined. In the >>>>> case of a portable library like Beam it is even more important otherwise >>>>> it >>>>> leads users to not use the API or the projet :(. >>>>> >>>>> >>>>> These two points lead to say that if the JVM crashes it is ok to not >>>>> call teardown and it is even implicit in any programming environment so no >>>>> need to mention it. However that a runner doesn't call teardown is a bug >>>>> and not a feature or something intended because it can have a huge impact >>>>> on the user flow. >>>>> >>>>> The user workarounds are to use custom threads with timeouts to >>>>> execute the actions or things like that, all bad solutions to replace a >>>>> buggy API, if you remove the contract guarantee. >>>>> >>>>> To make it obvious: substring(from, to): will substring the current >>>>> string between from and to...or not. Would you use the function? >>>>> >>>>> What I ask is to add in the javadoc that the contract enforces the >>>>> runner to call that. Which means the core doesn't guarantee it but imposes >>>>> the runner to do so. This way the not portable behavior is where it >>>>> belongs >>>>> to, in the vendor specific code. It leads to a reliable API for the end >>>>> user and let runners document they don't respect - yet - the API when >>>>> relevant. >>>>> >>>>> wdyt? >>>>> >>>>> Romain Manni-Bucau >>>>> @rmannibucau <https://twitter.com/rmannibucau> | Blog >>>>> <https://rmannibucau.metawerx.net/> | Old Blog >>>>> <http://rmannibucau.wordpress.com> | Github >>>>> <https://github.com/rmannibucau> | LinkedIn >>>>> <https://www.linkedin.com/in/rmannibucau> | Book >>>>> <https://www.packtpub.com/application-development/java-ee-8-high-performance> >>>>> >>>> >>>> >>>