2018-02-16 17:59 GMT+01:00 Kenneth Knowles <k...@google.com>: > What I am hearing is this: > > - @FinishBundle does what you want (a reliable "flush" call) but your > runner is not doing a good job of bundling >
Nop, finishbundle is defined but not a bundle. Typically for 1 million rows I'll get 1 million calls in flink and 1 call in spark (today) so this is not a way to call a final task to release dofn internal instances or do some one time auditing. > - @Teardown has well-defined semantics and they are not what you want > " Note that calls to the annotated method are best effort, and may not occur for arbitrary reasons" is not really "well-defined" and is also a breaking change compared to the < 2.3.x (x >= 1) . > So you are hoping for something that is called less frequently but is > still mandatory. > > Just trying to establish the basics to start over and get this on track to > solving the real problem. > Concretely I need a well defined lifecycle for any DoFn executed in beam and today there is no such a thing making it impossible to develop correctly transforms/fn on an user side. > > Kenn > > > On Fri, Feb 16, 2018 at 8:51 AM, Romain Manni-Bucau <rmannibu...@gmail.com > > wrote: > >> finish bundle is well defined and must be called, right, not at the end >> so you still miss teardown as a user. Bundles are defined by the runner and >> you can have 100000 bundles per batch (even more for a stream ;)) so you >> dont want to release your resources or handle you execution auditing in it, >> you want it at the end so in tear down. >> >> So yes we must have teardown reliable somehow. >> >> >> Romain Manni-Bucau >> @rmannibucau <https://twitter.com/rmannibucau> | Blog >> <https://rmannibucau.metawerx.net/> | Old Blog >> <http://rmannibucau.wordpress.com> | Github >> <https://github.com/rmannibucau> | LinkedIn >> <https://www.linkedin.com/in/rmannibucau> | Book >> <https://www.packtpub.com/application-development/java-ee-8-high-performance> >> >> 2018-02-16 17:43 GMT+01:00 Reuven Lax <re...@google.com>: >> >>> +1 I think @FinishBundle is the right thing to look at here. >>> >>> On Fri, Feb 16, 2018, 8:41 AM Jean-Baptiste Onofré <j...@nanthrax.net> >>> wrote: >>> >>>> Hi Romain >>>> >>>> Is it not @FinishBundle your solution ? >>>> >>>> Regards >>>> JB >>>> Le 16 févr. 2018, à 17:06, Romain Manni-Bucau <rmannibu...@gmail.com> >>>> a écrit: >>>>> >>>>> I see Reuven, so it is actually a broken contract for end users more >>>>> than a bug. Concretely a user must have a way to execute code once the >>>>> teardown is no more used and a teardown is populated by the user in the >>>>> context of an execution. >>>>> It means that if the environment wants to pool (cache) the instances >>>>> it must provide a postBorrowFromCache and preReturnToCache to let the user >>>>> handle that - we'll get back to EJB and passivation ;). >>>>> >>>>> Personally I think it is fine to cache the instances for the duration >>>>> of an execution but not accross execution. Concretely if you check out the >>>>> API it should just not be possible for a runner since the lifecycle is not >>>>> covered and the fact teardown can not be called today is an implementation >>>>> bug/leak surfacing the API. >>>>> >>>>> So I see 2 options: >>>>> >>>>> 1. make it mandatory and get rid of the caching - which shouldnt help >>>>> much in current state in terms of perf >>>>> 2. keep teardown a final release object (which is not that useful >>>>> cause of the end of the sentence) and add a clean cache lifecycle >>>>> management >>>>> >>>>> tempted to say 1 is saner short terms, in particular cause beam is 2.x >>>>> and users already use it this way. >>>>> >>>>> >>>>> >>>>> Romain Manni-Bucau >>>>> @rmannibucau <https://twitter.com/rmannibucau> | Blog >>>>> <https://rmannibucau.metawerx.net/> | Old Blog >>>>> <http://rmannibucau.wordpress.com> | Github >>>>> <https://github.com/rmannibucau> | LinkedIn >>>>> <https://www.linkedin.com/in/rmannibucau> | Book >>>>> <https://www.packtpub.com/application-development/java-ee-8-high-performance> >>>>> >>>>> 2018-02-16 16:53 GMT+01:00 Reuven Lax <re...@google.com>: >>>>> >>>>>> So the concern is that @TearDown might not be called? >>>>>> >>>>>> Let's understand the reason for @TearDown. The runner is free to >>>>>> cache the DoFn object across many invocations, and indeed in streaming >>>>>> this >>>>>> is often a critical optimization. However if the runner does decide to >>>>>> destroy the DoFn object (e.g. because it's being evicted from cache), >>>>>> often >>>>>> users need a callback to tear down associated resources (file handles, >>>>>> RPC >>>>>> connections, etc.). >>>>>> >>>>>> Now @TearDown isn't guaranteed to be called for a simple reason: the >>>>>> runner might never tear down the DoFn object! The runner might well >>>>>> decide >>>>>> to cache the object forever, in which case there is never .a time to call >>>>>> @TearDown. There is no violation of semantics here. >>>>>> >>>>>> Also, the point about not calling teardown if the JVM crashes might >>>>>> well sound implicit with no need to mention it. However empirically users >>>>>> do misunderstand even this, so it's worth mentioning. >>>>>> >>>>>> Reuven >>>>>> >>>>>> On Fri, Feb 16, 2018 at 2:11 AM, Romain Manni-Bucau < >>>>>> rmannibu...@gmail.com> wrote: >>>>>> >>>>>>> Hi guys, >>>>>>> >>>>>>> I'm a bit concerned of this PR https://github.com/apache/beam >>>>>>> /pull/4637 >>>>>>> >>>>>>> I understand the intent but I'd like to share how I see it and why >>>>>>> it is an issue for me: >>>>>>> >>>>>>> 1. you can't help if the JVM crash in any case. Tomcat had a try to >>>>>>> preallocate some memory for instance to free it in case of OOME and try >>>>>>> to >>>>>>> recover but it never prooved useful and got dropped recently. This is a >>>>>>> good example you can't do anything if there is a cataclism and therefore >>>>>>> any framework or lib will not be blamed for it >>>>>>> 2. if you expose an API, its behavior must be well defined. In the >>>>>>> case of a portable library like Beam it is even more important >>>>>>> otherwise it >>>>>>> leads users to not use the API or the projet :(. >>>>>>> >>>>>>> >>>>>>> These two points lead to say that if the JVM crashes it is ok to not >>>>>>> call teardown and it is even implicit in any programming environment so >>>>>>> no >>>>>>> need to mention it. However that a runner doesn't call teardown is a bug >>>>>>> and not a feature or something intended because it can have a huge >>>>>>> impact >>>>>>> on the user flow. >>>>>>> >>>>>>> The user workarounds are to use custom threads with timeouts to >>>>>>> execute the actions or things like that, all bad solutions to replace a >>>>>>> buggy API, if you remove the contract guarantee. >>>>>>> >>>>>>> To make it obvious: substring(from, to): will substring the current >>>>>>> string between from and to...or not. Would you use the function? >>>>>>> >>>>>>> What I ask is to add in the javadoc that the contract enforces the >>>>>>> runner to call that. Which means the core doesn't guarantee it but >>>>>>> imposes >>>>>>> the runner to do so. This way the not portable behavior is where it >>>>>>> belongs >>>>>>> to, in the vendor specific code. It leads to a reliable API for the end >>>>>>> user and let runners document they don't respect - yet - the API when >>>>>>> relevant. >>>>>>> >>>>>>> wdyt? >>>>>>> >>>>>>> Romain Manni-Bucau >>>>>>> @rmannibucau <https://twitter.com/rmannibucau> | Blog >>>>>>> <https://rmannibucau.metawerx.net/> | Old Blog >>>>>>> <http://rmannibucau.wordpress.com> | Github >>>>>>> <https://github.com/rmannibucau> | LinkedIn >>>>>>> <https://www.linkedin.com/in/rmannibucau> | Book >>>>>>> <https://www.packtpub.com/application-development/java-ee-8-high-performance> >>>>>>> >>>>>> >>>>>> >>>>> >> >