2018-02-18 19:19 GMT+01:00 Eugene Kirpichov <kirpic...@google.com>:

> FinishBundle has a stronger guarantee: if the pipeline succeeded, then it has been called for every succeeded bundle, and succeeded bundles together cover the entire input PCollection. Of course, it may not have been called for failed bundles.
> To anticipate a possible objection "why not also keep retrying Teardown until it succeeds" - because if Teardown wasn't called on a DoFn instance, it's because the instance no longer exists and there's nothing to call it on.
>
> Please take a look at implementations of WriteFiles and BigQueryIO.read() and write() to see how cleanup of heavyweight resources (large number of temp files, temporary BigQuery datasets) can be achieved reliably to the extent possible.
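A minimal sketch of the pattern Eugene points at (not the actual WriteFiles code, and assuming the default global window): per-bundle temp resources are finalized in @FinishBundle, which runs for every succeeded bundle, while @Teardown stays best-effort:

    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.windowing.GlobalWindow;
    import org.joda.time.Instant;

    class TempFileWriteFn extends DoFn<String, String> {
      private transient Path tempFile;
      private transient BufferedWriter writer;

      @StartBundle
      public void startBundle() throws IOException {
        tempFile = Files.createTempFile("bundle-", ".tmp");
        writer = Files.newBufferedWriter(tempFile);
      }

      @ProcessElement
      public void processElement(ProcessContext c) throws IOException {
        writer.write(c.element());
        writer.newLine();
      }

      @FinishBundle
      public void finishBundle(FinishBundleContext c) throws IOException {
        writer.close();
        // Runs for every succeeded bundle: hand the temp file name downstream
        // so a later step can move or delete it once the pipeline succeeds.
        c.output(tempFile.toString(), Instant.now(), GlobalWindow.INSTANCE);
      }

      @Teardown
      public void teardown() throws IOException {
        // Best effort only: may never run if the worker dies.
        if (writer != null) {
          writer.close();
        }
      }
    }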
Do you mean passing state across the fns and having a fn responsible for the cleanup? Kind of making the teardown a processElement? This is a nice workaround, but it is not always possible, as mentioned. Ismael even has a nice case where this just fails and teardown would work - it was with AWS, not a BigQuery bug, but the same design.
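A hedged sketch of that workaround - emitting resource identifiers downstream and making a final fn responsible for the cleanup. Here deleteResource() is a hypothetical call, and the single-key GroupByKey is just one way to build a barrier so the cleanup runs only after the main processing is complete:

    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.GroupByKey;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.WithKeys;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;

    // resourceIds: identifiers (temp files, dataset names, machine ids)
    // emitted by the main ParDo, e.g. by TempFileWriteFn above.
    PCollection<String> resourceIds = ...;

    resourceIds
        .apply("KeyAll", WithKeys.of(""))     // single key = global barrier
        .apply("Gather", GroupByKey.create())
        .apply("Cleanup", ParDo.of(
            new DoFn<KV<String, Iterable<String>>, Void>() {
              @ProcessElement
              public void processElement(ProcessContext c) {
                // Runs only after all upstream bundles have succeeded.
                for (String id : c.element().getValue()) {
                  deleteResource(id); // hypothetical cleanup call
                }
              }
            }));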
> On Sun, Feb 18, 2018 at 9:56 AM Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>
>> 2018-02-18 18:36 GMT+01:00 Eugene Kirpichov <kirpic...@google.com>:
>>
>>> "Machine state" is overly low-level because many of the possible reasons can happen on a perfectly fine machine.
>>> If you'd like to rephrase it to "it will be called except in various situations where it's logically impossible or impractical to guarantee that it's called", that's fine. Or you can list some of the examples above.
>>
>> Sounds ok to me
>>
>>> The main point for the user is, you *will* see non-preventable situations where it couldn't be called - it's not just intergalactic crashes - so if the logic is very important (e.g. cleaning up a large amount of temporary files, shutting down a large number of VMs you started, etc.), you have to express it using one of the other methods that have stricter guarantees (which obviously come at a cost, e.g. no pass-by-reference).
>>
>> FinishBundle has the exact same guarantee, sadly, so I'm not sure which other method you speak about. Concretely, if you make it really unreliable - this is what "best effort" sounds like to me - then users can't use it to clean anything; but if you make it "can happen, but it is unexpected and means something happened", then it is fine to have a manual - or auto if fancy - recovery procedure. This is where it makes all the difference and impacts the developers and ops (all users, basically).
>>
>>> On Sun, Feb 18, 2018 at 9:16 AM Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>>>
>>>> Agree, Eugene, except that "best effort" means that. It is also often used to say "at will", and this is what triggered this thread.
>>>>
>>>> I'm fine using "except if the machine state prevents it", but "best effort" is too open and can be very badly and wrongly perceived by users (like I did).
>>>>
>>>> Romain Manni-Bucau
>>>> @rmannibucau <https://twitter.com/rmannibucau> | Blog <https://rmannibucau.metawerx.net/> | Old Blog <http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> | LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>>>>
>>>> 2018-02-18 18:13 GMT+01:00 Eugene Kirpichov <kirpic...@google.com>:
>>>>
>>>>> It will not be called if it's impossible to call it: in the example situation you have (intergalactic crash), and in a number of more common cases, e.g.:
>>>>> - the worker container has crashed (e.g. user code in a different thread called a C library over JNI and it segfaulted), a JVM bug, or a crash due to user code OOM;
>>>>> - the worker has lost network connectivity (then it may be called, but it won't be able to do anything useful);
>>>>> - this is running on a preemptible VM and it was preempted by the underlying cluster manager without notice, or the worker was too busy with other stuff (e.g. calling other Teardown functions) until the preemption timeout elapsed;
>>>>> - the underlying hardware simply failed (which happens quite often at scale);
>>>>> and many other conditions.
>>>>>
>>>>> "Best effort" is the commonly used term to describe such behavior. Please feel free to file bugs for cases where you observed a runner not call Teardown in a situation where it was possible to call it but the runner made insufficient effort.
>>>>>
>>>>> On Sun, Feb 18, 2018, 9:02 AM Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>>>>>
>>>>>> 2018-02-18 18:00 GMT+01:00 Eugene Kirpichov <kirpic...@google.com>:
>>>>>>
>>>>>>> On Sun, Feb 18, 2018, 2:06 AM Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>>>>>>>
>>>>>>>> On 18 Feb 2018 at 00:23, "Kenneth Knowles" <k...@google.com> wrote:
>>>>>>>>
>>>>>>>> On Sat, Feb 17, 2018 at 3:09 PM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> If you give an example of a high-level need (e.g. "I'm trying to write an IO for system $x and it requires the following initialization and the following cleanup logic and the following processing in between") I'll be better able to help you.
>>>>>>>>>
>>>>>>>>> Take a simple example of a transform requiring a connection. Using bundles is a perf killer since size is not controlled. Using teardown doesn't allow you to release the connection since it is a best-effort thing. Not releasing the connection makes you pay a lot - AWS ;) - or prevents you from launching other processing - concurrency limit.
>>>>>>>>
>>>>>>>> For this example @Teardown is an exact fit. If things die so badly that @Teardown is not called then nothing else can be called to close the connection either. What AWS service are you thinking of that stays open for a long time when everything at the other end has died?
>>>>>>>>
>>>>>>>> You assume connections are kind of stateless, but some (proprietary) protocols require some closing exchanges which are not only "I'm leaving".
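The connection case under discussion, written with the lifecycle methods as Kenn suggests - a sketch in which Connection, Client, and query() are hypothetical stand-ins:

    import org.apache.beam.sdk.transforms.DoFn;

    // One connection per DoFn instance: opened in @Setup, reused across
    // bundles, released in @Teardown.
    class QueryFn extends DoFn<String, String> {
      private transient Connection connection;

      @Setup
      public void setup() {
        connection = Client.connect(); // expensive; amortized across bundles
      }

      @ProcessElement
      public void processElement(ProcessContext c) {
        c.output(connection.query(c.element()));
      }

      @Teardown
      public void teardown() {
        // Best effort: if the worker VM is gone, there is no instance left
        // to call this on - which is exactly the guarantee being debated.
        if (connection != null) {
          connection.close();
        }
      }
    }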
>>>>>>>> For AWS I was thinking about starting some services - machines - on the fly at pipeline startup and closing them at the end. If teardown is not called you leak machines and money. You can say it can be done another way... as can the full pipeline ;).
>>>>>>>>
>>>>>>>> I don't want to be picky, but if Beam can't handle its components' lifecycle, it can't be used at scale for generic pipelines and is bound to some particular IOs.
>>>>>>>>
>>>>>>>> What prevents us from enforcing teardown - ignoring the interstellar crash case, which can't be handled by any human system? Nothing, technically. Why do you push to not handle it? Is it due to some legacy code in Dataflow, or something else?
>>>>>>>
>>>>>>> Teardown *is* already documented and implemented this way (best-effort). So I'm not sure what kind of change you're asking for.
>>>>>>
>>>>>> Remove "best effort" from the javadoc. If it is not called then it is a bug and we are done :).
>>>>>>>
>>>>>>>> Also, what does it mean for the users? The direct runner does it, so if a user uses the RI in tests, will he get a different behavior in prod? Also don't forget the user doesn't know what the IOs he composes use, so this is so impactful for the whole product that it must be handled IMHO.
>>>>>>>>
>>>>>>>> I understand the portability culture is new in the big data world, but it is not a reason to ignore what people did for years and to do it wrong before doing it right ;).
>>>>>>>>
>>>>>>>> My proposal is to list what can prevent guaranteeing - under normal IT conditions - the execution of teardown. Then we see if we can handle it, and only if there is a technical reason we can't do we make it experimental/unsupported in the API. I know Spark and Flink can; any blockers for other runners that I don't know about?
>>>>>>>>
>>>>>>>> Technical note: even a kill should go through Java shutdown hooks; otherwise your environment (the software enclosing Beam) is fully unhandled and your overall system is uncontrolled. The only case where this is not true is when the software is always owned by a vendor and never installed in a customer environment. In that case it belongs to the vendor to handle the Beam API, and not to Beam to adjust its API for a vendor - otherwise all features unsupported by one runner should be made optional, right?
>>>>>>>>
>>>>>>>> Not all state is about the network, even in distributed systems, so it is key to have an explicit and well-defined lifecycle.
>>>>>>>>
>>>>>>>> Kenn
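On the technical note above: Java shutdown hooks run on normal termination and on SIGTERM (a plain kill), but not on SIGKILL (kill -9), OS-level OOM kills, or hardware failure - which is why they cannot turn @Teardown into a hard guarantee either. A minimal example:

    // Registers a JVM shutdown hook: invoked on normal exit and SIGTERM,
    // skipped on SIGKILL, OS OOM kills, and hardware failure.
    Runtime.getRuntime().addShutdownHook(new Thread(() -> {
        // Last-chance cleanup, e.g. closing connections or releasing leases.
        System.err.println("JVM shutting down, releasing resources");
    }));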