"Machine state" is overly low-level because many of the possible reasons
can happen on a perfectly fine machine.
If you'd like to rephrase it to "it will be called except in various
situations where it's logically impossible or impractical to guarantee that
it's called", that's fine. Or you can list some of the examples above.

The main point for the user is: you *will* see non-preventable situations
where it couldn't be called - it's not just intergalactic crashes - so if
the logic is very important (e.g. cleaning up a large number of temporary
files, shutting down a large number of VMs you started, etc.), you have to
express it using one of the other methods that have stricter guarantees
(which obviously come at a cost, e.g. no pass-by-reference).
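
A minimal sketch of that pattern (CleanupTempFilesFn and the temp-file
scenario are hypothetical; DoFn and ParDo are the real Beam primitives):
the cleanup is expressed as a downstream step that receives the resource
identifiers as pipeline data - hence the "no pass-by-reference" cost:

  import java.io.IOException;
  import java.nio.file.Files;
  import java.nio.file.Paths;
  import org.apache.beam.sdk.transforms.DoFn;

  class CleanupTempFilesFn extends DoFn<String, Void> {
    @ProcessElement
    public void processElement(ProcessContext c) throws IOException {
      // Each element is a temp-file path emitted by the producing step.
      // It arrives as serialized data, not as an in-memory reference,
      // and the step is retried until its bundle commits.
      Files.deleteIfExists(Paths.get(c.element()));
    }
  }

  // Usage: tempFilePaths.apply(ParDo.of(new CleanupTempFilesFn()));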

On Sun, Feb 18, 2018 at 9:16 AM Romain Manni-Bucau <rmannibu...@gmail.com>
wrote:

> Agree, Eugene, except that "best effort" means just that - yet it is also
> often used to say "at will", and this is what triggered this thread.
>
> I'm fine using "except if the machine state prevents it", but "best
> effort" is too open and can be very badly and wrongly perceived by users
> (as I did).
>
>
> Romain Manni-Bucau
> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
> <https://rmannibucau.metawerx.net/> | Old Blog
> <http://rmannibucau.wordpress.com> | Github
> <https://github.com/rmannibucau> | LinkedIn
> <https://www.linkedin.com/in/rmannibucau> | Book
> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>
> 2018-02-18 18:13 GMT+01:00 Eugene Kirpichov <kirpic...@google.com>:
>
>> It will not be called if it's impossible to call it: in the example
>> situation you have (intergalactic crash), and in a number of more common
>> cases, e.g.:
>>
>> - the worker container has crashed (e.g. user code in a different thread
>> called a C library over JNI and it segfaulted), a JVM bug, or a crash
>> caused by user code OOM;
>> - the worker has lost network connectivity (then it may be called, but
>> it won't be able to do anything useful);
>> - this is running on a preemptible VM and it was preempted by the
>> underlying cluster manager without notice, or the worker was too busy
>> with other stuff (e.g. calling other Teardown functions) until the
>> preemption timeout elapsed;
>> - the underlying hardware simply failed (which happens quite often at
>> scale);
>>
>> and many other conditions.
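>>
>> To make this concrete, a minimal DoFn sketch (ExpensiveClient is a
>> placeholder name, not a real API; the lifecycle annotations are Beam's):
>>
>>   import org.apache.beam.sdk.transforms.DoFn;
>>
>>   class LifecycleFn extends DoFn<String, String> {
>>     private transient ExpensiveClient client;  // placeholder client
>>
>>     @Setup
>>     public void setup() {
>>       // Runs once per DoFn instance, before any bundle.
>>       client = ExpensiveClient.connect();
>>     }
>>
>>     @ProcessElement
>>     public void processElement(ProcessContext c) {
>>       c.output(client.transform(c.element()));
>>     }
>>
>>     @Teardown
>>     public void teardown() {
>>       // Best effort: skipped in every case listed above (segfault,
>>       // OOM kill, preemption, hardware failure). Release local
>>       // resources here, but don't depend on it for critical cleanup.
>>       client.close();
>>     }
>>   }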
>>
>> "Best effort" is the commonly used term to describe such behavior. Please
>> feel free to file bugs for cases where you observed a runner not call
>> Teardown in a situation where it was possible to call it but the runner
>> made insufficient effort.
>>
>> On Sun, Feb 18, 2018, 9:02 AM Romain Manni-Bucau <rmannibu...@gmail.com>
>> wrote:
>>
>>> 2018-02-18 18:00 GMT+01:00 Eugene Kirpichov <kirpic...@google.com>:
>>>
>>>>
>>>>
>>>> On Sun, Feb 18, 2018, 2:06 AM Romain Manni-Bucau <rmannibu...@gmail.com>
>>>> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Feb 18, 2018 at 00:23, "Kenneth Knowles" <k...@google.com> wrote:
>>>>>
>>>>> On Sat, Feb 17, 2018 at 3:09 PM, Romain Manni-Bucau <
>>>>> rmannibu...@gmail.com> wrote:
>>>>>>
>>>>>> If you give an example of a high-level need (e.g. "I'm trying to
>>>>>> write an IO for system $x and it requires the following
>>>>>> initialization and the following cleanup logic and the following
>>>>>> processing in between") I'll be better able to help you.
>>>>>>
>>>>>>
>>>>>> Take a simple example of a transform requiring a connection. Using
>>>>>> bundles is a perf killer since their size is not controlled. Using
>>>>>> teardown doesn't allow you to release the connection since it is a
>>>>>> best-effort thing. Not releasing the connection makes you pay a lot
>>>>>> - AWS ;) - or prevents you from launching other processing -
>>>>>> concurrency limit.
>>>>>>
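>>>>>> To make the trade-off concrete, a minimal sketch of the per-bundle
>>>>>> variant using plain JDBC (the URL and the treat-each-element-as-SQL
>>>>>> scheme are made up for illustration, not the thread's actual code):
>>>>>>
>>>>>>   import java.sql.Connection;
>>>>>>   import java.sql.DriverManager;
>>>>>>   import java.sql.SQLException;
>>>>>>   import org.apache.beam.sdk.transforms.DoFn;
>>>>>>
>>>>>>   class WriteFn extends DoFn<String, Void> {
>>>>>>     private transient Connection conn;
>>>>>>
>>>>>>     // Per-bundle lifecycle: the close effectively happens (a
>>>>>>     // failed bundle is retried), but bundle sizes are runner-
>>>>>>     // controlled and can be tiny, so open/close may run very
>>>>>>     // often - the perf killer described above. @Setup/@Teardown
>>>>>>     // would amortize the cost, but @Teardown is best-effort only.
>>>>>>     @StartBundle
>>>>>>     public void startBundle() throws SQLException {
>>>>>>       conn = DriverManager.getConnection("jdbc:postgresql://db/x");
>>>>>>     }
>>>>>>
>>>>>>     @ProcessElement
>>>>>>     public void processElement(ProcessContext c) throws SQLException {
>>>>>>       conn.createStatement().execute(c.element());  // element = SQL
>>>>>>     }
>>>>>>
>>>>>>     @FinishBundle
>>>>>>     public void finishBundle() throws SQLException {
>>>>>>       conn.close();
>>>>>>     }
>>>>>>   }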
>>>>>
>>>>> For this example @Teardown is an exact fit. If things die so badly
>>>>> that @Teardown is not called then nothing else can be called to close the
>>>>> connection either. What AWS service are you thinking of that stays open 
>>>>> for
>>>>> a long time when everything at the other end has died?
>>>>>
>>>>>
>>>>> You assume connections are kind of stateless, but some (proprietary)
>>>>> protocols require closing exchanges which are not only "I'm leaving".
>>>>>
>>>>> For AWS I was thinking about starting some services - machines - on
>>>>> the fly at pipeline startup and closing them at the end. If teardown
>>>>> is not called you leak machines and money. You can say it can be done
>>>>> another way... but so can the full pipeline ;).
>>>>>
>>>>> I don't want to be picky, but if Beam can't handle its components'
>>>>> lifecycle, it can't be used at scale for generic pipelines and is
>>>>> bound to some particular IOs.
>>>>>
>>>>> What prevents enforcing teardown - ignoring the interstellar crash
>>>>> case, which can't be handled by any human system? Nothing technically.
>>>>> Why do you push to not handle it? Is it due to some legacy code on
>>>>> Dataflow or something else?
>>>>>
>>>> Teardown *is* already documented and implemented this way
>>>> (best-effort). So I'm not sure what kind of change you're asking for.
>>>>
>>>
>>> Remove "best effort" from the javadoc. If it is not called then it is
>>> a bug and we are done :).
>>>
>>>
>>>>
>>>>
>>>>> Also, what does it mean for the users? The direct runner does it, so
>>>>> if a user uses the RI in tests, will he get a different behavior in
>>>>> prod? Also, don't forget the user doesn't know what the IOs he
>>>>> composes use, so this is so impactful for the whole product that it
>>>>> must be handled IMHO.
>>>>>
>>>>> I understand the portability culture is new in the big data world,
>>>>> but that is not a reason to ignore what people did for years and do
>>>>> it wrong before doing it right ;).
>>>>>
>>>>> My proposal is to list what can prevent guaranteeing - under normal
>>>>> IT conditions - the execution of teardown. Then we see if we can
>>>>> handle each case, and only if there is a technical reason we can't do
>>>>> we make it experimental/unsupported in the API. I know Spark and
>>>>> Flink can; any known blockers for other runners?
>>>>>
>>>>> Technical note: even a kill should go through Java shutdown hooks;
>>>>> otherwise your environment (the software enclosing Beam) is fully
>>>>> unhandled and your overall system is uncontrolled. The only case
>>>>> where this is not true is when the software is always owned by a
>>>>> vendor and never installed on a customer environment. In that case it
>>>>> belongs to the vendor to handle the Beam API, and not to Beam to
>>>>> adjust its API for a vendor - otherwise all features unsupported by
>>>>> one runner should be made optional, right?
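>>>>>
>>>>> For reference, a minimal shutdown-hook sketch; note a hook runs on a
>>>>> normal kill (SIGTERM) or a clean exit, but not on kill -9, a JVM
>>>>> crash, or a machine failure:
>>>>>
>>>>>   public class HookDemo {
>>>>>     public static void main(String[] args) throws InterruptedException {
>>>>>       // Registered hooks run when the JVM begins an orderly shutdown.
>>>>>       Runtime.getRuntime().addShutdownHook(
>>>>>           new Thread(() -> System.err.println("cleaning up")));
>>>>>       Thread.sleep(60_000);  // send SIGTERM in this window to test
>>>>>     }
>>>>>   }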
>>>>>
>>>>> Not all state is about the network, even in distributed systems, so
>>>>> it is key to have an explicit and defined lifecycle.
>>>>>
>>>>>
>>>>> Kenn
>>>>>
>>>>>
>>>>>
>
