My bad, I thought you were talking about a guarantee in the Runner API.
If it's a semantic point in the SDK (enforcement instead of best effort),
and the runner not respecting it is then a limitation/bug in the runner,
I would agree with that.
Regards
JB
On 18/02/2018 16:58, Romain Manni-Bucau wrote:
On 18 Feb 2018 at 15:39, "Jean-Baptiste Onofré" <j...@nanthrax.net> wrote:
Hi,
I think, as you said, it depends on the protocol and the IO.
For instance, in the first version of JdbcIO, I created the connections in
@Setup and released them in @Teardown.
But, in the case of a streaming system, that's not so good (especially for
pooling), as the connection stays open for a very long time.
Hmm, both pooling and connection holding for JDBC in a Beam context can be
debated in practice, but let's assume that.
So, I updated it to open the connection in @StartBundle and release it in
@FinishBundle.
Which leads to an unpredictable bundle size and therefore very bad
performance on the write side - the read side is faked by an in-memory
buffer, I guess, which breaks the bundle definition, but let's ignore that
too for now.
So, I think it depends on the kind of connection: the kind of connection
actually holding resources should be managed per bundle (at least for now);
the other kind of connection (just wrapping configuration but not holding
resources, like Apache HTTP Components Client for instance) could be dealt
with in the DoFn lifecycle.
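The lifecycle choices being debated can be sketched with a simplified standalone model — this is NOT the actual Beam API, just a plain-Java illustration of the call ordering that @Setup/@StartBundle/@FinishBundle/@Teardown imply, with the class and method names invented for the sketch:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model (not the real Beam DoFn API) of the lifecycle under
// discussion: setup once per instance, start/finish once per bundle,
// teardown once at the end of the instance's life.
public class LifecycleModel {
    static final List<String> calls = new ArrayList<>();

    static class ConnectionFn {
        void setup()        { calls.add("setup"); }        // parse config; no socket yet
        void startBundle()  { calls.add("startBundle"); }  // open the real connection here
        void process(int e) { calls.add("process:" + e); } // use the connection
        void finishBundle() { calls.add("finishBundle"); } // flush and close the connection
        void teardown()     { calls.add("teardown"); }     // best effort today: the point of debate
    }

    public static void main(String[] args) {
        ConnectionFn fn = new ConnectionFn();
        fn.setup();
        // The runner, not the user, decides bundle boundaries - which is why
        // per-bundle connections give unpredictable performance.
        for (int bundle = 0; bundle < 2; bundle++) {
            fn.startBundle();
            fn.process(bundle);
            fn.finishBundle();
        }
        fn.teardown();
        System.out.println(calls);
    }
}
```

Holding the connection in setup/teardown keeps it open for the whole instance lifetime (bad for pools in streaming); holding it per bundle ties the connection cost to a bundle size the user cannot control.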
Once again, I would be OK with bundles for now - but it doesn't solve the
real issue unless bundles are up to the user. Since they are not, it doesn't
help and can just degrade the overall behavior in both batch and streaming.
I fully understand Beam doesn't handle that properly today. What blocks us
from doing it? Nothing technical, so why not do it?
Technically:
1. Teardown can be guaranteed
2. Bundle size can be highly influenced / configured by the user
Both are needed to offer a strong API compared to competitors, and to make
sure users don't see only disadvantages when going portable.
Let's just do it, no?
Regards
JB
On 02/18/2018 11:05 AM, Romain Manni-Bucau wrote:
>
>
> On 18 Feb 2018 at 00:23, "Kenneth Knowles" <k...@google.com> wrote:
>
>     On Sat, Feb 17, 2018 at 3:09 PM, Romain Manni-Bucau
>     <rmannibu...@gmail.com> wrote:
>
>         If you give an example of a high-level need (e.g. "I'm trying to
>         write an IO for system $x and it requires the following
>         initialization, the following cleanup logic, and the following
>         processing in between") I'll be better able to help you.
>
>
>     Take a simple example of a transform requiring a connection. Using
>     bundles is a perf killer since their size is not controlled. Using
>     teardown doesn't allow you to release the connection since it is a
>     best-effort thing. Not releasing the connection makes you pay a lot -
>     aws ;) - or prevents you from launching other processing - concurrency
>     limits.
>
>
> For this example @Teardown is an exact fit. If things die so badly that
> @Teardown is not called then nothing else can be called to close the
> connection either. What AWS service are you thinking of that stays open
> for a long time when everything at the other end has died?
>
>
> You assume connections are kind of stateless, but some (proprietary)
> protocols require closing exchanges which are more than just "I'm
> leaving".
>
> For AWS I was thinking about starting some services - machines - on the
> fly at pipeline startup and closing them at the end. If teardown is not
> called you leak machines and money. You can say it can be done another
> way... as can the full pipeline ;).
>
> I don't want to be picky, but if Beam can't handle its components'
> lifecycle it can't be used at scale for generic pipelines and ends up
> bound to some particular IOs.
>
> What prevents enforcing teardown - ignoring the interstellar-crash case,
> which no human system can handle? Nothing, technically. Why do you push
> to not handle it? Is it due to some legacy code in Dataflow or something
> else?
>
> Also, what does it mean for users? The direct runner does it, so if a
> user uses the reference implementation in tests, will he get a different
> behavior in production? Also, don't forget the user doesn't know what
> the IOs he composes use, so this impacts the whole product so much that
> it must be handled, IMHO.
>
> I understand the portability culture is new in the big data world, but
> that is not a reason to ignore what people have done for years and to do
> it wrong before doing it right ;).
>
> My proposal is to list what can prevent guaranteeing - under normal IT
> conditions - the execution of teardown. Then we see if we can handle it,
> and only if there is a technical reason we can't do we make it
> experimental/unsupported in the API. I know Spark and Flink can; is
> there any blocker for other runners that I'm not aware of?
>
> Technical note: even a kill should go through Java shutdown hooks;
> otherwise your environment (the software enclosing Beam) is fully
> unhandled and your overall system is uncontrolled. The only case where
> this is not true is when the software is always owned by a vendor and
> never installed in a customer environment. In that case it belongs to
> the vendor to handle the Beam API, not to Beam to adjust its API for a
> vendor - otherwise all features unsupported by one runner should be made
> optional, right?
>
> Not all state is about the network, even in distributed systems, so it
> is key to have an explicit and defined lifecycle.
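The shutdown-hook point above can be sketched in plain Java. This is a minimal illustration, not Beam code; the class name and cleanup comments are invented for the sketch. A hook registered via Runtime.addShutdownHook runs on normal JVM exit and on SIGTERM, but not on SIGKILL (kill -9) or a hard crash — which is exactly the narrow case that no system can handle:

```java
// Minimal sketch of "even a kill should go through java shutdown hooks":
// the JVM gives the enclosing software one last chance to run teardown
// logic (close connections, stop leased machines) before exiting.
public class TeardownHook {
    // Kept as a field so the hook can be inspected or deregistered later.
    static final Thread hook = new Thread(() -> {
        // Best-effort cleanup would go here: release pooled connections,
        // run the protocol's closing exchange, stop billed machines, etc.
        System.out.println("teardown hook ran");
    });

    public static void register() {
        Runtime.getRuntime().addShutdownHook(hook);
    }

    public static void main(String[] args) {
        register();
        System.out.println("pipeline work here");
        // On normal exit or SIGTERM the JVM runs the hook;
        // kill -9 or a hardware crash skips it.
    }
}
```

This is why "teardown cannot be guaranteed" only truly holds for hard crashes: in the normal operating conditions the proposal targets, the hosting process does get a chance to clean up.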
>
>
> Kenn
>
>
--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com