Yes exactly JB, I just want to ensure the SDK/core API is clear and well
defined, and that any violation of it counts as a runner bug. What I
don't want is a buggy implementation leaking into the SDK/core definition.


Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book
<https://www.packtpub.com/application-development/java-ee-8-high-performance>

2018-02-18 17:56 GMT+01:00 Jean-Baptiste Onofré <j...@nanthrax.net>:

> My bad, I thought you talked about a guarantee in the Runner API.
>
> If it's a semantic point in the SDK (enforcement instead of best effort),
> and the runner doesn't respect it, then it's a limitation/bug in the
> runner; I would agree with that.
>
> Regards
> JB
>
> On 18/02/2018 16:58, Romain Manni-Bucau wrote:
>
>>
>>
>> On 18 Feb 2018 at 15:39, "Jean-Baptiste Onofré" <j...@nanthrax.net <mailto:
>> j...@nanthrax.net>> wrote:
>>
>>     Hi,
>>
>>     I think, as you said, it depends on the protocol and the IO.
>>
>>     For instance, in the first version of JdbcIO, I created the
>>     connections in @Setup and released them in @Teardown.
>>
>>     But in the case of a streaming system, that's not so good (especially
>>     for pooling), as the connection stays open for a very long time.
>>
>>
>> Hmm, that can be debated in practice (both pooling and connection holding
>> for JDBC in the Beam context), but let's assume it.
>>
>>
>>     So I updated it to open the connection in @StartBundle and release it
>>     in @FinishBundle.
>>
>>
>>
>> Which leads to an unpredictable bundle size and therefore very bad write
>> performance - the read side is faked by an in-memory buffer, I guess,
>> which breaks the bundle definition, but let's ignore that too for now.
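The two lifecycle strategies being compared - connection per DoFn instance (@Setup/@Teardown) versus connection per bundle (@StartBundle/@FinishBundle) - can be sketched with a small plain-Java simulation (no Beam dependency; the method names only mirror Beam's annotations, and the counts assume a single DoFn instance processing all bundles):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Simulation of the two connection-scoping strategies from the thread.
public class LifecycleDemo {
    static final AtomicInteger OPENS = new AtomicInteger();

    // Stand-in for a real (e.g. JDBC) connection; counts how often one is opened.
    static class FakeConnection {
        FakeConnection() { OPENS.incrementAndGet(); }
        void close() { }
    }

    // Strategy 1: connection tied to the DoFn instance (@Setup/@Teardown).
    static int perInstance(int bundles, int elementsPerBundle) {
        OPENS.set(0);
        FakeConnection c = new FakeConnection();              // @Setup
        for (int b = 0; b < bundles; b++) {
            for (int e = 0; e < elementsPerBundle; e++) { /* @ProcessElement uses c */ }
        }
        c.close();                                            // @Teardown
        return OPENS.get();
    }

    // Strategy 2: connection tied to the bundle (@StartBundle/@FinishBundle).
    static int perBundle(int bundles, int elementsPerBundle) {
        OPENS.set(0);
        for (int b = 0; b < bundles; b++) {
            FakeConnection c = new FakeConnection();          // @StartBundle
            for (int e = 0; e < elementsPerBundle; e++) { /* @ProcessElement uses c */ }
            c.close();                                        // @FinishBundle
        }
        return OPENS.get();
    }

    public static void main(String[] args) {
        System.out.println(perInstance(100, 10)); // 1 connection, held for the DoFn's lifetime
        System.out.println(perBundle(100, 10));   // 100 connections, one per bundle
    }
}
```

With runner-chosen bundle sizes, strategy 2's connection count is whatever the runner decides, which is exactly the unpredictability being complained about above.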
>>
>>
>>     So, I think it depends on the kind of connection: the kind of
>>     connection actually holding resources should be managed per bundle
>>     (at least for now), while the other kind of connection (just wrapping
>>     configuration but not holding resources, like the Apache HTTP
>>     Components Client for instance) could be dealt with in the DoFn
>>     lifecycle.
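JB's distinction can be illustrated with a JDK type as a stand-in (an assumption: `java.net.http.HttpClient` plays the role of the Apache HTTP Components Client here - building it only captures configuration and opens no connection, so holding it for a long-lived DoFn costs nothing):

```java
import java.net.http.HttpClient;
import java.time.Duration;

// A client that only wraps configuration: safe to create in @Setup and
// keep for the DoFn's whole lifetime, unlike an object pinning a real
// resource (an open socket, a JDBC connection), which is better scoped
// to a bundle.
public class ClientScopeDemo {
    // "@Setup": no network activity happens here; connections are only
    // opened lazily, per request, when the client is actually used.
    static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(5))
            .build();

    public static void main(String[] args) {
        // Only configuration is held so far.
        System.out.println(CLIENT.connectTimeout().get().getSeconds());
    }
}
```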
>>
>>
>>
>> Once again, I would be OK with bundles for now - but that only solves the
>> real issue if bundle size is up to the user. Since it is not, it doesn't
>> help and can just degrade the overall behavior in both batch and streaming.
>>
>> I fully understand Beam doesn't handle that properly today. What blocks us
>> from doing it? Nothing technical, so why not do it?
>>
>> Technically:
>>
>> 1. Teardown can be guaranteed
>> 2. Bundle size can be highly influenced / configured by the user
>>
>> Both are needed to propose a strong API compared to competitors, so that
>> going portable does not bring only disadvantages for users.
>>
>> Let's just do it, no?
>>
>>
>>     Regards
>>     JB
>>
>>     On 02/18/2018 11:05 AM, Romain Manni-Bucau wrote:
>>      >
>>      >
>>      > On 18 Feb 2018 at 00:23, "Kenneth Knowles" <k...@google.com
>>     <mailto:k...@google.com>> wrote:
>>      >
>>      >     On Sat, Feb 17, 2018 at 3:09 PM, Romain Manni-Bucau
>>     <rmannibu...@gmail.com <mailto:rmannibu...@gmail.com>> wrote:
>>      >
>>      >             If you give an example of a high-level need (e.g. "I'm
>>      >             trying to write an IO for system $x and it requires the
>>      >             following initialization and the following cleanup
>>      >             logic and the following processing in between") I'll be
>>      >             better able to help you.
>>      >
>>      >
>>      >         Take a simple example of a transform requiring a
>>      >         connection. Using bundles is a perf killer since the size
>>      >         is not controlled. Using teardown doesn't allow you to
>>      >         release the connection since it is a best-effort thing.
>>      >         Not releasing the connection makes you pay a lot - AWS ;) -
>>      >         or prevents you from launching other processing -
>>      >         concurrency limits.
>>      >
>>      >
>>      >     For this example @Teardown is an exact fit. If things die so
>>      >     badly that @Teardown is not called, then nothing else can be
>>      >     called to close the connection either. What AWS service are you
>>      >     thinking of that stays open for a long time when everything at
>>      >     the other end has died?
>>      >
>>      >
>>      > You assume connections are kind of stateless, but some
>>      > (proprietary) protocols require a closing exchange which is more
>>      > than just "I'm leaving".
>>      >
>>      > For AWS I was thinking about starting some services - machines -
>>      > on the fly at pipeline startup and shutting them down at the end.
>>      > If teardown is not called, you leak machines and money. You can say
>>      > it can be done another way... as can the full pipeline ;).
>>      >
>>      > I don't want to be picky, but if Beam can't handle its components'
>>      > lifecycle, it can't be used at scale for generic pipelines and ends
>>      > up bound to some particular IOs.
>>      >
>>      > What prevents us from enforcing teardown - ignoring the
>>      > interstellar-crash case, which can't be handled by any human
>>      > system? Nothing, technically. Why do you push not to handle it? Is
>>      > it due to some legacy code in Dataflow, or something else?
>>      >
>>      > Also, what does it mean for users? The direct runner does it, so
>>      > if a user uses the RI in tests, will he get different behavior in
>>      > prod? Also, don't forget the user doesn't know what the IOs he
>>      > composes use internally, so this is so impactful for the whole
>>      > product that it must be handled IMHO.
>>      >
>>      > I understand the portability culture is new in the big data world,
>>      > but that is not a reason to ignore what people have done for years
>>      > and do it wrong before doing it right ;).
>>      >
>>      > My proposal is to list what can prevent guaranteeing - under
>>      > normal IT conditions - the execution of teardown. Then we see if
>>      > we can handle each case, and only if there is a technical reason we
>>      > can't do we make it experimental/unsupported in the API. I know
>>      > Spark and Flink can; are there any blockers for other runners?
>>      >
>>      > Technical note: even a kill should go through Java shutdown hooks,
>>      > otherwise your environment (the software enclosing Beam) is fully
>>      > unhandled and your overall system is uncontrolled. The only case
>>      > where that is not true is when the software is always owned by a
>>      > vendor and never installed on a customer environment. In that case
>>      > it is up to the vendor to handle the Beam API, and not up to Beam
>>      > to adjust its API for a vendor - otherwise every feature
>>      > unsupported by one runner should be made optional, right?
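The shutdown-hook point can be sketched in plain Java: a regular kill (SIGTERM, not SIGKILL) runs registered hooks, which is the kind of last-resort mechanism a runner could use for teardown. The `registerTeardown` helper below is hypothetical, for illustration only:

```java
public class ShutdownHookDemo {
    // Hypothetical helper: register a cleanup action as a JVM shutdown hook
    // and return the hook thread so callers can inspect or deregister it.
    static Thread registerTeardown(Runnable cleanup) {
        Thread hook = new Thread(cleanup, "teardown-hook");
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }

    public static void main(String[] args) {
        registerTeardown(() ->
                // Last-chance cleanup: close connections, stop leased machines.
                System.out.println("teardown hook ran"));
        System.out.println("pipeline work done");
        // On normal exit or SIGTERM, the hook runs before the JVM dies;
        // only SIGKILL (kill -9) or a hard JVM crash skips it.
    }
}
```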
>>      >
>>      > Not all state is about the network, even in distributed systems,
>>      > so having an explicit, well-defined lifecycle is key.
>>      >
>>      >
>>      >     Kenn
>>      >
>>      >
>>
>>     --
>>     Jean-Baptiste Onofré
>>     jbono...@apache.org <mailto:jbono...@apache.org>
>>     http://blog.nanthrax.net
>>     Talend - http://www.talend.com
>>
>>
>>
