Yes, exactly JB. I just want to ensure the SDK/core API is clear and well defined, and that any violation of it counts as a runner bug. What I don't want is a buggy implementation leaking into the SDK/core definition.
Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> | Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book
<https://www.packtpub.com/application-development/java-ee-8-high-performance>

2018-02-18 17:56 GMT+01:00 Jean-Baptiste Onofré <j...@nanthrax.net>:

> My bad, I thought you were talking about a guarantee in the Runner API.
>
> If it's a semantic point in the SDK (enforcement instead of best effort),
> and the runner doesn't respect it, then it's a limitation/bug in the
> runner. I would agree with that.
>
> Regards
> JB
>
> On 18/02/2018 16:58, Romain Manni-Bucau wrote:
>>
>> Le 18 févr. 2018 15:39, "Jean-Baptiste Onofré" <j...@nanthrax.net> a
>> écrit :
>>
>>> Hi,
>>>
>>> I think, as you said, it depends on the protocol and the IO.
>>>
>>> For instance, in the first version of JdbcIO, I created the connections
>>> in @Setup and released them in @Teardown.
>>>
>>> But in the case of a streaming system, that's not so good (especially
>>> for pooling), as the connection stays open for a very long time.
>>
>> Hmm, both pooling and connection holding for JDBC in the Beam context can
>> be debated in practice, but let's assume it.
>>
>>> So I updated it to acquire the connection in @StartBundle and release it
>>> in @FinishBundle.
>>
>> Which leads to an unpredictable bundle size and therefore very, very bad
>> performance on the write side - the read side is faked by an in-memory
>> buffer I guess, which breaks the bundle definition, but let's ignore that
>> too for now.
>>
>>> So I think it depends on the kind of connection: the kind actually
>>> holding resources should be managed per bundle (at least for now); the
>>> other kind (just wrapping configuration but not holding resources, like
>>> the Apache HTTP Components client for instance) could be dealt with in
>>> the DoFn lifecycle.
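[Editor's note: the two placements JB describes can be sketched in plain Java. This is not the Beam SDK - `PerBundleWriteFn`, `FakeConnection` and the hand-driven bundle loop are illustrative stand-ins for the DoFn lifecycle hooks - but it shows why per-bundle acquisition ties connection churn to bundle boundaries the user does not control.]

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-in for a resource such as a JDBC connection.
class FakeConnection {
    static int opened = 0;
    FakeConnection() { opened++; }
    void close() { /* release the underlying resource */ }
}

// Sketch of a write transform holding its connection per bundle, as in
// the updated JdbcIO described above. Method names mirror the DoFn
// lifecycle hooks (@StartBundle / @FinishBundle), but this is plain Java.
class PerBundleWriteFn {
    private FakeConnection connection;
    private final List<String> calls = new ArrayList<>();

    void startBundle() { connection = new FakeConnection(); calls.add("startBundle"); }
    void processElement(String element) { calls.add("process:" + element); }
    void finishBundle() { connection.close(); connection = null; calls.add("finishBundle"); }

    List<String> calls() { return calls; }
}

public class LifecycleDemo {
    public static void main(String[] args) {
        PerBundleWriteFn fn = new PerBundleWriteFn();
        // The runner, not the user, decides bundle boundaries: here two
        // bundles for three elements means two connection open/close
        // cycles - the unpredictable cost discussed in the thread.
        fn.startBundle();
        fn.processElement("a");
        fn.processElement("b");
        fn.finishBundle();
        fn.startBundle();
        fn.processElement("c");
        fn.finishBundle();
        System.out.println("connections opened: " + FakeConnection.opened);
    }
}
```

With @Setup/@Teardown placement the connection would instead be opened once per DoFn instance, which is exactly why a best-effort @Teardown is the crux of the discussion below.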
>>
>> Once again, I would be OK with bundles for now - but it doesn't solve the
>> real issue unless bundles are up to the user. Since they are not, it
>> doesn't help and can just degrade the overall behavior in both batch and
>> streaming.
>>
>> I fully understand Beam doesn't handle that properly today. What blocks
>> us from doing it? Nothing technical, so why not do it?
>>
>> Technically:
>>
>> 1. Teardown can be guaranteed
>> 2. Bundle size can be highly influenced / configured by the user
>>
>> Both are needed to offer a strong API compared to competitors, so that
>> going portable doesn't bring users only disadvantages.
>>
>> Let's just do it, no?
>>
>>> Regards
>>> JB
>>>
>>> On 02/18/2018 11:05 AM, Romain Manni-Bucau wrote:
>>>>
>>>> Le 18 févr. 2018 00:23, "Kenneth Knowles" <k...@google.com> a écrit :
>>>>
>>>>> On Sat, Feb 17, 2018 at 3:09 PM, Romain Manni-Bucau
>>>>> <rmannibu...@gmail.com> wrote:
>>>>>
>>>>>>> If you give an example of a high-level need (e.g. "I'm trying to
>>>>>>> write an IO for system $x and it requires the following
>>>>>>> initialization, the following cleanup logic, and the following
>>>>>>> processing in between") I'll be better able to help you.
>>>>>>
>>>>>> Take a simple example of a transform requiring a connection. Using
>>>>>> bundles is a perf killer since their size is not controlled. Using
>>>>>> teardown doesn't let you release the connection since it is a
>>>>>> best-effort thing. Not releasing the connection makes you pay a lot
>>>>>> - AWS ;) - or prevents you from launching other processing -
>>>>>> concurrency limit.
>>>>>
>>>>> For this example @Teardown is an exact fit. If things die so badly
>>>>> that @Teardown is not called, then nothing else can be called to
>>>>> close the connection either.
>>>>> What AWS service are you thinking of that stays open for a long time
>>>>> when everything at the other end has died?
>>>>
>>>> You assume connections are kind of stateless, but some (proprietary)
>>>> protocols require closing exchanges which are not only "I'm leaving".
>>>>
>>>> For AWS I was thinking about starting some services - machines - on
>>>> the fly at pipeline startup and shutting them down at the end. If
>>>> teardown is not called, you leak machines and money. You could say it
>>>> can be done another way... as can the full pipeline ;).
>>>>
>>>> I don't want to be picky, but if Beam can't handle its components'
>>>> lifecycle, it can't be used at scale for generic pipelines and is
>>>> bound to some particular IOs.
>>>>
>>>> What prevents enforcing teardown - ignoring the interstellar-crash
>>>> case, which can't be handled by any human system? Nothing technically.
>>>> Why do you push to not handle it? Is it due to some legacy code on
>>>> Dataflow or something else?
>>>>
>>>> Also, what does it mean for users? The direct runner does it, so if a
>>>> user uses the RI in tests, will he get a different behavior in
>>>> production? Also, don't forget the user doesn't know what the IOs he
>>>> composes use, so this is so impactful for the whole product that it
>>>> must be handled IMHO.
>>>>
>>>> I understand the portability culture is new in the big data world, but
>>>> that is no reason to ignore what people did for years and do it wrong
>>>> before doing it right ;).
>>>>
>>>> My proposal is to list what can prevent guaranteeing - under normal IT
>>>> conditions - the execution of teardown. Then we see if we can handle
>>>> each case, and only if there is a technical reason we can't do we make
>>>> it experimental/unsupported in the API. I know Spark and Flink can;
>>>> any unknown blocker for other runners?
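[Editor's note: the "closing exchanges" point can be illustrated with a minimal sketch. `StatefulSession` and its wire format are hypothetical, standing in for a protocol whose peer only releases server-side state after an explicit CLOSE message rather than on socket drop.]

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative session for a protocol that requires a closing exchange:
// the peer keeps per-session state (and keeps billing) until it receives
// a CLOSE message, not merely when the connection disappears.
class StatefulSession {
    private final List<String> wire = new ArrayList<>();
    private boolean closedCleanly = false;

    void send(String payload) { wire.add("DATA:" + payload); }

    // This is the closing handshake a guaranteed @Teardown would run.
    // Skipping it leaks the peer's session state - machines and money
    // in the AWS example above.
    void close() {
        wire.add("CLOSE");
        closedCleanly = true; // stands in for awaiting the peer's CLOSE-ACK
    }

    List<String> wire() { return wire; }
    boolean closedCleanly() { return closedCleanly; }
}

public class SessionDemo {
    public static void main(String[] args) {
        StatefulSession session = new StatefulSession();
        session.send("record-1");
        session.close();
        System.out.println(session.wire());
        System.out.println("closed cleanly: " + session.closedCleanly());
    }
}
```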
>>>> Technical note: even a kill should go through Java shutdown hooks;
>>>> otherwise your environment (the software enclosing Beam) is fully
>>>> unhandled and your overall system is uncontrolled. The only case where
>>>> this is not true is when the software is always owned by a vendor and
>>>> never installed on a customer environment. In that case it belongs to
>>>> the vendor to handle the Beam API, and not to Beam to adjust its API
>>>> for a vendor - otherwise all features unsupported by one runner should
>>>> be made optional, right?
>>>>
>>>> Not all state is about the network, even in distributed systems, so
>>>> having an explicit and defined lifecycle is key.
>>>>
>>>>> Kenn

>>> --
>>> Jean-Baptiste Onofré
>>> jbono...@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
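[Editor's note: the shutdown-hook argument can be demonstrated in plain Java. A hook registered via `Runtime.addShutdownHook` runs on normal JVM exit and on SIGTERM (a plain `kill`), though not on SIGKILL or a hard JVM crash - which matches the "interstellar crash" caveat above. The cleanup body here is illustrative.]

```java
public class ShutdownHookDemo {
    public static void main(String[] args) {
        // The JVM runs registered hooks on normal exit and on SIGTERM,
        // but not on SIGKILL ("kill -9") or a hard crash.
        Thread cleanup = new Thread(() -> {
            // A runner could drive @Teardown from here: close
            // connections, stop machines started by the pipeline, etc.
            System.out.println("teardown executed");
        });
        Runtime.getRuntime().addShutdownHook(cleanup);
        System.out.println("pipeline work done");
        // As main returns, the JVM starts shutdown and invokes the hook.
    }
}
```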