Do we have a good place to track the items for Beam 3.0, or is Jira the best place? Romain has a good point - if this gets forgotten when we do Beam 3.0, then we're stuck waiting around till Beam 4.0.
Reuven On Sun, Feb 4, 2018 at 9:27 AM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote: > That's a good point. In the roadmap for Beam 3, I think it makes sense to > add a > point about this. > > Regards > JB > > On 02/04/2018 06:18 PM, Eugene Kirpichov wrote: > > I think doing a change that would break pipeline update for every single > user of > > Flink and Dataflow needs to be postponed until a next major version. > Pipeline > > update is a very frequently used feature, especially by the largest > users. We've > > had those users get significantly upset even when we accidentally broke > update > > compatibility for some special cases of individual transforms; breaking > it > > intentionally and project-wide is too extreme to be justified by the > benefits of > > the current change. > > > > That said, I think concerns about coder APIs are reasonable, and it is > > unfortunate that we effectively can't make changes to them right now. It > would > > be great if in the next major version we were better prepared for > evolution of > > coders, e.g. by having coders support a version marker or something like > that, > > with an API for detecting the version of data on wire and reading or > writing > > data of an old version. Such a change (introducing versioning) would > also, of > > course, be incompatible and would need to be postponed until a major > version - > > but, at least, subsequent changes wouldn't. > > > > ...And as I was typing this email, seems that this is what the thread > already > > came to! > > > > On Sun, Feb 4, 2018 at 9:16 AM Romain Manni-Bucau <rmannibu...@gmail.com > > <mailto:rmannibu...@gmail.com>> wrote: > > > > I like this idea of migration support at coder level. It would > require to > > add a metadata in all outputs which would represent the version then > coders > > can handle the logic properly depending the version - we can assume > a coder > > dev upgrade the version when he breaks the representation I hope ;). > > With this: no runner impact at all :). > > > > > > Romain Manni-Bucau > > @rmannibucau <https://twitter.com/rmannibucau> | Blog > > <https://rmannibucau.metawerx.net/> | Old Blog > > <http://rmannibucau.wordpress.com> | Github > > <https://github.com/rmannibucau> | LinkedIn > > <https://www.linkedin.com/in/rmannibucau> | Book > > <https://www.packtpub.com/application-development/java- > ee-8-high-performance> > > > > 2018-02-04 18:09 GMT+01:00 Reuven Lax <re...@google.com > > <mailto:re...@google.com>>: > > > > It would already break quite a number of users at this point. > > > > I think what we should be doing is moving forward on the > snapshot/update > > proposal. That proposal actually provides a way forward when > coders > > change (it proposes a way to map an old snapshot to one using > the new > > coder, so changes to coders in the future will be much easier to > make. > > However much of the implementation for this will likely be at > the runner > > level, not the SDK level. > > > > Reuven > > > > On Sun, Feb 4, 2018 at 9:04 AM, Romain Manni-Bucau > > <rmannibu...@gmail.com <mailto:rmannibu...@gmail.com>> wrote: > > > > I fully understand that, and this is one of the reason > managing to > > solve these issues is very important and ASAP. My conclusion > is that > > we must break it now to avoid to do it later when usage will > be way > > more developped - I would be very happy to be wrong on that > point - > > so I started this PR and this thread. We can postpone it but > it > > would break later so for probably more users. > > > > > > Romain Manni-Bucau > > @rmannibucau <https://twitter.com/rmannibucau> | Blog > > <https://rmannibucau.metawerx.net/> | Old Blog > > <http://rmannibucau.wordpress.com> | Github > > <https://github.com/rmannibucau> | LinkedIn > > <https://www.linkedin.com/in/rmannibucau> | Book > > <https://www.packtpub.com/application-development/java- > ee-8-high-performance> > > > > 2018-02-04 17:49 GMT+01:00 Reuven Lax <re...@google.com > > <mailto:re...@google.com>>: > > > > Unfortunately several runners (at least Flink and > Dataflow) > > support in-place update of streaming pipelines as a key > feature, > > and changing coder format breaks this. This is a very > important > > feature of both runners, and we should endeavor not to > break them. > > > > In-place snapshot and update is also a top-level Beam > proposal > > that was received positively, though neither of those > runners > > yet implement the proposed interface. > > > > Reuven > > > > On Sun, Feb 4, 2018 at 8:44 AM, Romain Manni-Bucau > > <rmannibu...@gmail.com <mailto:rmannibu...@gmail.com>> > wrote: > > > > Sadly yes, and why the PR is actually WIP. As > mentionned it > > modifies it and requires some updates in other > languages and > > the standard_coders.yml file (I didn't find how this > file > > was generated). > > Since coders must be about volatile data I don't > think it is > > a big deal to change it though. > > > > > > Romain Manni-Bucau > > @rmannibucau <https://twitter.com/rmannibucau> | > Blog > > <https://rmannibucau.metawerx.net/> | Old Blog > > <http://rmannibucau.wordpress.com> | Github > > <https://github.com/rmannibucau> | LinkedIn > > <https://www.linkedin.com/in/rmannibucau> | Book > > <https://www.packtpub.com/ > application-development/java-ee-8-high-performance> > > > > 2018-02-04 17:34 GMT+01:00 Reuven Lax < > re...@google.com > > <mailto:re...@google.com>>: > > > > One question - does this change the actual byte > encoding > > of elements? We've tried hard not to do that so > far for > > reasons of compatibility. > > > > Reuven > > > > On Sun, Feb 4, 2018 at 6:44 AM, Romain > Manni-Bucau > > <rmannibu...@gmail.com <mailto: > rmannibu...@gmail.com>> > > wrote: > > > > Hi guys, > > > > I submitted a PR on coders to enhance 1. the > user > > experience 2. the determinism and handling > of coders. > > > > 1. the user experience is linked to what i > sent some > > days ago: close handling of the streams from > a coder > > code. Long story short I add a > SkipCloseCoder which > > can decorate a coder and just wraps the > stream > > (input or output) in flavors skipping > close() calls. > > This avoids to do it by default (which had my > > preference if you read the related thread > but not > > the one of everybody) but also makes the > usage of a > > coder with this issue easy since the of() of > the > > coder just wraps itself in this delagating > coder. > > > > 2. this one is more nasty and mainly concerns > > IterableLikeCoders. These ones use this kind > of > > algorithm (keep in mind they work on a list): > > > > writeSize() > > for all element e { > > elementCoder.write(e) > > } > > writeMagicNumber() // this one depends the > size > > > > The decoding is symmetric so I bypass it > here. > > > > Indeed all these writes (reads) are done on > the same > > stream. Therefore it assumes you read as > much bytes > > than you write...which is a huge assumption > for a > > coder which should by contract assume it can > read > > the stream...as a stream (until -1). > > > > The idea of the fix is to change this > encoding to > > this kind of algorithm: > > > > writeSize() > > for all element e { > > writeElementByteCount(e) > > elementCoder.write(e) > > } > > writeMagicNumber() // still optionally > > > > This way on the decode size you can wrap the > stream > > by element to enforce the limitation of the > byte count. > > > > Side note: this indeed enforce a limitation > due to > > java byte limitation but if you check coder > code it > > is already here at the higher level so it is > not a > > big deal for now. > > > > In terms of implementation it uses a > > LengthAwareCoder which delegates to another > coder > > the encoding and just adds the byte count > before the > > actual serialization. Not perfect but should > be more > > than enough in terms of support and perf for > beam if > > you think real pipelines (we try to avoid > > serializations or it is done on some well > known > > points where this algo should be > enough...worse case > > it is not a huge overhead, mainly just some > memory > > overhead). > > > > > > The PR is available > > at https://github.com/apache/beam/pull/4594. > If you > > check you will see I put it "WIP". The main > reason > > is that it changes the encoding format for > > containers (lists, iterable, ...) and > therefore > > breaks python/go/... tests and the > > standard_coders.yml definition. Some help on > that > > would be very welcomed. > > > > Technical side note if you > > wonder: UnownedInputStream doesn't even > allow to > > mark the stream so there is no real fast way > to read > > the stream as fast as possible with standard > > buffering strategies and to support this > automatic > > IterableCoder wrapping which is implicit. In > other > > words, if beam wants to support any coder, > including > > the ones not requiring to write the size of > the > > output - most of the codecs - then we need > to change > > the way it works to something like that > which does > > it for the user which doesn't know its coder > got > > wrapped. > > > > Hope it makes sense, if not, don't hesitate > to ask > > questions. > > > > Happy end of week-end. > > > > Romain Manni-Bucau > > @rmannibucau <https://twitter.com/ > rmannibucau> | > > Blog <https://rmannibucau.metawerx.net/> | > Old Blog > > <http://rmannibucau.wordpress.com> | Github > > <https://github.com/rmannibucau> | LinkedIn > > <https://www.linkedin.com/in/rmannibucau> | > Book > > <https://www.packtpub.com/ > application-development/java-ee-8-high-performance> > > > > > > > > > > > > > > > > -- > Jean-Baptiste Onofré > jbono...@apache.org > http://blog.nanthrax.net > Talend - http://www.talend.com >