Unfortunately, several runners (at least Flink and Dataflow) support in-place update of streaming pipelines as a key feature, and changing the coder format breaks this. It is a very important feature of both runners, and we should endeavor not to break it.
In-place snapshot and update is also a top-level Beam proposal that was received positively, though neither of those runners yet implements the proposed interface.

Reuven

On Sun, Feb 4, 2018 at 8:44 AM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:

> Sadly yes, and that is why the PR is still marked WIP. As mentioned, it
> modifies the format and requires some updates in the other languages and
> in the standard_coders.yml file (I didn't find how this file is generated).
> Since coders should only deal with volatile data, I don't think changing
> it is a big deal though.
>
>
> Romain Manni-Bucau
> @rmannibucau <https://twitter.com/rmannibucau> | Blog
> <https://rmannibucau.metawerx.net/> | Old Blog
> <http://rmannibucau.wordpress.com> | Github
> <https://github.com/rmannibucau> | LinkedIn
> <https://www.linkedin.com/in/rmannibucau> | Book
> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>
> 2018-02-04 17:34 GMT+01:00 Reuven Lax <re...@google.com>:
>
>> One question - does this change the actual byte encoding of elements?
>> We've tried hard not to do that so far, for reasons of compatibility.
>>
>> Reuven
>>
>> On Sun, Feb 4, 2018 at 6:44 AM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>>
>>> Hi guys,
>>>
>>> I submitted a PR on coders to enhance 1. the user experience and
>>> 2. the determinism and handling of coders.
>>>
>>> 1. The user-experience part is linked to what I sent some days ago:
>>> handling of close() on the streams from coder code. Long story short,
>>> I added a SkipCloseCoder which can decorate a coder and wraps the
>>> stream (input or output) in a flavor that skips close() calls. This
>>> avoids doing it by default (which had my preference if you read the
>>> related thread, but not everybody's), and it also makes it easy to use
>>> a coder with this issue, since the coder's of() just wraps itself in
>>> this delegating coder.
>>>
>>> 2. This one is nastier and mainly concerns IterableLikeCoders.
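The close-skipping wrapper from point 1 above can be sketched as a stream decorator. This is a hypothetical illustration of the idea, not code from the PR; the class name `CloseSkippingOutputStream` is made up for the sketch:

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

/**
 * Hypothetical sketch of the stream wrapper behind a SkipCloseCoder:
 * it forwards all writes, but turns close() into a flush(), so a
 * wrapped coder that closes the stream it was handed cannot close the
 * runner-owned underlying stream.
 */
class CloseSkippingOutputStream extends FilterOutputStream {
    CloseSkippingOutputStream(OutputStream out) {
        super(out);
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        // Forward bulk writes directly; FilterOutputStream's default
        // would degrade to one write(int) call per byte.
        out.write(b, off, len);
    }

    @Override
    public void close() throws IOException {
        // Deliberately do NOT close the underlying stream; just flush it.
        out.flush();
    }
}
```

A symmetric wrapper over `InputStream` would ignore close() on the read side the same way.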
These coders use this kind of algorithm (keep in mind they work on a list):
>>>
>>> writeSize()
>>> for all elements e {
>>>   elementCoder.write(e)
>>> }
>>> writeMagicNumber() // this one depends on the size
>>>
>>> The decoding is symmetric, so I skip it here.
>>>
>>> Indeed, all these writes (and reads) are done on the same stream. It
>>> therefore assumes you read exactly as many bytes as you wrote... which
>>> is a huge assumption for a coder which, by contract, should be able to
>>> read the stream as a stream (until -1).
>>>
>>> The idea of the fix is to change this encoding to this kind of
>>> algorithm:
>>>
>>> writeSize()
>>> for all elements e {
>>>   writeElementByteCount(e)
>>>   elementCoder.write(e)
>>> }
>>> writeMagicNumber() // still optional
>>>
>>> This way, on the decode side, you can wrap the stream per element to
>>> enforce the byte-count limit.
>>>
>>> Side note: this does impose a size limitation coming from Java's byte
>>> counts, but if you check the coder code, that limit is already there at
>>> a higher level, so it is not a big deal for now.
>>>
>>> In terms of implementation, it uses a LengthAwareCoder which delegates
>>> the encoding to another coder and just adds the byte count before the
>>> actual serialization. Not perfect, but it should be more than enough in
>>> terms of support and performance for Beam if you think of real
>>> pipelines (we try to avoid serializations, or they are done at some
>>> well-known points where this algorithm should be enough... worst case
>>> it is not a huge overhead, mainly just some memory overhead).
>>>
>>> The PR is available at https://github.com/apache/beam/pull/4594. If
>>> you check it, you will see I marked it "WIP". The main reason is that
>>> it changes the encoding format for containers (lists, iterables, ...)
>>> and therefore breaks the python/go/... tests and the
>>> standard_coders.yml definition. Some help on that would be very
>>> welcome.
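The length-prefixed scheme described above can be shown as a minimal, self-contained round trip. This is purely illustrative (it uses UTF-8 strings in place of a real element coder, and the class name is made up); the names `writeSize` and `writeElementByteCount` from the message map to the two `writeInt` calls:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Illustrative length-prefixed container encoding; not the actual Beam code. */
class LengthPrefixedListCoder {

    /** writeSize(), then for each element: writeElementByteCount(e) + payload. */
    static byte[] encode(List<String> elements) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(elements.size());                    // writeSize()
        for (String e : elements) {
            // Encode into a scratch buffer first so the byte count is
            // known before the payload (stand-in for elementCoder.write).
            byte[] payload = e.getBytes(StandardCharsets.UTF_8);
            out.writeInt(payload.length);                 // writeElementByteCount(e)
            out.write(payload);
        }
        return bytes.toByteArray();
    }

    /** The decode side never has to guess where an element ends. */
    static List<String> decode(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        int size = in.readInt();
        List<String> result = new ArrayList<>();
        for (int i = 0; i < size; i++) {
            byte[] payload = new byte[in.readInt()];      // prefix bounds the read
            in.readFully(payload);
            result.add(new String(payload, StandardCharsets.UTF_8));
        }
        return result;
    }
}
```

Note how the `int` prefixes are where the Java size limitation mentioned in the side note comes from.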
>>> Technical side note, if you wonder: UnownedInputStream doesn't even
>>> allow marking the stream, so there is no really fast way to read the
>>> stream with standard buffering strategies while supporting this
>>> automatic, implicit IterableCoder wrapping. In other words, if Beam
>>> wants to support any coder, including ones that don't write the size
>>> of their output - most codecs - then we need to change the way it
>>> works to something like this, which does it on behalf of the user,
>>> whose coder got wrapped without their knowledge.
>>>
>>> Hope it makes sense; if not, don't hesitate to ask questions.
>>>
>>> Happy end of week-end.
>>>
>>> Romain Manni-Bucau
>>> @rmannibucau <https://twitter.com/rmannibucau> | Blog
>>> <https://rmannibucau.metawerx.net/> | Old Blog
>>> <http://rmannibucau.wordpress.com> | Github
>>> <https://github.com/rmannibucau> | LinkedIn
>>> <https://www.linkedin.com/in/rmannibucau> | Book
>>> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
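The "wrap the stream per element" decode-side idea from the thread can be illustrated with a bounded stream decorator. Again a hypothetical sketch, not Beam's implementation: it exposes at most a fixed number of bytes of the underlying stream and then reports EOF, so an element coder that reads "until -1" cannot run past its slice.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/**
 * Illustrative bounded-stream wrapper: presents at most `limit` bytes
 * of the underlying stream, then reports EOF (-1), so an element coder
 * that reads until end-of-stream stays within its length-prefixed slice.
 */
class BoundedInputStream extends FilterInputStream {
    private long remaining;

    BoundedInputStream(InputStream in, long limit) {
        super(in);
        this.remaining = limit;
    }

    @Override
    public int read() throws IOException {
        if (remaining <= 0) {
            return -1; // slice exhausted: report EOF, leave the rest untouched
        }
        int b = in.read();
        if (b >= 0) {
            remaining--;
        }
        return b;
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        if (remaining <= 0) {
            return -1;
        }
        int n = in.read(b, off, (int) Math.min(len, remaining));
        if (n > 0) {
            remaining -= n;
        }
        return n;
    }
}
```

On decode, the per-element byte count read from the stream would be passed as `limit`, and a fresh wrapper created for each element.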