Would this work for everyone - can update the pr if so: If coder is not built in Prefix with byte size Else Current behavior
? Le 5 févr. 2018 19:21, "Romain Manni-Bucau" <rmannibu...@gmail.com> a écrit : > Answered inlined but I want to highlight beam is a portable API on top of > well known vendors API which have friendly shortcuts. So the background > here is to make beam at least user friendly. > > Im fine if the outcome of the discussion is coder concept is wrong or > something like that but Im not fine to say we dont want to solve an API > issue, to not say bug, of a project which has an API as added value. > > I understand the perf concern which must be balanced with the fact > derialization is not used for each step/transform and that currently the > coder API is already intrusive and heavy for dev but also not usable by > most existing codecs out there. Even some jaxb or plain xml flavors dont > work with it :(. > > Le 5 févr. 2018 18:46, "Robert Bradshaw" <rober...@google.com> a écrit : > > On Sun, Feb 4, 2018 at 6:44 AM, Romain Manni-Bucau > <rmannibu...@gmail.com> wrote: > > Hi guys, > > > > I submitted a PR on coders to enhance 1. the user experience 2. the > > determinism and handling of coders. > > > > 1. the user experience is linked to what i sent some days ago: close > > handling of the streams from a coder code. Long story short I add a > > SkipCloseCoder which can decorate a coder and just wraps the stream > (input > > or output) in flavors skipping close() calls. This avoids to do it by > > default (which had my preference if you read the related thread but not > the > > one of everybody) but also makes the usage of a coder with this issue > easy > > since the of() of the coder just wraps itself in this delagating coder. > > > > 2. this one is more nasty and mainly concerns IterableLikeCoders. These > ones > > use this kind of algorithm (keep in mind they work on a list): > > > > writeSize() > > for all element e { > > elementCoder.write(e) > > } > > writeMagicNumber() // this one depends the size > > > > The decoding is symmetric so I bypass it here. > > > > Indeed all these writes (reads) are done on the same stream. Therefore it > > assumes you read as much bytes than you write...which is a huge > assumption > > for a coder which should by contract assume it can read the stream...as a > > stream (until -1). > > > > The idea of the fix is to change this encoding to this kind of algorithm: > > > > writeSize() > > for all element e { > > writeElementByteCount(e) > > elementCoder.write(e) > > } > > writeMagicNumber() // still optionally > > Regardless of the backwards incompatibility issues, I'm unconvinced > that prefixing every element with its length is a good idea. It can > lead to blow-up in size (e.g. a list of ints, and it should be noted > that containers with lots of elements bias towards having small > elements) and also writeElementByteCount(e) could be very inefficient > for many type e (e.g. a list of lists). > > > What is your proposal Robert then? Current restriction is clearly a > blocker for portability, users, determinism and is unsafe and only > checkable at runtime so not something we should lead to keep. > > Alternative i thought about was to forbid implicit coders but it doesnt > help users. > > > > > This way on the decode size you can wrap the stream by element to enforce > > the limitation of the byte count. > > > > Side note: this indeed enforce a limitation due to java byte limitation > but > > if you check coder code it is already here at the higher level so it is > not > > a big deal for now. > > > > In terms of implementation it uses a LengthAwareCoder which delegates to > > another coder the encoding and just adds the byte count before the actual > > serialization. Not perfect but should be more than enough in terms of > > support and perf for beam if you think real pipelines (we try to avoid > > serializations or it is done on some well known points where this algo > > should be enough...worse case it is not a huge overhead, mainly just some > > memory overhead). > > > > > > The PR is available at https://github.com/apache/beam/pull/4594. If you > > check you will see I put it "WIP". The main reason is that it changes the > > encoding format for containers (lists, iterable, ...) and therefore > breaks > > python/go/... tests and the standard_coders.yml definition. Some help on > > that would be very welcomed. > > > > Technical side note if you wonder: UnownedInputStream doesn't even allow > to > > mark the stream so there is no real fast way to read the stream as fast > as > > possible with standard buffering strategies and to support this automatic > > IterableCoder wrapping which is implicit. In other words, if beam wants > to > > support any coder, including the ones not requiring to write the size of > the > > output - most of the codecs - then we need to change the way it works to > > something like that which does it for the user which doesn't know its > coder > > got wrapped. > > > > Hope it makes sense, if not, don't hesitate to ask questions. > > > > Happy end of week-end. > > > > Romain Manni-Bucau > > @rmannibucau | Blog | Old Blog | Github | LinkedIn | Book > > >