I fully understand that, and this is one of the reasons solving these
issues is very important, and ASAP. My conclusion is that we must break it
now to avoid doing it later, when usage will be far more developed
- I would be very happy to be wrong on that point - so I started this PR
and this thread. We can postpone it, but then it would break later, for
probably many more users.

Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book
<https://www.packtpub.com/application-development/java-ee-8-high-performance>

2018-02-04 17:49 GMT+01:00 Reuven Lax <re...@google.com>:

> Unfortunately, several runners (at least Flink and Dataflow) support
> in-place update of streaming pipelines as a key feature, and changing the
> coder format breaks this. It is a very important feature of both runners,
> and we should endeavor not to break it.
> In-place snapshot and update is also a top-level Beam proposal that was
> received positively, though neither of those runners yet implement the
> proposed interface.
> Reuven
> On Sun, Feb 4, 2018 at 8:44 AM, Romain Manni-Bucau <rmannibu...@gmail.com>
> wrote:
>> Sadly yes, which is why the PR is still marked WIP. As mentioned, it
>> modifies the encoding and requires some updates in the other languages and
>> in the standard_coders.yml file (I didn't find how this file is generated).
>> Since coders should only deal with volatile data, I don't think changing
>> it is a big deal, though.
>> 2018-02-04 17:34 GMT+01:00 Reuven Lax <re...@google.com>:
>>> One question - does this change the actual byte encoding of elements?
>>> We've tried hard not to do that so far for reasons of compatibility.
>>> Reuven
>>> On Sun, Feb 4, 2018 at 6:44 AM, Romain Manni-Bucau <
>>> rmannibu...@gmail.com> wrote:
>>>> Hi guys,
>>>> I submitted a PR on coders to enhance 1. the user experience and 2. the
>>>> determinism and handling of coders.
>>>> 1. The user-experience point is linked to what I sent some days ago:
>>>> the close() handling of streams inside coder code. Long story short, I
>>>> added a SkipCloseCoder which can decorate a coder and wraps the stream
>>>> (input or output) in a flavor that skips close() calls. This avoids doing
>>>> it by default (which was my preference, if you read the related thread,
>>>> but not everybody's), and it also makes it easy to use a coder with this
>>>> issue, since the coder's of() just wraps itself in this delegating coder.
>>>> 2. This one is nastier and mainly concerns IterableLikeCoders. These
>>>> coders use this kind of algorithm (keep in mind they work on a list):
>>>> writeSize()
>>>> for all element e {
>>>>     elementCoder.write(e)
>>>> }
>>>> writeMagicNumber() // this one depends on the size
>>>> The decoding is symmetric, so I skip it here.
>>>> Indeed, all these writes (and reads) are done on the same stream.
>>>> Therefore it assumes you read exactly as many bytes as you wrote... which
>>>> is a huge assumption for a coder, which by contract should be able to
>>>> read the stream... as a stream (until -1).
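A toy demonstration of that over-read problem (illustrative only, not Beam code): an element codec that legitimately reads until -1 cannot be composed on a shared stream, because it swallows the following element's bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class OverReadDemo {
    // An element codec that, per its own contract, reads until end of stream.
    static byte[] readToEof(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) {
            out.write(b);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Two 3-byte "elements" written back-to-back on one shared stream.
        InputStream shared = new ByteArrayInputStream("foobar".getBytes("UTF-8"));
        byte[] first = readToEof(shared); // meant to decode only "foo"...
        // ...but it consumed both elements, leaving nothing for the next read:
        System.out.println(new String(first, "UTF-8")); // prints "foobar"
    }
}
```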
>>>> The idea of the fix is to change this encoding to this kind of
>>>> algorithm:
>>>> writeSize()
>>>> for all element e {
>>>>     writeElementByteCount(e)
>>>>     elementCoder.write(e)
>>>> }
>>>> writeMagicNumber() // still optional
>>>> This way, on the decode side you can wrap the stream per element to
>>>> enforce the byte-count limit.
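The length-prefixed layout above can be sketched as follows; the class and method names, and the fixed 4-byte count, are assumptions for illustration, not the PR's actual format:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Sketch of the proposed encoding: each element is preceded by its byte
// count, so the decoder knows exactly how far each element extends.
public class LengthPrefixedSketch {
    static void encode(DataOutputStream out, byte[][] elements) throws IOException {
        out.writeInt(elements.length);          // writeSize()
        for (byte[] e : elements) {
            out.writeInt(e.length);             // writeElementByteCount(e)
            out.write(e);                       // elementCoder.write(e)
        }
    }

    static byte[][] decode(DataInputStream in) throws IOException {
        byte[][] elements = new byte[in.readInt()][];
        for (int i = 0; i < elements.length; i++) {
            byte[] e = new byte[in.readInt()];
            in.readFully(e); // bounded read: the element coder cannot over-read
            elements[i] = e;
        }
        return elements;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        encode(new DataOutputStream(buffer),
                new byte[][] {"foo".getBytes(), "bar".getBytes()});
        byte[][] decoded = decode(
                new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
        System.out.println(new String(decoded[0]) + "," + new String(decoded[1])); // foo,bar
    }
}
```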
>>>> Side note: this does introduce a size limitation due to Java's int-based
>>>> byte counts, but if you check the coder code, that limitation already
>>>> exists at a higher level, so it is not a big deal for now.
>>>> In terms of implementation, it uses a LengthAwareCoder which delegates
>>>> the encoding to another coder and just adds the byte count before the
>>>> actual serialization. Not perfect, but it should be more than enough in
>>>> terms of support and performance for Beam if you think of real pipelines
>>>> (we try to avoid serialization, or it is done at some well-known points
>>>> where this algorithm should be enough... worst case it is not a huge
>>>> overhead, mainly just some memory overhead).
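On the decode side, "wrapping the stream per element" can be illustrated with a bounded-stream view that reports -1 at the element boundary; the class name is assumed, and only the single-byte read() path is shown for brevity:

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Illustrative bounded view over a shared stream: an element coder reading
// "until -1" now stops at the element boundary instead of stream EOF.
public class BoundedStreamSketch {
    static final class BoundedInputStream extends FilterInputStream {
        private long remaining;

        BoundedInputStream(InputStream delegate, long byteCount) {
            super(delegate);
            this.remaining = byteCount;
        }

        @Override
        public int read() throws IOException {
            if (remaining <= 0) {
                return -1; // simulate EOF at the element boundary
            }
            int b = super.read();
            if (b != -1) {
                remaining--;
            }
            return b;
        }
    }

    public static void main(String[] args) throws IOException {
        InputStream shared = new ByteArrayInputStream("foobar".getBytes("UTF-8"));
        InputStream firstElement = new BoundedInputStream(shared, 3); // byte count read from the prefix
        StringBuilder sb = new StringBuilder();
        int b;
        while ((b = firstElement.read()) != -1) {
            sb.append((char) b);
        }
        System.out.println(sb); // prints "foo"; "bar" is left on the shared stream
    }
}
```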
>>>> The PR is available at https://github.com/apache/beam/pull/4594. If you
>>>> look at it you will see I marked it "WIP". The main reason is that it
>>>> changes the encoding format for containers (lists, iterables, ...) and
>>>> therefore breaks the Python/Go/... tests and the standard_coders.yml
>>>> definition. Some help on that would be very welcome.
>>>> Technical side note, if you wonder: UnownedInputStream doesn't even
>>>> allow marking the stream, so there is no really fast way to read the
>>>> stream with standard buffering strategies while also supporting the
>>>> implicit, automatic IterableCoder wrapping. In other words, if Beam wants
>>>> to support any coder, including those that don't write the size of their
>>>> output - most codecs - then we need to change the way it works to
>>>> something like this, which does it on behalf of the user, who doesn't
>>>> know their coder got wrapped.
>>>> Hope it makes sense; if not, don't hesitate to ask questions.
>>>> Happy end of weekend.
