I like this idea of migration support at the coder level. It would require
adding a piece of metadata to all outputs representing the version; coders can
then handle the logic properly depending on the version - we can assume a coder
developer bumps the version when they break the representation, I hope ;).
With this: no runner impact at all :).
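To make the idea concrete, here is a minimal sketch of such a version-prefixed
coder. All names here (VersionedCoder, the simplified Coder interface) are
illustrative, not Beam's actual API - in Beam this would decorate
org.apache.beam.sdk.coders.Coder instead:

```java
import java.io.*;

// Simplified stand-in for a Beam coder, just for this sketch.
interface Coder<T> {
    void encode(T value, OutputStream out) throws IOException;
    T decode(InputStream in) throws IOException;
}

// Hypothetical decorator: prefixes every encoded value with a version byte so
// a future decoder can branch on it and run migration logic for old payloads.
final class VersionedCoder<T> implements Coder<T> {
    private final int version;       // bumped when the representation changes
    private final Coder<T> delegate; // the actual element coder

    VersionedCoder(int version, Coder<T> delegate) {
        this.version = version;
        this.delegate = delegate;
    }

    @Override
    public void encode(T value, OutputStream out) throws IOException {
        out.write(version);          // one byte of metadata per output
        delegate.encode(value, out);
    }

    @Override
    public T decode(InputStream in) throws IOException {
        int seen = in.read();
        if (seen != version) {
            // migration hook: dispatch to legacy decoding logic here
            throw new IOException("Unsupported coder version: " + seen);
        }
        return delegate.decode(in);
    }
}
```

The runner only ever sees bytes, which is why this stays invisible to it: the
version negotiation happens entirely inside the coder.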

Romain Manni-Bucau
@rmannibucau <https://twitter.com/rmannibucau> |  Blog
<https://rmannibucau.metawerx.net/> | Old Blog
<http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book

2018-02-04 18:09 GMT+01:00 Reuven Lax <re...@google.com>:

> It would already break quite a number of users at this point.
> I think what we should be doing is moving forward on the snapshot/update
> proposal. That proposal actually provides a way forward when coders change
> (it proposes a way to map an old snapshot to one using the new coder), so
> changes to coders in the future will be much easier to make. However, much
> of the implementation for this will likely be at the runner level, not the
> SDK level.
> Reuven
> On Sun, Feb 4, 2018 at 9:04 AM, Romain Manni-Bucau <rmannibu...@gmail.com>
> wrote:
>> I fully understand that, and this is one of the reasons why solving these
>> issues matters and should happen ASAP. My conclusion is that we must break
>> it now to avoid doing it later, when usage will be far more developed - I
>> would be very happy to be wrong on that point - so I started this PR and
>> this thread. We can postpone it, but then it would break later, and for
>> probably more users.
>> Romain Manni-Bucau
>> 2018-02-04 17:49 GMT+01:00 Reuven Lax <re...@google.com>:
>>> Unfortunately several runners (at least Flink and Dataflow) support
>>> in-place update of streaming pipelines as a key feature, and changing the
>>> coder format breaks it. This is a very important feature of both runners,
>>> and we should endeavor not to break them.
>>> In-place snapshot and update is also a top-level Beam proposal that was
>>> received positively, though neither of those runners yet implement the
>>> proposed interface.
>>> Reuven
>>> On Sun, Feb 4, 2018 at 8:44 AM, Romain Manni-Bucau <
>>> rmannibu...@gmail.com> wrote:
>>>> Sadly yes, which is why the PR is still WIP. As mentioned, it modifies
>>>> the encoding and requires some updates in the other languages and in the
>>>> standard_coders.yml file (I didn't find how this file is generated).
>>>> Since coders should be about volatile data, I don't think it is a big
>>>> deal to change it though.
>>>> Romain Manni-Bucau
>>>> 2018-02-04 17:34 GMT+01:00 Reuven Lax <re...@google.com>:
>>>>> One question - does this change the actual byte encoding of elements?
>>>>> We've tried hard not to do that so far for reasons of compatibility.
>>>>> Reuven
>>>>> On Sun, Feb 4, 2018 at 6:44 AM, Romain Manni-Bucau <
>>>>> rmannibu...@gmail.com> wrote:
>>>>>> Hi guys,
>>>>>> I submitted a PR on coders to improve 1. the user experience and 2. the
>>>>>> determinism and handling of coders.
>>>>>> 1. The user experience point relates to what I sent some days ago:
>>>>>> close() handling of the streams from coder code. Long story short, I
>>>>>> add a SkipCloseCoder which can decorate a coder and wraps the stream
>>>>>> (input or output) in flavors that skip close() calls. This avoids doing
>>>>>> it by default (which was my preference if you read the related thread,
>>>>>> but not everybody's), but it also makes using a coder with this issue
>>>>>> easy, since the coder's of() just wraps itself in this delegating coder.
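The "skip close()" stream flavor being described can be sketched roughly like
this - a minimal, hypothetical version (the class name and shape are
illustrative, not the PR's actual API):

```java
import java.io.*;

// Hypothetical sketch: an OutputStream wrapper that flushes but never closes
// the underlying stream, so a coder that calls close() on the stream it was
// handed cannot terminate the shared stream owned by the framework.
final class SkipCloseOutputStream extends FilterOutputStream {
    SkipCloseOutputStream(OutputStream delegate) {
        super(delegate);
    }

    @Override
    public void close() throws IOException {
        // Intentionally do NOT close the delegate; only flush what was written.
        flush();
    }
}
```

A SkipCloseCoder would then hand its delegate coder this wrapper (and an
analogous InputStream flavor) instead of the raw stream.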
>>>>>> 2. This one is nastier and mainly concerns IterableLikeCoders. These
>>>>>> use this kind of algorithm (keep in mind they work on a list):
>>>>>> writeSize()
>>>>>> for all element e {
>>>>>>     elementCoder.write(e)
>>>>>> }
>>>>>> writeMagicNumber() // this one depends on the size
>>>>>> The decoding is symmetric, so I skip it here. Indeed, all these writes
>>>>>> (reads) are done on the same stream, so it assumes you read exactly as
>>>>>> many bytes as you wrote... which is a huge assumption for a coder that
>>>>>> should, by contract, be able to read the stream as a stream (until -1).
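The over-read hazard can be demonstrated in a few lines. This is an
illustrative sketch (EofReader is a made-up name): two elements are written
back-to-back on one stream, and a decoder that honestly reads "until -1"
swallows the second element's bytes along with the first.

```java
import java.io.*;

// Hypothetical demonstration of the problem: a decoder that follows the
// stream contract (read until EOF) cannot know where one element ends when
// several elements share the same unbounded stream.
final class EofReader {
    static byte[] readUntilEof(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) { // stream contract: consume to EOF
            buffer.write(b);
        }
        return buffer.toByteArray();
    }
}
```

Decoding the "first" element this way consumes the entire stream, which is
exactly why the current scheme only works for coders that count their bytes.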
>>>>>> The idea of the fix is to change this encoding to this kind of
>>>>>> algorithm:
>>>>>> writeSize()
>>>>>> for all element e {
>>>>>>     writeElementByteCount(e)
>>>>>>     elementCoder.write(e)
>>>>>> }
>>>>>> writeMagicNumber() // still optional
>>>>>> This way, on the decode side you can wrap the stream per element to
>>>>>> enforce the byte-count limit.
>>>>>> Side note: this indeed enforces a size limitation due to Java's byte
>>>>>> count representation, but if you check the coder code the same limit is
>>>>>> already there at a higher level, so it is not a big deal for now.
>>>>>> In terms of implementation it uses a LengthAwareCoder which delegates
>>>>>> the encoding to another coder and just adds the byte count before the
>>>>>> actual serialization. Not perfect, but it should be more than enough in
>>>>>> terms of support and performance for Beam if you think of real
>>>>>> pipelines (we try to avoid serializations, or they happen at a few
>>>>>> well-known points where this algorithm should be enough... worst case
>>>>>> it is not a huge overhead, mainly just some memory overhead).
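The length-prefixed scheme described above can be sketched as follows. The
class name mirrors the mail, but the API shown is an assumption, not the PR's
actual code:

```java
import java.io.*;

// Hypothetical sketch of the fix: before each element's bytes we write their
// count, so the decoder can perform a bounded read per element and never
// consume past the element's end on the shared stream.
final class LengthAwareCoder {
    // Encode side: the element is buffered first, then written as
    // "byte count + bytes" (hence the memory overhead mentioned above).
    static void encodeElement(byte[] elementBytes, DataOutputStream out) throws IOException {
        out.writeInt(elementBytes.length); // the added byte count
        out.write(elementBytes);           // the actual serialization
    }

    // Decode side: read the count, then expose exactly that many bytes.
    static byte[] decodeElement(DataInputStream in) throws IOException {
        int length = in.readInt();
        byte[] elementBytes = new byte[length];
        in.readFully(elementBytes);        // bounded read: cannot over-read
        return elementBytes;
    }
}
```

Using an int for the count is what produces the size limit mentioned in the
side note above.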
>>>>>> The PR is available at https://github.com/apache/beam/pull/4594. If
>>>>>> you check it you will see I marked it "WIP". The main reason is that it
>>>>>> changes the encoding format for containers (lists, iterables, ...) and
>>>>>> therefore breaks the python/go/... tests and the standard_coders.yml
>>>>>> definition. Some help on that would be very welcome.
>>>>>> Technical side note if you wonder: UnownedInputStream doesn't even
>>>>>> allow marking the stream, so there is no fast way to read the stream
>>>>>> with standard buffering strategies while also supporting the implicit,
>>>>>> automatic IterableCoder wrapping. In other words, if Beam wants to
>>>>>> support any coder, including ones that don't write the size of their
>>>>>> output - most codecs - then we need to change the way it works to
>>>>>> something like this, which does the job for the user, who doesn't know
>>>>>> their coder got wrapped.
>>>>>> Hope it makes sense, if not, don't hesitate to ask questions.
>>>>>> Happy end of week-end.
>>>>>> Romain Manni-Bucau
