Unfortunately, several runners (at least Flink and Dataflow) support in-place update of streaming pipelines as a key feature, and changing the coder format breaks this. It is a very important feature of both runners, and we should endeavor not to break it.
In-place snapshot and update is also a top-level Beam proposal that was received positively, though neither of those runners yet implements the proposed interface.

Reuven

On Sun, Feb 4, 2018 at 8:44 AM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:

> Sadly yes, and that is why the PR is still marked WIP. As mentioned, it
> modifies the format and requires some updates in the other languages and
> in the standard_coders.yml file (I didn't find how this file is generated).
> Since coders should only deal with volatile data, I don't think changing
> it is a big deal though.
>
>
> Romain Manni-Bucau
> @rmannibucau <https://twitter.com/rmannibucau> | Blog
> <https://rmannibucau.metawerx.net/> | Old Blog
> <http://rmannibucau.wordpress.com> | Github
> <https://github.com/rmannibucau> | LinkedIn
> <https://www.linkedin.com/in/rmannibucau> | Book
> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
>
> 2018-02-04 17:34 GMT+01:00 Reuven Lax <re...@google.com>:
>
>> One question - does this change the actual byte encoding of elements?
>> We've tried hard not to do that so far, for reasons of compatibility.
>>
>> Reuven
>>
>> On Sun, Feb 4, 2018 at 6:44 AM, Romain Manni-Bucau <rmannibu...@gmail.com> wrote:
>>
>>> Hi guys,
>>>
>>> I submitted a PR on coders to enhance 1. the user experience and
>>> 2. the determinism and handling of coders.
>>>
>>> 1. The user-experience part is linked to what I sent some days ago:
>>> handling of close() on the streams from coder code. Long story short,
>>> I added a SkipCloseCoder which can decorate a coder and wraps the
>>> stream (input or output) in a flavor that skips close() calls. This
>>> avoids doing it by default (which had my preference if you read the
>>> related thread, but not everybody's), and it also makes it easy to use
>>> a coder with this issue, since the coder's of() just wraps itself in
>>> this delegating coder.
>>>
>>> 2. This one is nastier and mainly concerns IterableLikeCoders.
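The close-skipping wrapper from point 1 above can be sketched as a stream decorator. This is a hypothetical illustration of the idea, not code from the PR; the class name `CloseSkippingOutputStream` is made up for the sketch:

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

/**
 * Hypothetical sketch of the stream wrapper behind a SkipCloseCoder:
 * it forwards all writes, but turns close() into a flush(), so a
 * wrapped coder that closes the stream it was handed cannot close the
 * runner-owned underlying stream.
 */
class CloseSkippingOutputStream extends FilterOutputStream {
    CloseSkippingOutputStream(OutputStream out) {
        super(out);
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        // Forward bulk writes directly; FilterOutputStream's default
        // would degrade to one write(int) call per byte.
        out.write(b, off, len);
    }

    @Override
    public void close() throws IOException {
        // Deliberately do NOT close the underlying stream; just flush it.
        out.flush();
    }
}
```

A symmetric wrapper over `InputStream` would ignore close() on the read side the same way.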
These coders use this kind of algorithm (keep in mind they work on a list):
>>>
>>> writeSize()
>>> for all elements e {
>>>   elementCoder.write(e)
>>> }
>>> writeMagicNumber() // this one depends on the size
>>>
>>> The decoding is symmetric, so I skip it here.
>>>
>>> Indeed, all these writes (and reads) are done on the same stream. It
>>> therefore assumes you read exactly as many bytes as you wrote... which
>>> is a huge assumption for a coder which, by contract, should be able to
>>> read the stream as a stream (until -1).
>>>
>>> The idea of the fix is to change this encoding to this kind of
>>> algorithm:
>>>
>>> writeSize()
>>> for all elements e {
>>>   writeElementByteCount(e)
>>>   elementCoder.write(e)
>>> }
>>> writeMagicNumber() // still optional
>>>
>>> This way, on the decode side, you can wrap the stream per element to
>>> enforce the byte-count limit.
>>>
>>> Side note: this does impose a size limitation coming from Java's byte
>>> counts, but if you check the coder code, that limit is already there at
>>> a higher level, so it is not a big deal for now.
>>>
>>> In terms of implementation, it uses a LengthAwareCoder which delegates
>>> the encoding to another coder and just adds the byte count before the
>>> actual serialization. Not perfect, but it should be more than enough in
>>> terms of support and performance for Beam if you think of real
>>> pipelines (we try to avoid serializations, or they are done at some
>>> well-known points where this algorithm should be enough... worst case
>>> it is not a huge overhead, mainly just some memory overhead).
>>>
>>> The PR is available at https://github.com/apache/beam/pull/4594. If
>>> you check it, you will see I marked it "WIP". The main reason is that
>>> it changes the encoding format for containers (lists, iterables, ...)
>>> and therefore breaks the python/go/... tests and the
>>> standard_coders.yml definition. Some help on that would be very
>>> welcome.
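The length-prefixed scheme described above can be shown as a minimal, self-contained round trip. This is purely illustrative (it uses UTF-8 strings in place of a real element coder, and the class name is made up); the names `writeSize` and `writeElementByteCount` from the message map to the two `writeInt` calls:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Illustrative length-prefixed container encoding; not the actual Beam code. */
class LengthPrefixedListCoder {

    /** writeSize(), then for each element: writeElementByteCount(e) + payload. */
    static byte[] encode(List<String> elements) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(elements.size());                    // writeSize()
        for (String e : elements) {
            // Encode into a scratch buffer first so the byte count is
            // known before the payload (stand-in for elementCoder.write).
            byte[] payload = e.getBytes(StandardCharsets.UTF_8);
            out.writeInt(payload.length);                 // writeElementByteCount(e)
            out.write(payload);
        }
        return bytes.toByteArray();
    }

    /** The decode side never has to guess where an element ends. */
    static List<String> decode(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        int size = in.readInt();
        List<String> result = new ArrayList<>();
        for (int i = 0; i < size; i++) {
            byte[] payload = new byte[in.readInt()];      // prefix bounds the read
            in.readFully(payload);
            result.add(new String(payload, StandardCharsets.UTF_8));
        }
        return result;
    }
}
```

Note how the `int` prefixes are where the Java size limitation mentioned in the side note comes from.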
>>> Technical side note, if you wonder: UnownedInputStream doesn't even
>>> allow marking the stream, so there is no really fast way to read the
>>> stream with standard buffering strategies while supporting this
>>> automatic, implicit IterableCoder wrapping. In other words, if Beam
>>> wants to support any coder, including ones that don't write the size
>>> of their output - most codecs - then we need to change the way it
>>> works to something like this, which does it on behalf of the user,
>>> whose coder got wrapped without their knowledge.
>>>
>>> Hope it makes sense; if not, don't hesitate to ask questions.
>>>
>>> Happy end of week-end.
>>>
>>> Romain Manni-Bucau
>>> @rmannibucau <https://twitter.com/rmannibucau> | Blog
>>> <https://rmannibucau.metawerx.net/> | Old Blog
>>> <http://rmannibucau.wordpress.com> | Github
>>> <https://github.com/rmannibucau> | LinkedIn
>>> <https://www.linkedin.com/in/rmannibucau> | Book
>>> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
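The "wrap the stream per element" decode-side idea from the thread can be illustrated with a bounded stream decorator. Again a hypothetical sketch, not Beam's implementation: it exposes at most a fixed number of bytes of the underlying stream and then reports EOF, so an element coder that reads "until -1" cannot run past its slice.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/**
 * Illustrative bounded-stream wrapper: presents at most `limit` bytes
 * of the underlying stream, then reports EOF (-1), so an element coder
 * that reads until end-of-stream stays within its length-prefixed slice.
 */
class BoundedInputStream extends FilterInputStream {
    private long remaining;

    BoundedInputStream(InputStream in, long limit) {
        super(in);
        this.remaining = limit;
    }

    @Override
    public int read() throws IOException {
        if (remaining <= 0) {
            return -1; // slice exhausted: report EOF, leave the rest untouched
        }
        int b = in.read();
        if (b >= 0) {
            remaining--;
        }
        return b;
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException {
        if (remaining <= 0) {
            return -1;
        }
        int n = in.read(b, off, (int) Math.min(len, remaining));
        if (n > 0) {
            remaining -= n;
        }
        return n;
    }
}
```

On decode, the per-element byte count read from the stream would be passed as `limit`, and a fresh wrapper created for each element.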