Re: coder evolutions?

Reuven Lax Sun, 04 Feb 2018 10:56:20 -0800

Do we have a good place to track the items for Beam 3.0, or is Jira the
best place? Romain has a good point - if this gets forgotten when we do
Beam 3.0, then we're stuck waiting around till Beam 4.0.


Reuven

On Sun, Feb 4, 2018 at 9:27 AM, Jean-Baptiste Onofré <[email protected]>
wrote:

> That's a good point. In the roadmap for Beam 3, I think it makes sense to
> add a
> point about this.
>
> Regards
> JB
>
> On 02/04/2018 06:18 PM, Eugene Kirpichov wrote:
> > I think doing a change that would break pipeline update for every single
> user of
> > Flink and Dataflow needs to be postponed until a next major version.
> Pipeline
> > update is a very frequently used feature, especially by the largest
> users. We've
> > had those users get significantly upset even when we accidentally broke
> update
> > compatibility for some special cases of individual transforms; breaking
> it
> > intentionally and project-wide is too extreme to be justified by the
> benefits of
> > the current change.
> >
> > That said, I think concerns about coder APIs are reasonable, and it is
> > unfortunate that we effectively can't make changes to them right now. It
> would
> > be great if in the next major version we were better prepared for
> evolution of
> > coders, e.g. by having coders support a version marker or something like
> that,
> > with an API for detecting the version of data on wire and reading or
> writing
> > data of an old version. Such a change (introducing versioning) would
> also, of
> > course, be incompatible and would need to be postponed until a major
> version -
> > but, at least, subsequent changes wouldn't.
> >
> > ...And as I was typing this email, seems that this is what the thread
> already
> > came to!
> >
> > On Sun, Feb 4, 2018 at 9:16 AM Romain Manni-Bucau <[email protected]
> > <mailto:[email protected]>> wrote:
> >
> >     I like this idea of migration support at coder level. It would
> require to
> >     add a metadata in all outputs which would represent the version then
> coders
> >     can handle the logic properly depending the version - we can assume
> a coder
> >     dev upgrade the version when he breaks the representation I hope ;).
> >     With this: no runner impact at all :).
> >
> >
> >     Romain Manni-Bucau
> >     @rmannibucau <https://twitter.com/rmannibucau> |  Blog
> >     <https://rmannibucau.metawerx.net/> | Old Blog
> >     <http://rmannibucau.wordpress.com> | Github
> >     <https://github.com/rmannibucau> | LinkedIn
> >     <https://www.linkedin.com/in/rmannibucau> | Book
> >     <https://www.packtpub.com/application-development/java-
> ee-8-high-performance>
> >
> >     2018-02-04 18:09 GMT+01:00 Reuven Lax <[email protected]
> >     <mailto:[email protected]>>:
> >
> >         It would already break quite a number of users at this point.
> >
> >         I think what we should be doing is moving forward on the
> snapshot/update
> >         proposal. That proposal actually provides a way forward when
> coders
> >         change (it proposes a way to map an old snapshot to one using
> the new
> >         coder, so changes to coders in the future will be much easier to
> make.
> >         However much of the implementation for this will likely be at
> the runner
> >         level, not the SDK level.
> >
> >         Reuven
> >
> >         On Sun, Feb 4, 2018 at 9:04 AM, Romain Manni-Bucau
> >         <[email protected] <mailto:[email protected]>> wrote:
> >
> >             I fully understand that, and this is one of the reason
> managing to
> >             solve these issues is very important and ASAP. My conclusion
> is that
> >             we must break it now to avoid to do it later when usage will
> be way
> >             more developped - I would be very happy to be wrong on that
> point -
> >             so I started this PR and this thread. We can postpone it but
> it
> >             would break later so for probably more users.
> >
> >
> >             Romain Manni-Bucau
> >             @rmannibucau <https://twitter.com/rmannibucau> |  Blog
> >             <https://rmannibucau.metawerx.net/> | Old Blog
> >             <http://rmannibucau.wordpress.com> | Github
> >             <https://github.com/rmannibucau> | LinkedIn
> >             <https://www.linkedin.com/in/rmannibucau> | Book
> >             <https://www.packtpub.com/application-development/java-
> ee-8-high-performance>
> >
> >             2018-02-04 17:49 GMT+01:00 Reuven Lax <[email protected]
> >             <mailto:[email protected]>>:
> >
> >                 Unfortunately several runners (at least Flink and
> Dataflow)
> >                 support in-place update of streaming pipelines as a key
> feature,
> >                 and changing coder format breaks this. This is a very
> important
> >                 feature of both runners, and we should endeavor not to
> break them.
> >
> >                 In-place snapshot and update is also a top-level Beam
> proposal
> >                 that was received positively, though neither of those
> runners
> >                 yet implement the proposed interface.
> >
> >                 Reuven
> >
> >                 On Sun, Feb 4, 2018 at 8:44 AM, Romain Manni-Bucau
> >                 <[email protected] <mailto:[email protected]>>
> wrote:
> >
> >                     Sadly yes, and why the PR is actually WIP. As
> mentionned it
> >                     modifies it and requires some updates in other
> languages and
> >                     the standard_coders.yml file (I didn't find how this
> file
> >                     was generated).
> >                     Since coders must be about volatile data I don't
> think it is
> >                     a big deal to change it though.
> >
> >
> >                     Romain Manni-Bucau
> >                     @rmannibucau <https://twitter.com/rmannibucau> |
>  Blog
> >                     <https://rmannibucau.metawerx.net/> | Old Blog
> >                     <http://rmannibucau.wordpress.com> | Github
> >                     <https://github.com/rmannibucau> | LinkedIn
> >                     <https://www.linkedin.com/in/rmannibucau> | Book
> >                     <https://www.packtpub.com/
> application-development/java-ee-8-high-performance>
> >
> >                     2018-02-04 17:34 GMT+01:00 Reuven Lax <
> [email protected]
> >                     <mailto:[email protected]>>:
> >
> >                         One question - does this change the actual byte
> encoding
> >                         of elements? We've tried hard not to do that so
> far for
> >                         reasons of compatibility.
> >
> >                         Reuven
> >
> >                         On Sun, Feb 4, 2018 at 6:44 AM, Romain
> Manni-Bucau
> >                         <[email protected] <mailto:
> [email protected]>>
> >                         wrote:
> >
> >                             Hi guys,
> >
> >                             I submitted a PR on coders to enhance 1. the
> user
> >                             experience 2. the determinism and handling
> of coders.
> >
> >                             1. the user experience is linked to what i
> sent some
> >                             days ago: close handling of the streams from
> a coder
> >                             code. Long story short I add a
> SkipCloseCoder which
> >                             can decorate a coder and just wraps the
> stream
> >                             (input or output) in flavors skipping
> close() calls.
> >                             This avoids to do it by default (which had my
> >                             preference if you read the related thread
> but not
> >                             the one of everybody) but also makes the
> usage of a
> >                             coder with this issue easy since the of() of
> the
> >                             coder just wraps itself in this delagating
> coder.
> >
> >                             2. this one is more nasty and mainly concerns
> >                             IterableLikeCoders. These ones use this kind
> of
> >                             algorithm (keep in mind they work on a list):
> >
> >                             writeSize()
> >                             for all element e {
> >                                 elementCoder.write(e)
> >                             }
> >                             writeMagicNumber() // this one depends the
> size
> >
> >                             The decoding is symmetric so I bypass it
> here.
> >
> >                             Indeed all these writes (reads) are done on
> the same
> >                             stream. Therefore it assumes you read as
> much bytes
> >                             than you write...which is a huge assumption
> for a
> >                             coder which should by contract assume it can
> read
> >                             the stream...as a stream (until -1).
> >
> >                             The idea of the fix is to change this
> encoding to
> >                             this kind of algorithm:
> >
> >                             writeSize()
> >                             for all element e {
> >                                 writeElementByteCount(e)
> >                                 elementCoder.write(e)
> >                             }
> >                             writeMagicNumber() // still optionally
> >
> >                             This way on the decode size you can wrap the
> stream
> >                             by element to enforce the limitation of the
> byte count.
> >
> >                             Side note: this indeed enforce a limitation
> due to
> >                             java byte limitation but if you check coder
> code it
> >                             is already here at the higher level so it is
> not a
> >                             big deal for now.
> >
> >                             In terms of implementation it uses a
> >                             LengthAwareCoder which delegates to another
> coder
> >                             the encoding and just adds the byte count
> before the
> >                             actual serialization. Not perfect but should
> be more
> >                             than enough in terms of support and perf for
> beam if
> >                             you think real pipelines (we try to avoid
> >                             serializations or it is done on some well
> known
> >                             points where this algo should be
> enough...worse case
> >                             it is not a huge overhead, mainly just some
> memory
> >                             overhead).
> >
> >
> >                             The PR is available
> >                             at https://github.com/apache/beam/pull/4594.
> If you
> >                             check you will see I put it "WIP". The main
> reason
> >                             is that it changes the encoding format for
> >                             containers (lists, iterable, ...) and
> therefore
> >                             breaks python/go/... tests and the
> >                             standard_coders.yml definition. Some help on
> that
> >                             would be very welcomed.
> >
> >                             Technical side note if you
> >                             wonder: UnownedInputStream doesn't even
> allow to
> >                             mark the stream so there is no real fast way
> to read
> >                             the stream as fast as possible with standard
> >                             buffering strategies and to support this
> automatic
> >                             IterableCoder wrapping which is implicit. In
> other
> >                             words, if beam wants to support any coder,
> including
> >                             the ones not requiring to write the size of
> the
> >                             output - most of the codecs - then we need
> to change
> >                             the way it works to something like that
> which does
> >                             it for the user which doesn't know its coder
> got
> >                             wrapped.
> >
> >                             Hope it makes sense, if not, don't hesitate
> to ask
> >                             questions.
> >
> >                             Happy end of week-end.
> >
> >                             Romain Manni-Bucau
> >                             @rmannibucau <https://twitter.com/
> rmannibucau> |
> >                              Blog <https://rmannibucau.metawerx.net/> |
> Old Blog
> >                             <http://rmannibucau.wordpress.com> | Github
> >                             <https://github.com/rmannibucau> | LinkedIn
> >                             <https://www.linkedin.com/in/rmannibucau> |
> Book
> >                             <https://www.packtpub.com/
> application-development/java-ee-8-high-performance>
> >
> >
> >
> >
> >
> >
> >
>
> --
> Jean-Baptiste Onofré
> [email protected]
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>

Re: coder evolutions?

Reply via email to