Lukasz - Were you able to get any more context on the possibility of
versioning coders from other folks at Google?

It sounds like adding versioning for coders and/or schemas is potentially a
large change. At this point, should I just write up some highlights from
this thread in a JIRA issue for future tracking?
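
For whoever picks this up, here is a rough, non-authoritative sketch of
the idea discussed in the quoted messages: register each coder version
under its own URN and never replace an existing entry. Everything in it is
hypothetical. SimpleCoder, FileMeta, Registry, and the "example:coder:..."
URNs are illustration-only stand-ins, not Beam's Coder, Metadata, or URN
scheme.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: none of these types or URNs exist in Beam. */
final class VersionedCoderSketch {

  // Made-up URNs that carry the coder version as a suffix.
  static final String URN_V1 = "example:coder:file_meta:v1";
  static final String URN_V2 = "example:coder:file_meta:v2";

  /** Stand-in for a coder; much simpler than Beam's real Coder class. */
  interface SimpleCoder<T> {
    byte[] encode(T value);
    T decode(byte[] bytes);
  }

  /** Stand-in for file metadata; lastModifiedMillis is the newly added field. */
  static final class FileMeta {
    final String path;
    final long sizeBytes;
    final long lastModifiedMillis; // 0 means "unknown", e.g. decoded from v1 bytes

    FileMeta(String path, long sizeBytes, long lastModifiedMillis) {
      this.path = path;
      this.sizeBytes = sizeBytes;
      this.lastModifiedMillis = lastModifiedMillis;
    }
  }

  /** Append-only mapping from versioned URN to coder; entries are never replaced. */
  static final class Registry {
    private final Map<String, SimpleCoder<?>> codersByUrn = new HashMap<>();

    <T> void register(String urn, SimpleCoder<T> coder) {
      if (codersByUrn.putIfAbsent(urn, coder) != null) {
        throw new IllegalArgumentException("URN already registered: " + urn);
      }
    }

    @SuppressWarnings("unchecked")
    <T> SimpleCoder<T> lookup(String urn) {
      SimpleCoder<?> coder = codersByUrn.get(urn);
      if (coder == null) {
        throw new IllegalArgumentException("Unknown coder URN: " + urn);
      }
      return (SimpleCoder<T>) coder;
    }
  }

  /** Builds a FileMeta coder; v2 appends lastModifiedMillis after the v1 fields. */
  static SimpleCoder<FileMeta> fileMetaCoder(final boolean withLastModified) {
    return new SimpleCoder<FileMeta>() {
      @Override
      public byte[] encode(FileMeta m) {
        byte[] path = m.path.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf =
            ByteBuffer.allocate(4 + path.length + 8 + (withLastModified ? 8 : 0));
        buf.putInt(path.length).put(path).putLong(m.sizeBytes);
        if (withLastModified) {
          buf.putLong(m.lastModifiedMillis);
        }
        return buf.array();
      }

      @Override
      public FileMeta decode(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        byte[] path = new byte[buf.getInt()];
        buf.get(path);
        long size = buf.getLong();
        long lastModified = withLastModified ? buf.getLong() : 0L;
        return new FileMeta(new String(path, StandardCharsets.UTF_8), size, lastModified);
      }
    };
  }

  /** Registers both versions side by side so old data stays readable. */
  static Registry newRegistry() {
    Registry registry = new Registry();
    registry.register(URN_V1, fileMetaCoder(false));
    registry.register(URN_V2, fileMetaCoder(true));
    return registry;
  }
}
```

With something like this, data tagged with the v1 URN would still decode
after an SDK upgrade, while new pipelines could opt into v2. How a runner
would detect and reconcile the two versions during an update is not
addressed by the sketch.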

On Mon, Nov 12, 2018 at 8:23 PM Reuven Lax <re...@google.com> wrote:

> A few thoughts:
>
> 1. I agree with you about coder versioning. The lack of a good story
> around versioning has been a huge pain here, and it's unfortunate that
> nobody ever worked on this.
>
> 2. I think versioning schemas will be easier than versioning coders
> (especially for adding new fields). In many cases I suggest we start
> looking at migrating as much as possible to schemas, and in Beam 3.0 maybe
> we can migrate all of our internal payload to schemas. Schemas support
> nested fields, repeated fields, and map fields - which can model most things.
>
> 3. There was a Beam proposal for a way to generically handle incompatible
> schema updates via snapshots. The idea was that such updates can be
> accompanied by a transform that maps a pipeline snapshot into a new
> snapshot with the encodings modified.
>
> Reuven
>
> On Tue, Nov 13, 2018 at 3:16 AM Jeff Klukas <jklu...@mozilla.com> wrote:
>
>> Conversation here has fizzled, but sounds like there's basically a
>> consensus here on a need for a new concept of Coder versioning that's
>> accessible at the Java level in order to allow an evolution path. Further,
>> it sounds like my open PR [0] for adding a new field to Metadata is
>> essentially blocked until we have coder versioning in place.
>>
>> Is there any existing documentation of these concepts, or should I go
>> ahead and file a new Jira issue summarizing the problem? I don't think I
>> have a comprehensive enough understanding of the Coder machinery to be able
>> to design a solution, so I'd need to hand this off or simply leave it in
>> the Jira backlog.
>>
>> [0] https://github.com/apache/beam/pull/6914
>>
>>
>> On Tue, Nov 6, 2018 at 4:38 AM Robert Bradshaw <rober...@google.com>
>> wrote:
>>
>>> Yes, a Coder author should be able to register a URN with a mapping
>>> from (components + payload) -> Coder (and vice versa), and this should
>>> be more lightweight than manually editing the proto files.
>>> On Mon, Nov 5, 2018 at 7:12 PM Thomas Weise <t...@apache.org> wrote:
>>> >
>>> > +1
>>> >
>>> > I think that coders should be immutable/versioned. The SDK should know
>>> about all the available versions and be able to associate the data (stream
>>> or at rest) with the corresponding coder version via URN. We can also look
>>> at how that is solved elsewhere, for example the Kafka schema registry.
>>> >
>>> > Today we only have a few URNs for standard coders:
>>> https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L617
>>> >
>>> > I imagine we will need a coder registry where IOs and users can add
>>> their versioned coders also?
>>> >
>>> > Thanks,
>>> > Thomas
>>> >
>>> >
>>> > On Mon, Nov 5, 2018 at 7:54 AM Jean-Baptiste Onofré <j...@nanthrax.net>
>>> wrote:
>>> >>
>>> >> It makes sense to have a more concrete URN including the version.
>>> >>
>>> >> Good idea Robert.
>>> >>
>>> >> Regards
>>> >> JB
>>> >>
>>> >> On 05/11/2018 16:52, Robert Bradshaw wrote:
>>> >> > I think we'll want to allow upgrades across SDK versions. A runner
>>> >> > should be able to recognize when a coder (or any other aspect of the
>>> >> > pipeline) has changed and adapt/reject accordingly. (Until we remove
>>> >> > coders from sources/sinks, there's also possibly the expectation
>>> that
>>> >> > one should be able to read data from a source written with that same
>>> >> > coder across versions as well.)
>>> >> >
>>> >> > I think it really comes down to how coders are named. If we decide
>>> to
>>> >> > let coders change arbitrarily between versions, probably the URN for
>>> >> > SerializedJavaCoder should have the SDK version number in it. Coders
>>> >> > that are stable across SDKs can have better, more stable URNs
>>> defined
>>> >> > and registered.
>>> >> >
>>> >> > I am more OK with changing the registry to infer different coders as
>>> >> > the SDK evolves (which would be detected and manually overwritten
>>> with
>>> >> > the old ones, on a case-by-case basis, if they still exist). This
>>> >> > should still be done with caution as it will make upgrading harder.
>>> >> > Highly composite, experimental coders should possibly be designed in
>>> >> > an intrinsically extensible way.
>>> >> >
>>> >> > On Mon, Nov 5, 2018 at 4:24 PM Jean-Baptiste Onofré <
>>> j...@nanthrax.net> wrote:
>>> >> >>
>>> >> >> That's really a pita. It's an important and impactful change.
>>> >> >>
>>> >> >> I would go with 1.
>>> >> >>
>>> >> >> For LTS, as already said, I would create an LTS branch and only
>>> cherry
>>> >> >> pick some changes. Using master as LTS release branch won't work
>>> IMHO.
>>> >> >>
>>> >> >> Regards
>>> >> >> JB
>>> >> >>
>>> >> >> On 05/11/2018 15:47, Ismaël Mejía wrote:
>>> >> >>> For some extra context this change touches more than FileIO, in
>>> >> >>> reality this will affect updates in any file-based pipelines
>>> because
>>> >> >>> the metadata on each file will now have an extra field for the
>>> >> >>> lastModifiedDate.
>>> >> >>>
>>> >> >>> The PR looks perfect, only issue is the backwards compatibility
>>> Coder
>>> >> >>> question. Knowing that probably Dataflow is the only one
>>> affected, I
>>> >> >>> would like to know what we can do.
>>> >> >>>
>>> >> >>> [1] Should we merge and the Coder updatability be tied to SDK
>>> versions
>>> >> >>> (which makes sense and is probably more aligned with the LTS
>>> >> >>> discussion)?
>>> >> >>> [2] Should we have a MetadataCoderV2? (does this imply a repeated
>>> >> >>> Metadata object?) In this case where is the right place to
>>> identify
>>> >> >>> and decide what coder to use?
>>> >> >>>
>>> >> >>> Other ideas... ?
>>> >> >>>
>>> >> >>> Last thing, the link that Luke shared does not seem to work (it
>>> >> >>> looks like a googley-friendly URL). Here is the full URL for those
>>> >> >>> interested in the drain/update proposal:
>>> >> >>>
>>> >> >>> [2]
>>> https://docs.google.com/document/d/1UWhnYPgui0gUYOsuGcCjLuoOUlGA4QaY91n8p3wz9MY/edit#
>>> >> >>> On Fri, Nov 2, 2018 at 10:11 PM Lukasz Cwik <lc...@google.com>
>>> wrote:
>>> >> >>>>
>>> >> >>>> I think the idea is that you would use one coder for paths where
>>> you don't need this information and would have FileIO provide a separate
>>> path that uses your updated coder.
>>> >> >>>> Existing users would not be impacted and users of the new FileIO
>>> that depend on this information would not be able to have updated their
>>> pipeline in the first place.
>>> >> >>>>
>>> >> >>>> If the feature in FileIO is experimental, we could choose to
>>> break it for existing users though since I don't know how feasible my
>>> suggestion above is.
>>> >> >>>>
>>> >> >>>>
>>> >> >>>>
>>> >> >>>> On Fri, Nov 2, 2018 at 12:56 PM Jeff Klukas <jklu...@mozilla.com>
>>> wrote:
>>> >> >>>>>
>>> >> >>>>> Lukasz - Thanks for those links. That's very helpful context.
>>> >> >>>>>
>>> >> >>>>> It sounds like there's no explicit user contract about evolving
>>> Coder classes in the Java SDK and users might reasonably assume Coders to
>>> be stable between SDK versions. Thus, users of the Dataflow or Flink
>>> runners might reasonably expect that they can update the Java SDK version
>>> used in their pipeline when performing an update.
>>> >> >>>>>
>>> >> >>>>> Based on that understanding, evolving a class like Metadata
>>> might not be possible except in a major version bump where it's obvious to
>>> users to expect breaking changes and not to expect an "update" operation to
>>> work.
>>> >> >>>>>
>>> >> >>>>> It's not clear to me what changing the "name" of a coder would
>>> look like or whether that's a tenable solution here. Would that change be
>>> able to happen within the SDK itself, or is it something users would need
>>> to specify?
>>> >> >>
>>> >> >> --
>>> >> >> Jean-Baptiste Onofré
>>> >> >> jbono...@apache.org
>>> >> >> http://blog.nanthrax.net
>>> >> >> Talend - http://www.talend.com
>>> >>
>>> >> --
>>> >> Jean-Baptiste Onofré
>>> >> jbono...@apache.org
>>> >> http://blog.nanthrax.net
>>> >> Talend - http://www.talend.com
>>>
>>
