Conversation here has fizzled, but it sounds like there's basically
consensus on the need for a new concept of Coder versioning that's
accessible at the Java level in order to allow an evolution path. Further,
it sounds like my open PR [0] for adding a new field to Metadata is
essentially blocked until we have coder versioning in place.

Is there any existing documentation of these concepts, or should I go ahead
and file a new Jira issue summarizing the problem? I don't think I have a
comprehensive enough understanding of the Coder machinery to be able to
design a solution, so I'd need to hand this off or simply leave it in the
Jira backlog.
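
To make the problem statement concrete for a Jira write-up, here is a rough
sketch of how I understand the "versioned URN" idea from the thread. It is
plain Java and deliberately does not use Beam's Coder API: FileMeta,
MetaCodec, MetaCodecV1/V2, VersionedCodecRegistry, and the URN strings are
all hypothetical names made up for illustration. The point is only that each
wire format stays frozen under its own URN, and a new field gets a new URN
instead of changing an existing encoding.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a file metadata record; NOT Beam's Metadata class.
final class FileMeta {
  final String path;
  final long sizeBytes;
  final long lastModifiedMillis; // the newly added field; 0 means "unknown"

  FileMeta(String path, long sizeBytes, long lastModifiedMillis) {
    this.path = path;
    this.sizeBytes = sizeBytes;
    this.lastModifiedMillis = lastModifiedMillis;
  }
}

// Hypothetical, simplified codec interface; Beam's real Coder API differs.
interface MetaCodec {
  byte[] encode(FileMeta m);
  FileMeta decode(byte[] bytes);
}

// Version 1 wire format: [int pathLength][path bytes][long size]. Never changes.
final class MetaCodecV1 implements MetaCodec {
  public byte[] encode(FileMeta m) {
    byte[] p = m.path.getBytes(StandardCharsets.UTF_8);
    return ByteBuffer.allocate(4 + p.length + 8)
        .putInt(p.length).put(p).putLong(m.sizeBytes).array();
  }

  public FileMeta decode(byte[] bytes) {
    ByteBuffer b = ByteBuffer.wrap(bytes);
    byte[] p = new byte[b.getInt()];
    b.get(p);
    // V1 data carries no timestamp, so the new field falls back to 0.
    return new FileMeta(new String(p, StandardCharsets.UTF_8), b.getLong(), 0L);
  }
}

// Version 2 wire format: the V1 layout followed by [long lastModifiedMillis].
final class MetaCodecV2 implements MetaCodec {
  public byte[] encode(FileMeta m) {
    byte[] p = m.path.getBytes(StandardCharsets.UTF_8);
    return ByteBuffer.allocate(4 + p.length + 16)
        .putInt(p.length).put(p).putLong(m.sizeBytes)
        .putLong(m.lastModifiedMillis).array();
  }

  public FileMeta decode(byte[] bytes) {
    ByteBuffer b = ByteBuffer.wrap(bytes);
    byte[] p = new byte[b.getInt()];
    b.get(p);
    long size = b.getLong();
    long modified = b.getLong();
    return new FileMeta(new String(p, StandardCharsets.UTF_8), size, modified);
  }
}

// Hypothetical registry: one immutable codec per versioned URN.
final class VersionedCodecRegistry {
  private final Map<String, MetaCodec> byUrn = new HashMap<>();

  void register(String urn, MetaCodec codec) {
    // Refuse to rebind a URN, so an encoding can never silently change.
    if (byUrn.putIfAbsent(urn, codec) != null) {
      throw new IllegalArgumentException("URN already registered: " + urn);
    }
  }

  MetaCodec lookup(String urn) {
    MetaCodec codec = byUrn.get(urn);
    if (codec == null) {
      throw new IllegalArgumentException("Unknown coder URN: " + urn);
    }
    return codec;
  }
}
```

Under this sketch, a pipeline (or stored data) would record the URN it was
written with, e.g. "example:coder:file_meta:v1", and an updated pipeline
would look up that same URN to keep decoding old bytes while new paths opt
in to the v2 URN. Where that URN would actually be recorded, and who decides
which version to use, is exactly the open design question.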

[0] https://github.com/apache/beam/pull/6914


On Tue, Nov 6, 2018 at 4:38 AM Robert Bradshaw <rober...@google.com> wrote:

> Yes, a Coder author should be able to register a URN with a mapping
> from (components + payload) -> Coder (and vice versa), and this should
> be more lightweight than manually editing the proto files.
> On Mon, Nov 5, 2018 at 7:12 PM Thomas Weise <t...@apache.org> wrote:
> >
> > +1
> >
> > I think that coders should be immutable/versioned. The SDK should know
> > about all the available versions and be able to associate the data (stream
> > or at rest) with the corresponding coder version via URN. We can also look
> > at how that is solved elsewhere, for example the Kafka schema registry.
> >
> > Today we only have a few URNs for standard coders:
> > https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L617
> >
> > I imagine we will need a coder registry where IOs and users can add
> > their versioned coders also?
> >
> > Thanks,
> > Thomas
> >
> >
> > On Mon, Nov 5, 2018 at 7:54 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> >>
> >> It makes sense to have a more concrete URN including the version.
> >>
> >> Good idea Robert.
> >>
> >> Regards
> >> JB
> >>
> >> On 05/11/2018 16:52, Robert Bradshaw wrote:
> >> > I think we'll want to allow upgrades across SDK versions. A runner
> >> > should be able to recognize when a coder (or any other aspect of the
> >> > pipeline) has changed and adapt/reject accordingly. (Until we remove
> >> > coders from sources/sinks, there's also possibly the expectation that
> >> > one should be able to read data from a source written with that same
> >> > coder across versions as well.)
> >> >
> >> > I think it really comes down to how coders are named. If we decide to
> >> > let coders change arbitrarily between versions, probably the URN for
> >> > SerializedJavaCoder should have the SDK version number in it. Coders
> >> > that are stable across SDKs can have better, more stable URNs defined
> >> > and registered.
> >> >
> >> > I am more OK with changing the registry to infer different coders as
> >> > the SDK evolves (which would be detected and manually overwritten with
> >> > the old ones, on a case-by-case basis, if they still exist). This
> >> > should still be done with caution as it will make upgrading harder.
> >> > Highly composite, experimental coders should possibly be designed in
> >> > an intrinsically extensible way.
> >> >
> >> > On Mon, Nov 5, 2018 at 4:24 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> >> >>
> >> >> That's really a pita. It's an important and impacting change.
> >> >>
> >> >> I would go to 1.
> >> >>
> >> >> For LTS, as already said, I would create an LTS branch and only
> >> >> cherry-pick some changes. Using master as the LTS release branch
> >> >> won't work IMHO.
> >> >>
> >> >> Regards
> >> >> JB
> >> >>
> >> >> On 05/11/2018 15:47, Ismaël Mejía wrote:
> >> >>> For some extra context, this change touches more than FileIO; in
> >> >>> reality it will affect updates of any file-based pipeline because
> >> >>> the metadata on each file will now have an extra field for the
> >> >>> lastModifiedDate.
> >> >>>
> >> >>> The PR looks perfect; the only issue is the Coder backwards
> >> >>> compatibility question. Knowing that Dataflow is probably the only
> >> >>> runner affected, I would like to know what we can do.
> >> >>>
> >> >>> [1] Should we merge and let Coder updatability be tied to SDK
> >> >>> versions (which makes sense and is probably more aligned with the
> >> >>> LTS discussion)?
> >> >>> [2] Should we have a MetadataCoderV2? (Does this imply a repeated
> >> >>> Metadata object?) In this case, where is the right place to identify
> >> >>> and decide which coder to use?
> >> >>>
> >> >>> Other ideas... ?
> >> >>>
> >> >>> Last thing, the link that Luke shared does not seem to work (it
> >> >>> looks like a googley-friendly URL). Here is the full URL for those
> >> >>> interested in the drain/update proposal:
> >> >>>
> >> >>> [2]
> >> >>> https://docs.google.com/document/d/1UWhnYPgui0gUYOsuGcCjLuoOUlGA4QaY91n8p3wz9MY/edit#
> >> >>> On Fri, Nov 2, 2018 at 10:11 PM Lukasz Cwik <lc...@google.com> wrote:
> >> >>>>
> >> >>>> I think the idea is that you would use one coder for paths where
> >> >>>> you don't need this information and would have FileIO provide a
> >> >>>> separate path that uses your updated coder.
> >> >>>> Existing users would not be impacted and users of the new FileIO
> >> >>>> that depend on this information would not be able to have updated
> >> >>>> their pipeline in the first place.
> >> >>>>
> >> >>>> If the feature in FileIO is experimental, we could choose to break
> >> >>>> it for existing users though since I don't know how feasible my
> >> >>>> suggestion above is.
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>> On Fri, Nov 2, 2018 at 12:56 PM Jeff Klukas <jklu...@mozilla.com> wrote:
> >> >>>>>
> >> >>>>> Lukasz - Thanks for those links. That's very helpful context.
> >> >>>>>
> >> >>>>> It sounds like there's no explicit user contract about evolving
> >> >>>>> Coder classes in the Java SDK and users might reasonably assume
> >> >>>>> Coders to be stable between SDK versions. Thus, users of the
> >> >>>>> Dataflow or Flink runners might reasonably expect that they can
> >> >>>>> update the Java SDK version used in their pipeline when performing
> >> >>>>> an update.
> >> >>>>>
> >> >>>>> Based on that understanding, evolving a class like Metadata might
> >> >>>>> not be possible except in a major version bump where it's obvious
> >> >>>>> to users to expect breaking changes and not to expect an "update"
> >> >>>>> operation to work.
> >> >>>>>
> >> >>>>> It's not clear to me what changing the "name" of a coder would
> >> >>>>> look like or whether that's a tenable solution here. Would that
> >> >>>>> change be able to happen within the SDK itself, or is it something
> >> >>>>> users would need to specify?
> >> >>
> >> >> --
> >> >> Jean-Baptiste Onofré
> >> >> jbono...@apache.org
> >> >> http://blog.nanthrax.net
> >> >> Talend - http://www.talend.com
> >>
> >> --
> >> Jean-Baptiste Onofré
> >> jbono...@apache.org
> >> http://blog.nanthrax.net
> >> Talend - http://www.talend.com
>
