The conversation here has fizzled, but it sounds like there's basically consensus on the need for a new concept of Coder versioning, accessible at the Java level, to allow an evolution path. It also sounds like my open PR [0], which adds a new field to Metadata, is essentially blocked until we have coder versioning in place.
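To make the idea concrete, here is a rough sketch of how I read the versioned-URN suggestion quoted below. It is an illustration, not a design proposal: the class, the URN strings, the toy byte layouts, and the simplified Metadata stand-in are all hypothetical and none of them exist in the Beam SDK. It also reduces the (components + payload) -> Coder mapping to a plain URN -> decoder lookup.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch only: nothing here is part of the Beam SDK.
final class VersionedCoderRegistrySketch {

  /** Minimal stand-in for a coder; only the decode direction is shown. */
  interface Decoder {
    Metadata decode(byte[] bytes);
  }

  /** Simplified stand-in for file metadata; lastModifiedMillis is the new field. */
  static final class Metadata {
    final String path;
    final long lastModifiedMillis;

    Metadata(String path, long lastModifiedMillis) {
      this.path = path;
      this.lastModifiedMillis = lastModifiedMillis;
    }
  }

  private final Map<String, Decoder> decodersByUrn = new HashMap<>();

  /** Registers a decoder under a versioned URN; a URN is never rebound. */
  void register(String urn, Decoder decoder) {
    if (decodersByUrn.putIfAbsent(urn, decoder) != null) {
      throw new IllegalStateException("URN already registered: " + urn);
    }
  }

  /** Looks up the decoder for the URN stored alongside the data. */
  Decoder lookup(String urn) {
    Decoder decoder = decodersByUrn.get(urn);
    if (decoder == null) {
      throw new IllegalArgumentException("Unknown coder URN: " + urn);
    }
    return decoder;
  }

  /** Builds a registry holding both the old and the new encoding. */
  static VersionedCoderRegistrySketch withMetadataCoders() {
    VersionedCoderRegistrySketch registry = new VersionedCoderRegistrySketch();
    // v1 (toy layout): the bytes are just the UTF-8 path; the new field defaults to 0.
    registry.register("example:coder:metadata:v1",
        bytes -> new Metadata(new String(bytes, StandardCharsets.UTF_8), 0L));
    // v2 (toy layout): "<millis>\n<path>" as UTF-8.
    registry.register("example:coder:metadata:v2", bytes -> {
      String[] parts = new String(bytes, StandardCharsets.UTF_8).split("\n", 2);
      return new Metadata(parts[1], Long.parseLong(parts[0]));
    });
    return registry;
  }

  public static void main(String[] args) {
    VersionedCoderRegistrySketch registry = withMetadataCoders();
    byte[] encoded = "1541400000000\ngs://bucket/file.txt".getBytes(StandardCharsets.UTF_8);
    Metadata m = registry.lookup("example:coder:metadata:v2").decode(encoded);
    System.out.println(m.path + " " + m.lastModifiedMillis);
  }
}
```

The point of the sketch is only that data written with the v1 encoding stays readable after v2 is added, because each encoding keeps its own immutable URN, and that an unrecognized URN is rejected outright rather than decoded incorrectly.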
Is there any existing documentation of these concepts, or should I go ahead and file a new Jira issue summarizing the problem? I don't think I have a comprehensive enough understanding of the Coder machinery to be able to design a solution, so I'd need to hand this off or simply leave it in the Jira backlog.

[0] https://github.com/apache/beam/pull/6914

On Tue, Nov 6, 2018 at 4:38 AM Robert Bradshaw <rober...@google.com> wrote:

> Yes, a Coder author should be able to register a URN with a mapping
> from (components + payload) -> Coder (and vice versa), and this should
> be more lightweight than manually editing the proto files.
>
> On Mon, Nov 5, 2018 at 7:12 PM Thomas Weise <t...@apache.org> wrote:
> >
> > +1
> >
> > I think that coders should be immutable/versioned. The SDK should know
> > about all the available versions and be able to associate the data
> > (stream or at rest) with the corresponding coder version via URN. We
> > can also look at how that is solved elsewhere, for example the Kafka
> > schema registry.
> >
> > Today we only have a few URNs for standard coders:
> > https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L617
> >
> > I imagine we will need a coder registry where IOs and users can add
> > their versioned coders also?
> >
> > Thanks,
> > Thomas
> >
> >
> > On Mon, Nov 5, 2018 at 7:54 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> >>
> >> It makes sense to have a more concrete URN including the version.
> >>
> >> Good idea Robert.
> >>
> >> Regards
> >> JB
> >>
> >> On 05/11/2018 16:52, Robert Bradshaw wrote:
> >> > I think we'll want to allow upgrades across SDK versions. A runner
> >> > should be able to recognize when a coder (or any other aspect of the
> >> > pipeline) has changed and adapt/reject accordingly. (Until we remove
> >> > coders from sources/sinks, there's also possibly the expectation that
> >> > one should be able to read data from a source written with that same
> >> > coder across versions as well.)
> >> >
> >> > I think it really comes down to how coders are named. If we decide to
> >> > let coders change arbitrarily between versions, probably the URN for
> >> > SerializedJavaCoder should have the SDK version number in it. Coders
> >> > that are stable across SDKs can have better, more stable URNs defined
> >> > and registered.
> >> >
> >> > I am more OK with changing the registry to infer different coders as
> >> > the SDK evolves (which would be detected and manually overwritten with
> >> > the old ones, on a case-by-case basis, if they still exist). This
> >> > should still be done with caution as it will make upgrading harder.
> >> > Highly composite, experimental coders should possibly be designed in
> >> > an intrinsically extensible way.
> >> >
> >> > On Mon, Nov 5, 2018 at 4:24 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> >> >>
> >> >> That's really a pita. It's an important and impacting change.
> >> >>
> >> >> I would go to 1.
> >> >>
> >> >> For LTS, as already said, I would create a LTS branch and only cherry
> >> >> pick some changes. Using master as LTS release branch won't work IMHO.
> >> >>
> >> >> Regards
> >> >> JB
> >> >>
> >> >> On 05/11/2018 15:47, Ismaël Mejía wrote:
> >> >>> For some extra context this change touches more than FileIO, in
> >> >>> reality this will affect updates in any file-based pipelines because
> >> >>> the metadata on each file will have now an extra field for the
> >> >>> lastModifiedDate.
> >> >>>
> >> >>> The PR looks perfect, only issue is the backwards compatibility Coder
> >> >>> question. Knowing that probably Dataflow is the only one affected, I
> >> >>> would like to know what can we do?
> >> >>>
> >> >>> [1] Should we merge and the Coder updatability be tied to SDK versions
> >> >>> (which makes sense and is probably more aligned with the LTS
> >> >>> discussion)?
> >> >>> [2] Should we have a MetadataCoderV2? (does this imply a repeated
> >> >>> Metadata object)? In this case where is the right place to identify
> >> >>> and decide what coder to use?
> >> >>>
> >> >>> Other ideas... ?
> >> >>>
> >> >>> Last thing, the link that Luke shared does not seem to work (looks
> >> >>> like a googley-friendly URL); here is the full URL for those
> >> >>> interested in the drain/update proposal:
> >> >>>
> >> >>> [2] https://docs.google.com/document/d/1UWhnYPgui0gUYOsuGcCjLuoOUlGA4QaY91n8p3wz9MY/edit#
> >> >>>
> >> >>> On Fri, Nov 2, 2018 at 10:11 PM Lukasz Cwik <lc...@google.com> wrote:
> >> >>>>
> >> >>>> I think the idea is that you would use one coder for paths where
> >> >>>> you don't need this information and would have FileIO provide a
> >> >>>> separate path that uses your updated coder.
> >> >>>> Existing users would not be impacted and users of the new FileIO
> >> >>>> that depend on this information would not be able to have updated
> >> >>>> their pipeline in the first place.
> >> >>>>
> >> >>>> If the feature in FileIO is experimental, we could choose to break
> >> >>>> it for existing users though since I don't know how feasible my
> >> >>>> suggestion above is.
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>> On Fri, Nov 2, 2018 at 12:56 PM Jeff Klukas <jklu...@mozilla.com> wrote:
> >> >>>>>
> >> >>>>> Lukasz - Thanks for those links. That's very helpful context.
> >> >>>>>
> >> >>>>> It sounds like there's no explicit user contract about evolving
> >> >>>>> Coder classes in the Java SDK and users might reasonably assume
> >> >>>>> Coders to be stable between SDK versions. Thus, users of the
> >> >>>>> Dataflow or Flink runners might reasonably expect that they can
> >> >>>>> update the Java SDK version used in their pipeline when performing
> >> >>>>> an update.
> >> >>>>>
> >> >>>>> Based on that understanding, evolving a class like Metadata might
> >> >>>>> not be possible except in a major version bump where it's obvious
> >> >>>>> to users to expect breaking changes and not to expect an "update"
> >> >>>>> operation to work.
> >> >>>>>
> >> >>>>> It's not clear to me what changing the "name" of a coder would
> >> >>>>> look like or whether that's a tenable solution here. Would that
> >> >>>>> change be able to happen within the SDK itself, or is it something
> >> >>>>> users would need to specify?
> >> >>
> >> >> --
> >> >> Jean-Baptiste Onofré
> >> >> jbono...@apache.org
> >> >> http://blog.nanthrax.net
> >> >> Talend - http://www.talend.com
> >>
> >> --
> >> Jean-Baptiste Onofré
> >> jbono...@apache.org
> >> http://blog.nanthrax.net
> >> Talend - http://www.talend.com
>