Lukasz - Were you able to get any more context on the possibility of versioning coders from other folks at Google?
It sounds like adding versioning for coders and/or schemas is potentially a large change. At this point, should I just write up some highlights from this thread in a JIRA issue for future tracking?

On Mon, Nov 12, 2018 at 8:23 PM Reuven Lax <re...@google.com> wrote:

> A few thoughts:
>
> 1. I agree with you about coder versioning. The lack of a good story
> around versioning has been a huge pain here, and it's unfortunate that
> nobody ever worked on this.
>
> 2. I think versioning schemas will be easier than versioning coders
> (especially for adding new fields). In many cases I suggest we start
> looking at migrating as much as possible to schemas, and in Beam 3.0 maybe
> we can migrate all of our internal payload to schemas. Schemas support
> nested fields, repeated fields, and map fields - which can model most things.
>
> 3. There was a Beam proposal for a way to generically handle incompatible
> schema updates via snapshots. The idea was that such updates can be
> accompanied by a transform that maps a pipeline snapshot into a new
> snapshot with the encodings modified.
>
> Reuven
>
> On Tue, Nov 13, 2018 at 3:16 AM Jeff Klukas <jklu...@mozilla.com> wrote:
>
>> Conversation here has fizzled, but sounds like there's basically a
>> consensus here on a need for a new concept of Coder versioning that's
>> accessible at the Java level in order to allow an evolution path. Further,
>> it sounds like my open PR [0] for adding a new field to Metadata is
>> essentially blocked until we have coder versioning in place.
>>
>> Is there any existing documentation of these concepts, or should I go
>> ahead and file a new Jira issue summarizing the problem? I don't think I
>> have a comprehensive enough understanding of the Coder machinery to be able
>> to design a solution, so I'd need to hand this off or simply leave it in
>> the Jira backlog.
>>
>> [0] https://github.com/apache/beam/pull/6914
>>
>>
>> On Tue, Nov 6, 2018 at 4:38 AM Robert Bradshaw <rober...@google.com>
>> wrote:
>>
>>> Yes, a Coder author should be able to register a URN with a mapping
>>> from (components + payload) -> Coder (and vice versa), and this should
>>> be more lightweight than manually editing the proto files.
>>> On Mon, Nov 5, 2018 at 7:12 PM Thomas Weise <t...@apache.org> wrote:
>>> >
>>> > +1
>>> >
>>> > I think that coders should be immutable/versioned. The SDK should know
>>> > about all the available versions and be able to associate the data (stream
>>> > or at rest) with the corresponding coder version via URN. We can also look
>>> > how that is solved elsewhere, for example the Kafka schema registry.
>>> >
>>> > Today we only have a few URNs for standard coders:
>>> > https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L617
>>> >
>>> > I imagine we will need a coder registry where IOs and users can add
>>> > their versioned coders also?
>>> >
>>> > Thanks,
>>> > Thomas
>>> >
>>> >
>>> > On Mon, Nov 5, 2018 at 7:54 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>> >>
>>> >> It makes sense to have a more concrete URN including the version.
>>> >>
>>> >> Good idea Robert.
>>> >>
>>> >> Regards
>>> >> JB
>>> >>
>>> >> On 05/11/2018 16:52, Robert Bradshaw wrote:
>>> >> > I think we'll want to allow upgrades across SDK versions. A runner
>>> >> > should be able to recognize when a coder (or any other aspect of the
>>> >> > pipeline) has changed and adapt/reject accordingly. (Until we remove
>>> >> > coders from sources/sinks, there's also possibly the expectation that
>>> >> > one should be able to read data from a source written with that same
>>> >> > coder across versions as well.)
>>> >> >
>>> >> > I think it really comes down to how coders are named.
>>> >> > If we decide to
>>> >> > let coders change arbitrarily between versions, probably the URN for
>>> >> > SerializedJavaCoder should have the SDK version number in it. Coders
>>> >> > that are stable across SDKs can have better, more stable URNs defined
>>> >> > and registered.
>>> >> >
>>> >> > I am more OK with changing the registry to infer different coders as
>>> >> > the SDK evolves (which would be detected and manually overwritten with
>>> >> > the old ones, on a case-by-case basis, if they still exist). This
>>> >> > should still be done with caution as it will make upgrading harder.
>>> >> > Highly composite, experimental coders should possibly be designed in
>>> >> > an intrinsically extensible way.
>>> >> >
>>> >> > On Mon, Nov 5, 2018 at 4:24 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>> >> >>
>>> >> >> That's really a pita. It's an important and impacting change.
>>> >> >>
>>> >> >> I would go to 1.
>>> >> >>
>>> >> >> For LTS, as already said, I would create a LTS branch and only cherry
>>> >> >> pick some changes. Using master as LTS release branch won't work IMHO.
>>> >> >>
>>> >> >> Regards
>>> >> >> JB
>>> >> >>
>>> >> >> On 05/11/2018 15:47, Ismaël Mejía wrote:
>>> >> >>> For some extra context this change touches more than FileIO, in
>>> >> >>> reality this will affect updates in any file-based pipelines because
>>> >> >>> the metadata on each file will have now an extra field for the
>>> >> >>> lastModifiedDate.
>>> >> >>>
>>> >> >>> The PR looks perfect, only issue is the backwards compatibility Coder
>>> >> >>> question. Knowing that probably Dataflow is the only one affected, I
>>> >> >>> would like to know what can we do?
>>> >> >>>
>>> >> >>> [1] Should we merge and the Coder updatability be tied to SDK versions
>>> >> >>> (which makes sense and is probably more aligned with the LTS
>>> >> >>> discussion)?
>>> >> >>> [2] Should we have a MetadataCoderV2?
>>> >> >>> (does this imply a repeated
>>> >> >>> Metadata object)? In this case where is the right place to identify
>>> >> >>> and decide what coder to use?
>>> >> >>>
>>> >> >>> Other ideas... ?
>>> >> >>>
>>> >> >>> Last thing, the link that Luke shared does not seem to work (looks
>>> >> >>> like a googley-friendly URL); here is the full URL for those
>>> >> >>> interested in the drain/update proposal:
>>> >> >>>
>>> >> >>> [2] https://docs.google.com/document/d/1UWhnYPgui0gUYOsuGcCjLuoOUlGA4QaY91n8p3wz9MY/edit#
>>> >> >>> On Fri, Nov 2, 2018 at 10:11 PM Lukasz Cwik <lc...@google.com> wrote:
>>> >> >>>>
>>> >> >>>> I think the idea is that you would use one coder for paths where
>>> >> >>>> you don't need this information and would have FileIO provide a separate
>>> >> >>>> path that uses your updated coder.
>>> >> >>>> Existing users would not be impacted and users of the new FileIO
>>> >> >>>> that depend on this information would not be able to have updated their
>>> >> >>>> pipeline in the first place.
>>> >> >>>>
>>> >> >>>> If the feature in FileIO is experimental, we could choose to
>>> >> >>>> break it for existing users though since I don't know how feasible my
>>> >> >>>> suggestion above is.
>>> >> >>>>
>>> >> >>>>
>>> >> >>>>
>>> >> >>>> On Fri, Nov 2, 2018 at 12:56 PM Jeff Klukas <jklu...@mozilla.com> wrote:
>>> >> >>>>>
>>> >> >>>>> Lukasz - Thanks for those links. That's very helpful context.
>>> >> >>>>>
>>> >> >>>>> It sounds like there's no explicit user contract about evolving
>>> >> >>>>> Coder classes in the Java SDK and users might reasonably assume Coders to
>>> >> >>>>> be stable between SDK versions. Thus, users of the Dataflow or Flink
>>> >> >>>>> runners might reasonably expect that they can update the Java SDK version
>>> >> >>>>> used in their pipeline when performing an update.
>>> >> >>>>>
>>> >> >>>>> Based on that understanding, evolving a class like Metadata
>>> >> >>>>> might not be possible except in a major version bump where it's obvious to
>>> >> >>>>> users to expect breaking changes and not to expect an "update" operation to
>>> >> >>>>> work.
>>> >> >>>>>
>>> >> >>>>> It's not clear to me what changing the "name" of a coder would
>>> >> >>>>> look like or whether that's a tenable solution here. Would that change be
>>> >> >>>>> able to happen within the SDK itself, or is it something users would need
>>> >> >>>>> to specify?
>>> >> >>
>>> >> >> --
>>> >> >> Jean-Baptiste Onofré
>>> >> >> jbono...@apache.org
>>> >> >> http://blog.nanthrax.net
>>> >> >> Talend - http://www.talend.com
>>> >>
>>> >> --
>>> >> Jean-Baptiste Onofré
>>> >> jbono...@apache.org
>>> >> http://blog.nanthrax.net
>>> >> Talend - http://www.talend.com
>>> >>
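
For anyone skimming the thread, here is a rough sketch of what the "immutable coder per versioned URN" idea, combined with the MetadataCoderV2 option, could look like. It is plain Java with no Beam dependency. Every name in it (VersionedCoderSketch, FileMetadata, SimpleCoder, the example:coder:... URNs) is made up for illustration and is not Beam's Coder or Metadata API. It also assumes that the URN used to write the data is stored somewhere alongside it (for example in the pipeline definition), which is the part the thread has not settled.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Illustration only: none of these types or URNs exist in Beam.
class VersionedCoderSketch {

  /** Simplified stand-in for file metadata; lastModifiedMillis is the field added in v2. */
  static final class FileMetadata {
    final String path;
    final long sizeBytes;
    final long lastModifiedMillis; // 0 when the encoding did not carry it (v1 data)

    FileMetadata(String path, long sizeBytes, long lastModifiedMillis) {
      this.path = path;
      this.sizeBytes = sizeBytes;
      this.lastModifiedMillis = lastModifiedMillis;
    }
  }

  /** One fixed wire format. Once a URN is published, its byte layout never changes. */
  interface SimpleCoder {
    void encode(FileMetadata value, DataOutputStream out) throws IOException;
    FileMetadata decode(DataInputStream in) throws IOException;
  }

  /** Original layout: path, size. */
  static final class MetadataCoderV1 implements SimpleCoder {
    @Override public void encode(FileMetadata value, DataOutputStream out) throws IOException {
      out.writeUTF(value.path);
      out.writeLong(value.sizeBytes);
    }
    @Override public FileMetadata decode(DataInputStream in) throws IOException {
      String path = in.readUTF();
      long size = in.readLong();
      return new FileMetadata(path, size, 0L); // v1 bytes have no timestamp
    }
  }

  /** New layout: path, size, lastModifiedMillis. Registered under a new URN. */
  static final class MetadataCoderV2 implements SimpleCoder {
    @Override public void encode(FileMetadata value, DataOutputStream out) throws IOException {
      out.writeUTF(value.path);
      out.writeLong(value.sizeBytes);
      out.writeLong(value.lastModifiedMillis);
    }
    @Override public FileMetadata decode(DataInputStream in) throws IOException {
      String path = in.readUTF();
      long size = in.readLong();
      long lastModified = in.readLong();
      return new FileMetadata(path, size, lastModified);
    }
  }

  static final String METADATA_V1 = "example:coder:file_metadata:v1";
  static final String METADATA_V2 = "example:coder:file_metadata:v2";

  // URN -> coder. Old entries stay registered so old data remains readable.
  private static final Map<String, SimpleCoder> REGISTRY = new HashMap<>();
  static {
    REGISTRY.put(METADATA_V1, new MetadataCoderV1());
    REGISTRY.put(METADATA_V2, new MetadataCoderV2());
  }

  static SimpleCoder forUrn(String urn) {
    SimpleCoder coder = REGISTRY.get(urn);
    if (coder == null) {
      throw new IllegalArgumentException("Unknown coder URN: " + urn);
    }
    return coder;
  }

  static byte[] encode(String urn, FileMetadata value) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(bytes);
      forUrn(urn).encode(value, out);
      out.flush();
      return bytes.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  static FileMetadata decode(String urn, byte[] data) {
    try {
      return forUrn(urn).decode(new DataInputStream(new ByteArrayInputStream(data)));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

The point of the sketch is only that v1 bytes stay decodable after v2 is added, because the reader selects the coder by the URN recorded at write time rather than by whatever the current SDK infers. It says nothing about how a runner would detect or reject a changed coder during an update.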