Reuven was one of the people I reached out to on this matter and he replied on this thread.
On Mon, Nov 26, 2018 at 7:07 AM Robert Bradshaw <rober...@google.com> wrote:

> Modifying an existing coder is a non-starter until we have a versioning
> story. Creating an entirely new coder should definitely be possible, and
> using it either opt-in or, if a good enough case can be made, possibly even
> opt-out could get this unblocked.
>
> On Mon, Nov 26, 2018 at 3:05 PM Jeff Klukas <jklu...@mozilla.com> wrote:
>
>> Lukasz - Were you able to get any more context on the possibility of
>> versioning coders from other folks at Google?
>>
>> It sounds like adding versioning for coders and/or schemas is potentially
>> a large change. At this point, should I just write up some highlights from
>> this thread in a JIRA issue for future tracking?
>>
>> On Mon, Nov 12, 2018 at 8:23 PM Reuven Lax <re...@google.com> wrote:
>>
>>> A few thoughts:
>>>
>>> 1. I agree with you about coder versioning. The lack of a good story
>>> around versioning has been a huge pain here, and it's unfortunate that
>>> nobody ever worked on this.
>>>
>>> 2. I think versioning schemas will be easier than versioning coders
>>> (especially for adding new fields). In many cases I suggest we start
>>> looking at migrating as much as possible to schemas, and in Beam 3.0 maybe
>>> we can migrate all of our internal payload to schemas. Schemas support
>>> nested fields, repeated fields, and map fields - which can model most things.
>>>
>>> 3. There was a Beam proposal for a way to generically handle
>>> incompatible schema updates via snapshots. The idea was that such updates
>>> can be accompanied by a transform that maps a pipeline snapshot into a new
>>> snapshot with the encodings modified.
>>>
>>> Reuven
>>>
>>> On Tue, Nov 13, 2018 at 3:16 AM Jeff Klukas <jklu...@mozilla.com> wrote:
>>>
>>>> Conversation here has fizzled, but sounds like there's basically a
>>>> consensus here on a need for a new concept of Coder versioning that's
>>>> accessible at the Java level in order to allow an evolution path. Further,
>>>> it sounds like my open PR [0] for adding a new field to Metadata is
>>>> essentially blocked until we have coder versioning in place.
>>>>
>>>> Is there any existing documentation of these concepts, or should I go
>>>> ahead and file a new Jira issue summarizing the problem? I don't think I
>>>> have a comprehensive enough understanding of the Coder machinery to be able
>>>> to design a solution, so I'd need to hand this off or simply leave it in
>>>> the Jira backlog.
>>>>
>>>> [0] https://github.com/apache/beam/pull/6914
>>>>
>>>>
>>>> On Tue, Nov 6, 2018 at 4:38 AM Robert Bradshaw <rober...@google.com> wrote:
>>>>
>>>>> Yes, a Coder author should be able to register a URN with a mapping
>>>>> from (components + payload) -> Coder (and vice versa), and this should
>>>>> be more lightweight than manually editing the proto files.
>>>>> On Mon, Nov 5, 2018 at 7:12 PM Thomas Weise <t...@apache.org> wrote:
>>>>> >
>>>>> > +1
>>>>> >
>>>>> > I think that coders should be immutable/versioned. The SDK should
>>>>> > know about all the available versions and be able to associate the data
>>>>> > (stream or at rest) with the corresponding coder version via URN. We can
>>>>> > also look at how that is solved elsewhere, for example the Kafka schema
>>>>> > registry.
>>>>> >
>>>>> > Today we only have a few URNs for standard coders:
>>>>> > https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L617
>>>>> >
>>>>> > I imagine we will need a coder registry where IOs and users can add
>>>>> > their versioned coders also?
>>>>> >
>>>>> > Thanks,
>>>>> > Thomas
>>>>> >
>>>>> >
>>>>> > On Mon, Nov 5, 2018 at 7:54 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>>> >>
>>>>> >> It makes sense to have a more concrete URN including the version.
>>>>> >>
>>>>> >> Good idea Robert.
>>>>> >>
>>>>> >> Regards
>>>>> >> JB
>>>>> >>
>>>>> >> On 05/11/2018 16:52, Robert Bradshaw wrote:
>>>>> >> > I think we'll want to allow upgrades across SDK versions. A runner
>>>>> >> > should be able to recognize when a coder (or any other aspect of the
>>>>> >> > pipeline) has changed and adapt/reject accordingly. (Until we remove
>>>>> >> > coders from sources/sinks, there's also possibly the expectation that
>>>>> >> > one should be able to read data from a source written with that same
>>>>> >> > coder across versions as well.)
>>>>> >> >
>>>>> >> > I think it really comes down to how coders are named. If we decide to
>>>>> >> > let coders change arbitrarily between versions, probably the URN for
>>>>> >> > SerializedJavaCoder should have the SDK version number in it. Coders
>>>>> >> > that are stable across SDKs can have better, more stable URNs defined
>>>>> >> > and registered.
>>>>> >> >
>>>>> >> > I am more OK with changing the registry to infer different coders as
>>>>> >> > the SDK evolves (which would be detected and manually overwritten with
>>>>> >> > the old ones, on a case-by-case basis, if they still exist). This
>>>>> >> > should still be done with caution as it will make upgrading harder.
>>>>> >> > Highly composite, experimental coders should possibly be designed in
>>>>> >> > an intrinsically extensible way.
>>>>> >> >
>>>>> >> > On Mon, Nov 5, 2018 at 4:24 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>>> >> >>
>>>>> >> >> That's really a pita. It's an important and impacting change.
>>>>> >> >>
>>>>> >> >> I would go to 1.
>>>>> >> >>
>>>>> >> >> For LTS, as already said, I would create a LTS branch and only cherry
>>>>> >> >> pick some changes. Using master as LTS release branch won't work IMHO.
>>>>> >> >>
>>>>> >> >> Regards
>>>>> >> >> JB
>>>>> >> >>
>>>>> >> >> On 05/11/2018 15:47, Ismaël Mejía wrote:
>>>>> >> >>> For some extra context this change touches more than FileIO, in
>>>>> >> >>> reality this will affect updates in any file-based pipelines because
>>>>> >> >>> the metadata on each file will now have an extra field for the
>>>>> >> >>> lastModifiedDate.
>>>>> >> >>>
>>>>> >> >>> The PR looks perfect, only issue is the backwards compatibility Coder
>>>>> >> >>> question. Knowing that probably Dataflow is the only one affected, I
>>>>> >> >>> would like to know what we can do?
>>>>> >> >>>
>>>>> >> >>> [1] Should we merge and the Coder updatability be tied to SDK versions
>>>>> >> >>> (which makes sense and is probably more aligned with the LTS
>>>>> >> >>> discussion)?
>>>>> >> >>> [2] Should we have a MetadataCoderV2? (does this imply a repeated
>>>>> >> >>> Metadata object) ? In this case where is the right place to identify
>>>>> >> >>> and decide what coder to use?
>>>>> >> >>>
>>>>> >> >>> Other ideas... ?
>>>>> >> >>>
>>>>> >> >>> Last thing, the link that Luke shared does not seem to work (looks
>>>>> >> >>> like a googley-friendly URL), here it is the full URL for those
>>>>> >> >>> interested in the drain/update proposal:
>>>>> >> >>>
>>>>> >> >>> [2] https://docs.google.com/document/d/1UWhnYPgui0gUYOsuGcCjLuoOUlGA4QaY91n8p3wz9MY/edit#
>>>>> >> >>> On Fri, Nov 2, 2018 at 10:11 PM Lukasz Cwik <lc...@google.com> wrote:
>>>>> >> >>>>
>>>>> >> >>>> I think the idea is that you would use one coder for paths where you
>>>>> >> >>>> don't need this information and would have FileIO provide a separate
>>>>> >> >>>> path that uses your updated coder.
>>>>> >> >>>> Existing users would not be impacted and users of the new FileIO
>>>>> >> >>>> that depend on this information would not be able to have updated
>>>>> >> >>>> their pipeline in the first place.
>>>>> >> >>>>
>>>>> >> >>>> If the feature in FileIO is experimental, we could choose to break
>>>>> >> >>>> it for existing users though since I don't know how feasible my
>>>>> >> >>>> suggestion above is.
>>>>> >> >>>>
>>>>> >> >>>>
>>>>> >> >>>>
>>>>> >> >>>> On Fri, Nov 2, 2018 at 12:56 PM Jeff Klukas <jklu...@mozilla.com> wrote:
>>>>> >> >>>>>
>>>>> >> >>>>> Lukasz - Thanks for those links. That's very helpful context.
>>>>> >> >>>>>
>>>>> >> >>>>> It sounds like there's no explicit user contract about evolving
>>>>> >> >>>>> Coder classes in the Java SDK and users might reasonably assume
>>>>> >> >>>>> Coders to be stable between SDK versions. Thus, users of the
>>>>> >> >>>>> Dataflow or Flink runners might reasonably expect that they can
>>>>> >> >>>>> update the Java SDK version used in their pipeline when performing
>>>>> >> >>>>> an update.
>>>>> >> >>>>>
>>>>> >> >>>>> Based on that understanding, evolving a class like Metadata might
>>>>> >> >>>>> not be possible except in a major version bump where it's obvious
>>>>> >> >>>>> to users to expect breaking changes and not to expect an "update"
>>>>> >> >>>>> operation to work.
>>>>> >> >>>>>
>>>>> >> >>>>> It's not clear to me what changing the "name" of a coder would look
>>>>> >> >>>>> like or whether that's a tenable solution here. Would that change
>>>>> >> >>>>> be able to happen within the SDK itself, or is it something users
>>>>> >> >>>>> would need to specify?
>>>>> >> >>
>>>>> >> >> --
>>>>> >> >> Jean-Baptiste Onofré
>>>>> >> >> jbono...@apache.org
>>>>> >> >> http://blog.nanthrax.net
>>>>> >> >> Talend - http://www.talend.com
>>>>> >>
>>>>> >> --
>>>>> >> Jean-Baptiste Onofré
>>>>> >> jbono...@apache.org
>>>>> >> http://blog.nanthrax.net
>>>>> >> Talend - http://www.talend.com
>>>>>
>>>>
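
For anyone trying to picture the "immutable, versioned coders looked up by URN" idea discussed in this thread, here is a rough sketch in plain Java. It is only an illustration under stated assumptions, not Beam's actual Coder API: the class names, the `example:coder:file_meta:v1`/`v2` URNs, the `FileMeta` stand-in for Metadata, and the toy text encoding are all hypothetical. The point is just that a v2 coder can add lastModifiedMillis under a new URN while data tagged with the v1 URN keeps decoding with the untouched v1 coder.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a URN-keyed registry of immutable coder versions (not Beam API). */
final class VersionedCoderSketch {

    /** Simplified stand-in for file metadata; lastModifiedMillis is the newly added field. */
    static final class FileMeta {
        final String path;
        final long sizeBytes;
        final long lastModifiedMillis;

        FileMeta(String path, long sizeBytes, long lastModifiedMillis) {
            this.path = path;
            this.sizeBytes = sizeBytes;
            this.lastModifiedMillis = lastModifiedMillis;
        }
    }

    /** Minimal stand-in for a coder: value to bytes and back. */
    interface SimpleCoder {
        byte[] encode(FileMeta value);
        FileMeta decode(byte[] bytes);
    }

    // Hypothetical URNs; the version is part of the name, so a URN never changes meaning.
    static final String URN_V1 = "example:coder:file_meta:v1";
    static final String URN_V2 = "example:coder:file_meta:v2";

    /** v1 wire format: "size|path". It knows nothing about lastModifiedMillis. */
    static final SimpleCoder V1 = new SimpleCoder() {
        public byte[] encode(FileMeta m) {
            return (m.sizeBytes + "|" + m.path).getBytes(StandardCharsets.UTF_8);
        }
        public FileMeta decode(byte[] b) {
            String[] parts = new String(b, StandardCharsets.UTF_8).split("\\|", 2);
            return new FileMeta(parts[1], Long.parseLong(parts[0]), 0L); // default for missing field
        }
    };

    /** v2 wire format: "size|lastModifiedMillis|path". */
    static final SimpleCoder V2 = new SimpleCoder() {
        public byte[] encode(FileMeta m) {
            return (m.sizeBytes + "|" + m.lastModifiedMillis + "|" + m.path)
                .getBytes(StandardCharsets.UTF_8);
        }
        public FileMeta decode(byte[] b) {
            String[] parts = new String(b, StandardCharsets.UTF_8).split("\\|", 3);
            return new FileMeta(parts[2], Long.parseLong(parts[0]), Long.parseLong(parts[1]));
        }
    };

    private final Map<String, SimpleCoder> codersByUrn = new HashMap<>();

    /** Registers a coder under a URN; an existing URN can never be rebound. */
    void register(String urn, SimpleCoder coder) {
        if (codersByUrn.putIfAbsent(urn, coder) != null) {
            throw new IllegalStateException("URN already registered: " + urn);
        }
    }

    /** Resolves the coder that wrote the data, or fails loudly for an unknown URN. */
    SimpleCoder lookup(String urn) {
        SimpleCoder coder = codersByUrn.get(urn);
        if (coder == null) {
            throw new IllegalArgumentException("Unknown coder URN: " + urn);
        }
        return coder;
    }

    /** Builds a registry that knows both versions. */
    static VersionedCoderSketch withDefaults() {
        VersionedCoderSketch registry = new VersionedCoderSketch();
        registry.register(URN_V1, V1);
        registry.register(URN_V2, V2);
        return registry;
    }

    public static void main(String[] args) {
        VersionedCoderSketch registry = withDefaults();
        FileMeta meta = new FileMeta("gs://bucket/file.txt", 42L, 1541400000000L);
        byte[] oldState = registry.lookup(URN_V1).encode(meta);
        FileMeta restored = registry.lookup(URN_V1).decode(oldState);
        System.out.println(restored.path + " " + restored.sizeBytes + " " + restored.lastModifiedMillis);
    }
}
```

In this sketch a pipeline update would carry the URN alongside the persisted bytes, so state written before the upgrade is read with V1 and new state is written with V2. How a runner would detect and reconcile the two is exactly the open question in the thread and is not addressed here.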