Modifying an existing coder is a non-starter until we have a versioning story. Creating an entirely new coder should definitely be possible, though. Making the new coder opt-in (or, if a good enough case can be made, possibly even opt-out) could get this unblocked.
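
To make the opt-in idea concrete, here is a rough sketch of keeping the old coder untouched and adding a second one next to it. Everything in it is an assumption for illustration: `Metadata`, `SimpleCoder`, `MetadataCoderV1`, `MetadataCoderV2`, and `coderFor` are simplified stand-ins that use only the JDK, not Beam's actual classes or wire format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Illustrative only: Metadata and SimpleCoder are simplified stand-ins,
// not Beam's actual classes.
final class MetadataCoderSketch {

  static final class Metadata {
    final String resourceId;
    final long sizeBytes;
    final long lastModifiedMillis; // the new field; 0 means "unknown"

    Metadata(String resourceId, long sizeBytes, long lastModifiedMillis) {
      this.resourceId = resourceId;
      this.sizeBytes = sizeBytes;
      this.lastModifiedMillis = lastModifiedMillis;
    }
  }

  interface SimpleCoder<T> {
    void encode(T value, DataOutputStream out) throws IOException;

    T decode(DataInputStream in) throws IOException;
  }

  // Existing wire format, left byte-for-byte unchanged.
  static final class MetadataCoderV1 implements SimpleCoder<Metadata> {
    @Override
    public void encode(Metadata value, DataOutputStream out) throws IOException {
      out.writeUTF(value.resourceId);
      out.writeLong(value.sizeBytes);
    }

    @Override
    public Metadata decode(DataInputStream in) throws IOException {
      // The old format carries no timestamp, so the new field defaults to 0.
      return new Metadata(in.readUTF(), in.readLong(), 0L);
    }
  }

  // New coder: a separate class with its own wire format.
  static final class MetadataCoderV2 implements SimpleCoder<Metadata> {
    @Override
    public void encode(Metadata value, DataOutputStream out) throws IOException {
      out.writeUTF(value.resourceId);
      out.writeLong(value.sizeBytes);
      out.writeLong(value.lastModifiedMillis);
    }

    @Override
    public Metadata decode(DataInputStream in) throws IOException {
      return new Metadata(in.readUTF(), in.readLong(), in.readLong());
    }
  }

  // Opt-in switch: callers keep the old coder unless they ask for the new field.
  static SimpleCoder<Metadata> coderFor(boolean withLastModified) {
    if (withLastModified) {
      return new MetadataCoderV2();
    }
    return new MetadataCoderV1();
  }

  static byte[] toBytes(SimpleCoder<Metadata> coder, Metadata value) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      coder.encode(value, new DataOutputStream(bytes));
      return bytes.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  static Metadata fromBytes(SimpleCoder<Metadata> coder, byte[] data) {
    try {
      return coder.decode(new DataInputStream(new ByteArrayInputStream(data)));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

In this sketch, pipelines that never call `coderFor(true)` keep producing exactly the old bytes, which is the property an update would rely on. Flipping the default to the new coder would be the opt-out variant.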
On Mon, Nov 26, 2018 at 3:05 PM Jeff Klukas <jklu...@mozilla.com> wrote:
> Lukasz - Were you able to get any more context on the possibility of
> versioning coders from other folks at Google?
>
> It sounds like adding versioning for coders and/or schemas is potentially
> a large change. At this point, should I just write up some highlights from
> this thread in a JIRA issue for future tracking?
>
> On Mon, Nov 12, 2018 at 8:23 PM Reuven Lax <re...@google.com> wrote:
>
>> A few thoughts:
>>
>> 1. I agree with you about coder versioning. The lack of a good story
>> around versioning has been a huge pain here, and it's unfortunate that
>> nobody ever worked on this.
>>
>> 2. I think versioning schemas will be easier than versioning coders
>> (especially for adding new fields). In many cases I suggest we start
>> looking at migrating as much as possible to schemas, and in Beam 3.0 maybe
>> we can migrate all of our internal payload to schemas. Schemas support
>> nested fields, repeated fields, and map fields - which can model most things.
>>
>> 3. There was a Beam proposal for a way to generically handle incompatible
>> schema updates via snapshots. The idea was that such updates can be
>> accompanied by a transform that maps a pipeline snapshot into a new
>> snapshot with the encodings modified.
>>
>> Reuven
>>
>> On Tue, Nov 13, 2018 at 3:16 AM Jeff Klukas <jklu...@mozilla.com> wrote:
>>
>>> Conversation here has fizzled, but sounds like there's basically a
>>> consensus here on a need for a new concept of Coder versioning that's
>>> accessible at the Java level in order to allow an evolution path. Further,
>>> it sounds like my open PR [0] for adding a new field to Metadata is
>>> essentially blocked until we have coder versioning in place.
>>>
>>> Is there any existing documentation of these concepts, or should I go
>>> ahead and file a new Jira issue summarizing the problem?
>>> I don't think I have a comprehensive enough understanding of the Coder
>>> machinery to be able to design a solution, so I'd need to hand this off
>>> or simply leave it in the Jira backlog.
>>>
>>> [0] https://github.com/apache/beam/pull/6914
>>>
>>> On Tue, Nov 6, 2018 at 4:38 AM Robert Bradshaw <rober...@google.com> wrote:
>>>
>>>> Yes, a Coder author should be able to register a URN with a mapping
>>>> from (components + payload) -> Coder (and vice versa), and this should
>>>> be more lightweight than manually editing the proto files.
>>>>
>>>> On Mon, Nov 5, 2018 at 7:12 PM Thomas Weise <t...@apache.org> wrote:
>>>> >
>>>> > +1
>>>> >
>>>> > I think that coders should be immutable/versioned. The SDK should
>>>> > know about all the available versions and be able to associate the data
>>>> > (stream or at rest) with the corresponding coder version via URN. We can
>>>> > also look how that is solved elsewhere, for example the Kafka schema
>>>> > registry.
>>>> >
>>>> > Today we only have a few URNs for standard coders:
>>>> > https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L617
>>>> >
>>>> > I imagine we will need a coder registry where IOs and users can add
>>>> > their versioned coders also?
>>>> >
>>>> > Thanks,
>>>> > Thomas
>>>> >
>>>> > On Mon, Nov 5, 2018 at 7:54 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>> >>
>>>> >> It makes sense to have a more concrete URN including the version.
>>>> >>
>>>> >> Good idea Robert.
>>>> >>
>>>> >> Regards
>>>> >> JB
>>>> >>
>>>> >> On 05/11/2018 16:52, Robert Bradshaw wrote:
>>>> >> > I think we'll want to allow upgrades across SDK versions. A runner
>>>> >> > should be able to recognize when a coder (or any other aspect of the
>>>> >> > pipeline) has changed and adapt/reject accordingly.
>>>> >> > (Until we remove coders from sources/sinks, there's also possibly the
>>>> >> > expectation that one should be able to read data from a source written
>>>> >> > with that same coder across versions as well.)
>>>> >> >
>>>> >> > I think it really comes down to how coders are named. If we decide to
>>>> >> > let coders change arbitrarily between versions, probably the URN for
>>>> >> > SerializedJavaCoder should have the SDK version number in it. Coders
>>>> >> > that are stable across SDKs can have better, more stable URNs defined
>>>> >> > and registered.
>>>> >> >
>>>> >> > I am more OK with changing the registry to infer different coders as
>>>> >> > the SDK evolves (which would be detected and manually overwritten with
>>>> >> > the old ones, on a case-by-case basis, if they still exist). This
>>>> >> > should still be done with caution as it will make upgrading harder.
>>>> >> > Highly composite, experimental coders should possibly be designed in
>>>> >> > an intrinsically extensible way.
>>>> >> >
>>>> >> > On Mon, Nov 5, 2018 at 4:24 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>> >> >>
>>>> >> >> That's really a pita. It's an important and impacting change.
>>>> >> >>
>>>> >> >> I would go to 1.
>>>> >> >>
>>>> >> >> For LTS, as already said, I would create a LTS branch and only cherry
>>>> >> >> pick some changes. Using master as LTS release branch won't work IMHO.
>>>> >> >>
>>>> >> >> Regards
>>>> >> >> JB
>>>> >> >>
>>>> >> >> On 05/11/2018 15:47, Ismaël Mejía wrote:
>>>> >> >>> For some extra context this change touches more than FileIO, in
>>>> >> >>> reality this will affect updates in any file-based pipelines because
>>>> >> >>> the metadata on each file will have now an extra field for the
>>>> >> >>> lastModifiedDate.
>>>> >> >>>
>>>> >> >>> The PR looks perfect, only issue is the backwards compatibility Coder
>>>> >> >>> question.
>>>> >> >>> Knowing that probably Dataflow is the only one affected, I
>>>> >> >>> would like to know what can we do?
>>>> >> >>>
>>>> >> >>> [1] Should we merge and the Coder updatability be tied to SDK versions
>>>> >> >>> (which makes sense and is probably more aligned with the LTS
>>>> >> >>> discussion)?
>>>> >> >>> [2] Should we have a MetadataCoderV2? (Does this imply a repeated
>>>> >> >>> Metadata object?) In this case where is the right place to identify
>>>> >> >>> and decide what coder to use?
>>>> >> >>>
>>>> >> >>> Other ideas... ?
>>>> >> >>>
>>>> >> >>> Last thing, the link that Luke shared does not seem to work (looks
>>>> >> >>> like a googley-friendly URL). Here is the full URL for those
>>>> >> >>> interested in the drain/update proposal:
>>>> >> >>>
>>>> >> >>> [2] https://docs.google.com/document/d/1UWhnYPgui0gUYOsuGcCjLuoOUlGA4QaY91n8p3wz9MY/edit#
>>>> >> >>>
>>>> >> >>> On Fri, Nov 2, 2018 at 10:11 PM Lukasz Cwik <lc...@google.com> wrote:
>>>> >> >>>>
>>>> >> >>>> I think the idea is that you would use one coder for paths where
>>>> >> >>>> you don't need this information and would have FileIO provide a
>>>> >> >>>> separate path that uses your updated coder.
>>>> >> >>>> Existing users would not be impacted and users of the new FileIO
>>>> >> >>>> that depend on this information would not be able to have updated
>>>> >> >>>> their pipeline in the first place.
>>>> >> >>>>
>>>> >> >>>> If the feature in FileIO is experimental, we could choose to break
>>>> >> >>>> it for existing users though since I don't know how feasible my
>>>> >> >>>> suggestion above is.
>>>> >> >>>>
>>>> >> >>>> On Fri, Nov 2, 2018 at 12:56 PM Jeff Klukas <jklu...@mozilla.com> wrote:
>>>> >> >>>>>
>>>> >> >>>>> Lukasz - Thanks for those links. That's very helpful context.
>>>> >> >>>>>
>>>> >> >>>>> It sounds like there's no explicit user contract about evolving
>>>> >> >>>>> Coder classes in the Java SDK and users might reasonably assume
>>>> >> >>>>> Coders to be stable between SDK versions. Thus, users of the
>>>> >> >>>>> Dataflow or Flink runners might reasonably expect that they can
>>>> >> >>>>> update the Java SDK version used in their pipeline when
>>>> >> >>>>> performing an update.
>>>> >> >>>>>
>>>> >> >>>>> Based on that understanding, evolving a class like Metadata might
>>>> >> >>>>> not be possible except in a major version bump where it's obvious
>>>> >> >>>>> to users to expect breaking changes and not to expect an "update"
>>>> >> >>>>> operation to work.
>>>> >> >>>>>
>>>> >> >>>>> It's not clear to me what changing the "name" of a coder would
>>>> >> >>>>> look like or whether that's a tenable solution here. Would that
>>>> >> >>>>> change be able to happen within the SDK itself, or is it something
>>>> >> >>>>> users would need to specify?
>>>> >> >>
>>>> >> >> --
>>>> >> >> Jean-Baptiste Onofré
>>>> >> >> jbono...@apache.org
>>>> >> >> http://blog.nanthrax.net
>>>> >> >> Talend - http://www.talend.com
>>>> >>
>>>> >> --
>>>> >> Jean-Baptiste Onofré
>>>> >> jbono...@apache.org
>>>> >> http://blog.nanthrax.net
>>>> >> Talend - http://www.talend.com
>>>>
>>>
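
For reference, the registry idea quoted above (a coder author registers a URN that maps to a coder, with the version carried in the URN) might look roughly like the sketch below. This is only an illustration under stated assumptions: the `example:coder:string:v1` and `example:coder:string:v2` URNs, the `StringCoder` interface, and `CoderUrnRegistrySketch` are all hypothetical, it handles only the payload and leaves out component coders, and it is not how Beam's runner API is implemented.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical registry: maps a versioned URN to a factory that builds a coder
// from an opaque payload. Component coders are omitted to keep the sketch short.
final class CoderUrnRegistrySketch {

  interface StringCoder {
    byte[] encode(String value);

    String decode(byte[] data);
  }

  private final Map<String, Function<byte[], StringCoder>> factories = new HashMap<>();

  // A URN is treated as immutable: re-registering one is rejected, so a new
  // wire format has to come with a new URN.
  void register(String urn, Function<byte[], StringCoder> factory) {
    if (factories.putIfAbsent(urn, factory) != null) {
      throw new IllegalArgumentException("URN already registered: " + urn);
    }
  }

  StringCoder resolve(String urn, byte[] payload) {
    Function<byte[], StringCoder> factory = factories.get(urn);
    if (factory == null) {
      throw new IllegalArgumentException("Unknown coder URN: " + urn);
    }
    return factory.apply(payload);
  }

  // Two made-up versions of the "same" coder with different wire formats.
  // Neither example uses its payload.
  static CoderUrnRegistrySketch withExampleCoders() {
    CoderUrnRegistrySketch registry = new CoderUrnRegistrySketch();
    registry.register(
        "example:coder:string:v1", payload -> charsetCoder(StandardCharsets.UTF_8));
    registry.register(
        "example:coder:string:v2", payload -> charsetCoder(StandardCharsets.UTF_16BE));
    return registry;
  }

  private static StringCoder charsetCoder(Charset charset) {
    return new StringCoder() {
      @Override
      public byte[] encode(String value) {
        return value.getBytes(charset);
      }

      @Override
      public String decode(byte[] data) {
        return new String(data, charset);
      }
    };
  }
}
```

With something like this, data written under the `v1` URN can still be decoded after the SDK starts preferring `v2`, as long as the old registration is kept around.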