Re: Evolving a Coder for an added field

Robert Bradshaw Mon, 26 Nov 2018 07:10:46 -0800

Modifying an existing coder is a non-starter until we have a versioning
story. Creating an entirely new coder should definitely be possible, and
using it either opt-in or, if a good enough case can be made, possibly even
opt-out could get this unblocked.
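
To make the "entirely new coder" route concrete, here is a minimal sketch in
plain Java. Nothing below is Beam API: FileMeta, SketchCoder, FileMetaCoderV1,
and FileMetaCoderV2 are hypothetical names, and the byte layout is assumed
purely for illustration. The only point is that the V1 wire format stays frozen
while the added field lives in a separate coder that users opt in to.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical stand-in for a file metadata value; not the Beam Metadata class. */
final class FileMeta {
  final String resourceId;
  final long sizeBytes;
  final long lastModifiedMillis; // the added field; 0 means "unknown"

  FileMeta(String resourceId, long sizeBytes, long lastModifiedMillis) {
    this.resourceId = resourceId;
    this.sizeBytes = sizeBytes;
    this.lastModifiedMillis = lastModifiedMillis;
  }
}

/** Minimal coder interface for this sketch (an assumption, not Beam's Coder). */
interface SketchCoder {
  byte[] encode(FileMeta value);

  FileMeta decode(byte[] data);
}

/** Original wire format. Left untouched so previously encoded data still decodes. */
final class FileMetaCoderV1 implements SketchCoder {
  @Override
  public byte[] encode(FileMeta value) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(bytes);
      out.writeUTF(value.resourceId);
      out.writeLong(value.sizeBytes);
      out.flush();
      return bytes.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  @Override
  public FileMeta decode(byte[] data) {
    try {
      DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
      // V1 never wrote the new field, so it decodes as "unknown" (0).
      return new FileMeta(in.readUTF(), in.readLong(), 0L);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}

/** New, separate coder that also carries the added field. Opt-in only. */
final class FileMetaCoderV2 implements SketchCoder {
  @Override
  public byte[] encode(FileMeta value) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(bytes);
      out.writeUTF(value.resourceId);
      out.writeLong(value.sizeBytes);
      out.writeLong(value.lastModifiedMillis); // extra 8 bytes compared to V1
      out.flush();
      return bytes.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  @Override
  public FileMeta decode(byte[] data) {
    try {
      DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
      return new FileMeta(in.readUTF(), in.readLong(), in.readLong());
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

In this sketch, a pipeline that never asks for the new field keeps using V1 and
its encoded data is unchanged; a pipeline that opts in to V2 accepts that its
encoded bytes differ from V1.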


On Mon, Nov 26, 2018 at 3:05 PM Jeff Klukas <jklu...@mozilla.com> wrote:

> Lukasz - Were you able to get any more context on the possibility of
> versioning coders from other folks at Google?
>
> It sounds like adding versioning for coders and/or schemas is potentially
> a large change. At this point, should I just write up some highlights from
> this thread in a JIRA issue for future tracking?
>
> On Mon, Nov 12, 2018 at 8:23 PM Reuven Lax <re...@google.com> wrote:
>
>> A few thoughts:
>>
>> 1. I agree with you about coder versioning. The lack of a good story
>> around versioning has been a huge pain here, and it's unfortunate that
>> nobody ever worked on this.
>>
>> 2. I think versioning schemas will be easier than versioning coders
>> (especially for adding new fields). In many cases I suggest we start
>> looking at migrating as much as possible to schemas, and in Beam 3.0 maybe
>> we can migrate all of our internal payload to schemas. Schemas support
>> nested fields, repeated fields, and map fields - which can model most things.
>>
>> 3. There was a Beam proposal for a way to generically handle incompatible
>> schema updates via snapshots. The idea was that such updates can be
>> accompanied by a transform that maps a pipeline snapshot into a new
>> snapshot with the encodings modified.
>>
>> Reuven
>>
>> On Tue, Nov 13, 2018 at 3:16 AM Jeff Klukas <jklu...@mozilla.com> wrote:
>>
>>> Conversation here has fizzled, but sounds like there's basically a
>>> consensus here on a need for a new concept of Coder versioning that's
>>> accessible at the Java level in order to allow an evolution path. Further,
>>> it sounds like my open PR [0] for adding a new field to Metadata is
>>> essentially blocked until we have coder versioning in place.
>>>
>>> Is there any existing documentation of these concepts, or should I go
>>> ahead and file a new Jira issue summarizing the problem? I don't think I
>>> have a comprehensive enough understanding of the Coder machinery to be able
>>> to design a solution, so I'd need to hand this off or simply leave it in
>>> the Jira backlog.
>>>
>>> [0] https://github.com/apache/beam/pull/6914
>>>
>>>
>>> On Tue, Nov 6, 2018 at 4:38 AM Robert Bradshaw <rober...@google.com>
>>> wrote:
>>>
>>>> Yes, a Coder author should be able to register a URN with a mapping
>>>> from (components + payload) -> Coder (and vice versa), and this should
>>>> be more lightweight than manually editing the proto files.
>>>> On Mon, Nov 5, 2018 at 7:12 PM Thomas Weise <t...@apache.org> wrote:
>>>> >
>>>> > +1
>>>> >
>>>> > I think that coders should be immutable/versioned. The SDK should
>>>> > know about all the available versions and be able to associate the
>>>> > data (stream or at rest) with the corresponding coder version via URN.
>>>> > We can also look at how that is solved elsewhere, for example the
>>>> > Kafka schema registry.
>>>> >
>>>> > Today we only have a few URNs for standard coders:
>>>> > https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L617
>>>> >
>>>> > I imagine we will need a coder registry where IOs and users can also
>>>> > add their versioned coders?
>>>> >
>>>> > Thanks,
>>>> > Thomas
>>>> >
>>>> >
>>>> > On Mon, Nov 5, 2018 at 7:54 AM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>> >>
>>>> >> It makes sense to have a more concrete URN including the version.
>>>> >>
>>>> >> Good idea Robert.
>>>> >>
>>>> >> Regards
>>>> >> JB
>>>> >>
>>>> >> On 05/11/2018 16:52, Robert Bradshaw wrote:
>>>> >> > I think we'll want to allow upgrades across SDK versions. A runner
>>>> >> > should be able to recognize when a coder (or any other aspect of the
>>>> >> > pipeline) has changed and adapt/reject accordingly. (Until we remove
>>>> >> > coders from sources/sinks, there's also possibly the expectation that
>>>> >> > one should be able to read data from a source written with that same
>>>> >> > coder across versions as well.)
>>>> >> >
>>>> >> > I think it really comes down to how coders are named. If we decide to
>>>> >> > let coders change arbitrarily between versions, probably the URN for
>>>> >> > SerializedJavaCoder should have the SDK version number in it. Coders
>>>> >> > that are stable across SDKs can have better, more stable URNs defined
>>>> >> > and registered.
>>>> >> >
>>>> >> > I am more OK with changing the registry to infer different coders as
>>>> >> > the SDK evolves (which would be detected and manually overwritten with
>>>> >> > the old ones, on a case-by-case basis, if they still exist). This
>>>> >> > should still be done with caution as it will make upgrading harder.
>>>> >> > Highly composite, experimental coders should possibly be designed in
>>>> >> > an intrinsically extensible way.
>>>> >> >
>>>> >> > On Mon, Nov 5, 2018 at 4:24 PM Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>> >> >>
>>>> >> >> That's really a PITA. It's an important, high-impact change.
>>>> >> >>
>>>> >> >> I would go with option [1].
>>>> >> >>
>>>> >> >> For LTS, as already said, I would create an LTS branch and only
>>>> >> >> cherry-pick some changes. Using master as the LTS release branch
>>>> >> >> won't work IMHO.
>>>> >> >>
>>>> >> >> Regards
>>>> >> >> JB
>>>> >> >>
>>>> >> >> On 05/11/2018 15:47, Ismaël Mejía wrote:
>>>> >> >>> For some extra context, this change touches more than FileIO. In
>>>> >> >>> reality it will affect updates of any file-based pipeline, because
>>>> >> >>> the metadata on each file will now have an extra field for the
>>>> >> >>> lastModifiedDate.
>>>> >> >>>
>>>> >> >>> The PR looks perfect; the only issue is the Coder backwards
>>>> >> >>> compatibility question. Knowing that Dataflow is probably the only
>>>> >> >>> one affected, I would like to know what we can do.
>>>> >> >>>
>>>> >> >>> [1] Should we merge and tie Coder updatability to SDK versions
>>>> >> >>> (which makes sense and is probably more aligned with the LTS
>>>> >> >>> discussion)?
>>>> >> >>> [2] Should we have a MetadataCoderV2 (does this imply a repeated
>>>> >> >>> Metadata object)? In that case, where is the right place to
>>>> >> >>> identify and decide which coder to use?
>>>> >> >>>
>>>> >> >>> Other ideas... ?
>>>> >> >>>
>>>> >> >>> Last thing, the link that Luke shared does not seem to work (it
>>>> >> >>> looks like a googley-friendly URL). Here is the full URL for those
>>>> >> >>> interested in the drain/update proposal:
>>>> >> >>>
>>>> >> >>> [2]
>>>> >> >>> https://docs.google.com/document/d/1UWhnYPgui0gUYOsuGcCjLuoOUlGA4QaY91n8p3wz9MY/edit#
>>>> >> >>> On Fri, Nov 2, 2018 at 10:11 PM Lukasz Cwik <lc...@google.com> wrote:
>>>> >> >>>>
>>>> >> >>>> I think the idea is that you would use one coder for paths where
>>>> >> >>>> you don't need this information and would have FileIO provide a
>>>> >> >>>> separate path that uses your updated coder.
>>>> >> >>>> Existing users would not be impacted and users of the new FileIO
>>>> >> >>>> that depend on this information would not be able to have updated
>>>> >> >>>> their pipeline in the first place.
>>>> >> >>>>
>>>> >> >>>> If the feature in FileIO is experimental, we could choose to break
>>>> >> >>>> it for existing users though since I don't know how feasible my
>>>> >> >>>> suggestion above is.
>>>> >> >>>>
>>>> >> >>>>
>>>> >> >>>>
>>>> >> >>>> On Fri, Nov 2, 2018 at 12:56 PM Jeff Klukas <jklu...@mozilla.com> wrote:
>>>> >> >>>>>
>>>> >> >>>>> Lukasz - Thanks for those links. That's very helpful context.
>>>> >> >>>>>
>>>> >> >>>>> It sounds like there's no explicit user contract about evolving
>>>> >> >>>>> Coder classes in the Java SDK and users might reasonably assume
>>>> >> >>>>> Coders to be stable between SDK versions. Thus, users of the
>>>> >> >>>>> Dataflow or Flink runners might reasonably expect that they can
>>>> >> >>>>> update the Java SDK version used in their pipeline when
>>>> >> >>>>> performing an update.
>>>> >> >>>>>
>>>> >> >>>>> Based on that understanding, evolving a class like Metadata might
>>>> >> >>>>> not be possible except in a major version bump where it's obvious
>>>> >> >>>>> to users to expect breaking changes and not to expect an "update"
>>>> >> >>>>> operation to work.
>>>> >> >>>>>
>>>> >> >>>>> It's not clear to me what changing the "name" of a coder would
>>>> >> >>>>> look like or whether that's a tenable solution here. Would that
>>>> >> >>>>> change be able to happen within the SDK itself, or is it something
>>>> >> >>>>> users would need to specify?
>>>> >> >>
>>>> >> >> --
>>>> >> >> Jean-Baptiste Onofré
>>>> >> >> jbono...@apache.org
>>>> >> >> http://blog.nanthrax.net
>>>> >> >> Talend - http://www.talend.com
>>>> >>
>>>> >> --
>>>> >> Jean-Baptiste Onofré
>>>> >> jbono...@apache.org
>>>> >> http://blog.nanthrax.net
>>>> >> Talend - http://www.talend.com
>>>>
>>>
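
For reference, the URN registration idea quoted in this thread (a URN mapped to
a coder built from components plus a payload, with the version carried in the
URN) could look roughly like the following plain-Java sketch. All interface and
class names and the URN strings are hypothetical; this is not how the Beam SDK
or the runner API protos are structured, and only the URN-to-coder direction is
shown.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical coder handle; it only carries a description for this sketch. */
interface NamedCoder {
  String describe();
}

/** Builds a coder from its component coders and an opaque payload. */
interface CoderFactory {
  NamedCoder create(List<NamedCoder> components, byte[] payload);
}

/** Registry keyed by URN, where each URN includes an explicit version. */
final class UrnCoderRegistry {
  private final Map<String, CoderFactory> factories = new ConcurrentHashMap<>();

  /** Registers a factory; a URN is immutable once registered. */
  void register(String urn, CoderFactory factory) {
    if (factories.putIfAbsent(urn, factory) != null) {
      throw new IllegalStateException("URN already registered: " + urn);
    }
  }

  /** Resolves a URN to a coder, rejecting URNs this SDK does not know about. */
  NamedCoder resolve(String urn, List<NamedCoder> components, byte[] payload) {
    CoderFactory factory = factories.get(urn);
    if (factory == null) {
      throw new IllegalArgumentException("Unknown coder URN: " + urn);
    }
    return factory.create(components, payload);
  }
}
```

Registering "v1" and "v2" under distinct URNs keeps both available side by
side, and an unknown URN fails loudly instead of being decoded with the wrong
coder.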
