Re: Evolving a Coder for an added field

Lukasz Cwik Mon, 26 Nov 2018 08:48:11 -0800

Reuven was one of the people I reached out to on this matter and he replied
on this thread.


On Mon, Nov 26, 2018 at 7:07 AM Robert Bradshaw <[email protected]> wrote:

> Modifying an existing coder is a non-starter until we have a versioning
> story. Creating an entirely new coder should definitely be possible, and
> using it either opt-in or, if a good enough case can be made, possibly even
> opt-out could get this unblocked.
>
> On Mon, Nov 26, 2018 at 3:05 PM Jeff Klukas <[email protected]> wrote:
>
>> Lukasz - Were you able to get any more context on the possibility of
>> versioning coders from other folks at Google?
>>
>> It sounds like adding versioning for coders and/or schemas is potentially
>> a large change. At this point, should I just write up some highlights from
>> this thread in a JIRA issue for future tracking?
>>
>> On Mon, Nov 12, 2018 at 8:23 PM Reuven Lax <[email protected]> wrote:
>>
>>> A few thoughts:
>>>
>>> 1. I agree with you about coder versioning. The lack of a good story
>>> around versioning has been a huge pain here, and it's unfortunate that
>>> nobody ever worked on this.
>>>
>>> 2. I think versioning schemas will be easier than versioning coders
>>> (especially for adding new fields). In many cases I suggest we start
>>> looking at migrating as much as possible to schemas, and in Beam 3.0 maybe
>>> we can migrate all of our internal payload to schemas. Schemas support
>>> nested fields, repeated fields, and map fields - which can model most things.
>>>
>>> 3. There was a Beam proposal for a way to generically handle
>>> incompatible schema updates via snapshots. The idea was that such updates
>>> can be accompanied by a transform that maps a pipeline snapshot into a new
>>> snapshot with the encodings modified.
>>>
>>> Reuven
>>>
>>> On Tue, Nov 13, 2018 at 3:16 AM Jeff Klukas <[email protected]> wrote:
>>>
>>>> Conversation here has fizzled, but sounds like there's basically a
>>>> consensus here on a need for a new concept of Coder versioning that's
>>>> accessible at the Java level in order to allow an evolution path. Further,
>>>> it sounds like my open PR [0] for adding a new field to Metadata is
>>>> essentially blocked until we have coder versioning in place.
>>>>
>>>> Is there any existing documentation of these concepts, or should I go
>>>> ahead and file a new Jira issue summarizing the problem? I don't think I
>>>> have a comprehensive enough understanding of the Coder machinery to be able
>>>> to design a solution, so I'd need to hand this off or simply leave it in
>>>> the Jira backlog.
>>>>
>>>> [0] https://github.com/apache/beam/pull/6914
>>>>
>>>>
>>>> On Tue, Nov 6, 2018 at 4:38 AM Robert Bradshaw <[email protected]> wrote:
>>>>
>>>>> Yes, a Coder author should be able to register a URN with a mapping
>>>>> from (components + payload) -> Coder (and vice versa), and this should
>>>>> be more lightweight than manually editing the proto files.
>>>>> On Mon, Nov 5, 2018 at 7:12 PM Thomas Weise <[email protected]> wrote:
>>>>> >
>>>>> > +1
>>>>> >
>>>>> > I think that coders should be immutable/versioned. The SDK should
>>>>> > know about all the available versions and be able to associate the
>>>>> > data (stream or at rest) with the corresponding coder version via URN.
>>>>> > We can also look at how that is solved elsewhere, for example the
>>>>> > Kafka schema registry.
>>>>> >
>>>>> > Today we only have a few URNs for standard coders:
>>>>> > https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/beam_runner_api.proto#L617
>>>>> >
>>>>> > I imagine we will need a coder registry where IOs and users can add
>>>>> > their versioned coders also?
>>>>> >
>>>>> > Thanks,
>>>>> > Thomas
>>>>> >
>>>>> >
>>>>> > On Mon, Nov 5, 2018 at 7:54 AM Jean-Baptiste Onofré <[email protected]> wrote:
>>>>> >>
>>>>> >> It makes sense to have a more concrete URN including the version.
>>>>> >>
>>>>> >> Good idea Robert.
>>>>> >>
>>>>> >> Regards
>>>>> >> JB
>>>>> >>
>>>>> >> On 05/11/2018 16:52, Robert Bradshaw wrote:
>>>>> >> > I think we'll want to allow upgrades across SDK versions. A runner
>>>>> >> > should be able to recognize when a coder (or any other aspect of the
>>>>> >> > pipeline) has changed and adapt/reject accordingly. (Until we remove
>>>>> >> > coders from sources/sinks, there's also possibly the expectation that
>>>>> >> > one should be able to read data from a source written with that same
>>>>> >> > coder across versions as well.)
>>>>> >> >
>>>>> >> > I think it really comes down to how coders are named. If we decide to
>>>>> >> > let coders change arbitrarily between versions, probably the URN for
>>>>> >> > SerializedJavaCoder should have the SDK version number in it. Coders
>>>>> >> > that are stable across SDKs can have better, more stable URNs defined
>>>>> >> > and registered.
>>>>> >> >
>>>>> >> > I am more OK with changing the registry to infer different coders as
>>>>> >> > the SDK evolves (which would be detected and manually overwritten with
>>>>> >> > the old ones, on a case-by-case basis, if they still exist). This
>>>>> >> > should still be done with caution as it will make upgrading harder.
>>>>> >> > Highly composite, experimental coders should possibly be designed in
>>>>> >> > an intrinsically extensible way.
>>>>> >> >
>>>>> >> > On Mon, Nov 5, 2018 at 4:24 PM Jean-Baptiste Onofré <[email protected]> wrote:
>>>>> >> >>
>>>>> >> >> That's really a pita. It's an important and impacting change.
>>>>> >> >>
>>>>> >> >> I would go to 1.
>>>>> >> >>
>>>>> >> >> For LTS, as already said, I would create an LTS branch and only
>>>>> >> >> cherry-pick some changes. Using master as the LTS release branch
>>>>> >> >> won't work IMHO.
>>>>> >> >>
>>>>> >> >> Regards
>>>>> >> >> JB
>>>>> >> >>
>>>>> >> >> On 05/11/2018 15:47, Ismaël Mejía wrote:
>>>>> >> >>> For some extra context: this change touches more than FileIO. In
>>>>> >> >>> reality it will affect updates of any file-based pipeline, because
>>>>> >> >>> the metadata on each file will now have an extra field for the
>>>>> >> >>> lastModifiedDate.
>>>>> >> >>>
>>>>> >> >>> The PR looks perfect; the only issue is the backwards-compatibility
>>>>> >> >>> Coder question. Knowing that Dataflow is probably the only one
>>>>> >> >>> affected, I would like to know what we can do.
>>>>> >> >>>
>>>>> >> >>> [1] Should we merge and the Coder updatability be tied to SDK versions
>>>>> >> >>> (which makes sense and is probably more aligned with the LTS
>>>>> >> >>> discussion)?
>>>>> >> >>> [2] Should we have a MetadataCoderV2 (does this imply a repeated
>>>>> >> >>> Metadata object)? In this case where is the right place to identify
>>>>> >> >>> and decide what coder to use?
>>>>> >> >>>
>>>>> >> >>> Other ideas... ?
>>>>> >> >>>
>>>>> >> >>> Last thing, the link that Luke shared does not seem to work (it looks
>>>>> >> >>> like a googley-friendly URL). Here is the full URL for those
>>>>> >> >>> interested in the drain/update proposal:
>>>>> >> >>>
>>>>> >> >>> [2] https://docs.google.com/document/d/1UWhnYPgui0gUYOsuGcCjLuoOUlGA4QaY91n8p3wz9MY/edit#
>>>>> >> >>> On Fri, Nov 2, 2018 at 10:11 PM Lukasz Cwik <[email protected]> wrote:
>>>>> >> >>>>
>>>>> >> >>>> I think the idea is that you would use one coder for paths where
>>>>> >> >>>> you don't need this information and would have FileIO provide a
>>>>> >> >>>> separate path that uses your updated coder.
>>>>> >> >>>> Existing users would not be impacted and users of the new FileIO
>>>>> >> >>>> that depend on this information would not be able to have updated
>>>>> >> >>>> their pipeline in the first place.
>>>>> >> >>>>
>>>>> >> >>>> If the feature in FileIO is experimental, we could choose to break
>>>>> >> >>>> it for existing users though since I don't know how feasible my
>>>>> >> >>>> suggestion above is.
>>>>> >> >>>>
>>>>> >> >>>>
>>>>> >> >>>>
>>>>> >> >>>> On Fri, Nov 2, 2018 at 12:56 PM Jeff Klukas <[email protected]> wrote:
>>>>> >> >>>>>
>>>>> >> >>>>> Lukasz - Thanks for those links. That's very helpful context.
>>>>> >> >>>>>
>>>>> >> >>>>> It sounds like there's no explicit user contract about evolving
>>>>> >> >>>>> Coder classes in the Java SDK and users might reasonably assume
>>>>> >> >>>>> Coders to be stable between SDK versions. Thus, users of the
>>>>> >> >>>>> Dataflow or Flink runners might reasonably expect that they can
>>>>> >> >>>>> update the Java SDK version used in their pipeline when
>>>>> >> >>>>> performing an update.
>>>>> >> >>>>>
>>>>> >> >>>>> Based on that understanding, evolving a class like Metadata might
>>>>> >> >>>>> not be possible except in a major version bump where it's obvious
>>>>> >> >>>>> to users to expect breaking changes and not to expect an "update"
>>>>> >> >>>>> operation to work.
>>>>> >> >>>>>
>>>>> >> >>>>> It's not clear to me what changing the "name" of a coder would
>>>>> >> >>>>> look like or whether that's a tenable solution here. Would that
>>>>> >> >>>>> change be able to happen within the SDK itself, or is it
>>>>> >> >>>>> something users would need to specify?
>>>>> >> >>
>>>>> >> >> --
>>>>> >> >> Jean-Baptiste Onofré
>>>>> >> >> [email protected]
>>>>> >> >> http://blog.nanthrax.net
>>>>> >> >> Talend - http://www.talend.com
>>>>> >>
>>>>> >> --
>>>>> >> Jean-Baptiste Onofré
>>>>> >> [email protected]
>>>>> >> http://blog.nanthrax.net
>>>>> >> Talend - http://www.talend.com
>>>>>
>>>>
