A few months back there was a discussion[1] about stabilizing the protos
used for pipeline execution, looking ahead to cross-language pipelines and
to runners that want to use the protos across SDK versions (Dataflow).

All the proposed incompatible clean-up tasks were done and made it into
2.21. What remains is documentation, cleaning up some things that can be
removed in a backwards-compatible way, and a general re-organization
within the files to delineate what is stable and what isn't.

Beyond documenting the versioning story (sketch below) in a more durable
location than this ML, performing these last clean-up tasks, and the
general re-organization within the files, is there anything else that
should be done before we can vote and consider the protos to be stable?
That would mean that 2.21 contains the first stable version, assuming no
other incompatible changes are suggested.

The versioning story has three parts and comes into play whenever there is
an incompatible change, such as:
* adding a new field that semantically changes what is to be done
* removing a field that was effectively required
* requiring an SDK or runner to behave differently (e.g. support large
iterables, support a new API (such as a future map state for StatefulDoFns))

The three ways of handling versioning for incompatible changes are:
* many protos have URNs; when there is an incompatible change, the URN
should be changed. If it is effectively the same thing, this should lead
to a version bump and an update of the documentation reflecting what the
requirements of the new version are.
* there is a capabilities section on each environment; this should
enumerate everything the SDK can support: protocols (e.g. large iterables,
...), coders, well-known transforms, ...
* there is a requirements section on the pipeline proto; this is an
enumeration of everything the SDK needs the runner to know to be able to
interpret the pipeline (e.g. splittable dofn, requires time sorted input,
...).
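
As a rough illustration of the first mechanism above (a version bump on a
URN when the thing is effectively the same), here is a minimal sketch. It
assumes URNs end in a ":v<N>" suffix; the helper name bump_urn_version and
the example URN are hypothetical and not part of any Beam API.

```python
import re

# Assumption for illustration only: a URN ends with ":v<N>", where N is an integer.
_URN_VERSION = re.compile(r"^(?P<base>.+):v(?P<version>\d+)$")


def bump_urn_version(urn: str) -> str:
    """Return the same URN with its trailing version number incremented by one."""
    match = _URN_VERSION.match(urn)
    if match is None:
        raise ValueError(f"URN has no trailing version: {urn!r}")
    next_version = int(match.group("version")) + 1
    return f"{match.group('base')}:v{next_version}"


# Hypothetical URN; an incompatible change to this coder would publish a new version.
new_urn = bump_urn_version("example:coder:my_coder:v1")
```

A consumer that only knows the old URN would then simply not recognize the
new one, rather than misinterpreting a changed payload.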

Updating the URN of the transform/coder is typically the easiest way to
handle incompatible changes, followed by using the capabilities list to
enable new things (used like an allowlist) and the requirements list to
prevent runners from doing things they shouldn't (used like a denylist).
Many features/APIs that are part of the initial version are deliberately
left out of both the capabilities and requirements lists to avoid a huge
definition list. If it is ever necessary, they can be disabled in the
future by adding requirements that disable these currently unnamed
features/APIs.
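
To make the allowlist/denylist distinction concrete, here is a minimal
sketch of how a runner might consult the two lists. The Environment and
Pipeline classes, the function names, and the URN strings are all
hypothetical stand-ins for the real protos. Treating an unrecognized
requirement as a reason to reject the pipeline is my reading of the
denylist behaviour, not something already decided.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Environment:
    # Everything the SDK in this environment says it can support.
    capabilities: FrozenSet[str] = frozenset()


@dataclass(frozen=True)
class Pipeline:
    # Everything the runner must understand to interpret this pipeline.
    requirements: FrozenSet[str] = frozenset()


def may_use(feature_urn: str, environment: Environment) -> bool:
    """Allowlist: a runner enables a new feature only if the SDK declared it."""
    return feature_urn in environment.capabilities


def unsupported_requirements(
    pipeline: Pipeline, runner_understands: FrozenSet[str]
) -> FrozenSet[str]:
    """Denylist: requirements the runner does not understand (empty means OK)."""
    return pipeline.requirements - runner_understands


# Hypothetical URNs, for illustration only.
env = Environment(frozenset({"example:protocol:large_iterables:v1"}))
pipeline = Pipeline(frozenset({"example:requirement:splittable_dofn:v1"}))
missing = unsupported_requirements(pipeline, frozenset())
```

Under this reading, a runner would refuse to run the pipeline when
`missing` is non-empty, and would fall back to the older behaviour whenever
`may_use` returns False.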

1:
https://lists.apache.org/thread.html/rdf247cfa3a509f80578f03b2454ea1e50474ee3576a059486d58fdf4%40%3Cdev.beam.apache.org%3E
