One thing that doesn't appear to have been suggested yet is that we could
"batch" urns together under a "super urn", so that adding one super urn is
like adding each of the features in the batch it represents. This avoids
needing to send dozens of urns over individually.


The super urns would need to be static after definition to avoid mismatched
definitions down the road.

We would collect together the urns that are reasonably considered "vX"
support, and could then increment that version later.

This would simplify new SDKs, as they could target initial v1 support, with
that version defining the expected level of feature support, and it wouldn't
prevent new capabilities from being added incrementally.

Similarly, certain feature sets could stand alone, e.g. around SQL. It's
beneficial for optimization reasons if an SDK has native projection and UDF
support, for example, which a runner could take advantage of by avoiding
extra cross-language hops. These could then also be grouped under a SQL
super urn.
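
To make the batching concrete, here is a minimal sketch of how a runner (or
SDK) might expand super urns into the individual capability urns they stand
for; all of the urn strings below are made up for illustration and aren't
defined anywhere in the model today.

# Purely illustrative: none of these urns are real, and the grouping is only
# an example of how "vX" and SQL batches could be defined.
SUPER_URNS = {
    "beam:super_urn:sdk_base:v1": frozenset([
        "beam:capability:progress_reporting:v1",
        "beam:capability:state_backed_iterables:v1",
        "beam:capability:multi_core_bundle_processing:v1",
    ]),
    "beam:super_urn:sql:v1": frozenset([
        "beam:capability:sql_projection_pushdown:v1",
        "beam:capability:sql_udf:v1",
    ]),
}

def expand_capabilities(declared_urns):
    """Treats each declared super urn as if its whole batch had been declared."""
    expanded = set()
    for urn in declared_urns:
        expanded.update(SUPER_URNS.get(urn, {urn}))
    return expanded

# An SDK declaring just two super urns is understood as supporting all five
# individual capabilities.
assert "beam:capability:sql_udf:v1" in expand_capabilities(
    ["beam:super_urn:sdk_base:v1", "beam:super_urn:sql:v1"])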

This is from the SDK capability side of course, rather than the SDK
pipeline requirements side.

-------
Related to that last point, it might be good to nail down early the
perspective used when discussing these things, as there's a duality between
"what an SDK can do" and "what the runner will do to a pipeline that the
SDK can understand" (e.g. combiner lifting and state-backed iterables), as
well as between "what the pipeline requires from the runner" and "what the
runner is able to do" (e.g. requires sorted input).
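
A tiny sketch of those two directions as runner-side checks; the function and
parameter names here are purely illustrative, not anything defined in the
model:

def validate_requirements(pipeline_requirement_urns, runner_understood_urns):
    """Requirements are MUST-understand: reject the pipeline otherwise."""
    unknown = set(pipeline_requirement_urns) - set(runner_understood_urns)
    if unknown:
        raise ValueError(
            "Cannot execute pipeline; unsupported requirements: %s" % sorted(unknown))

def negotiated_capabilities(sdk_capability_urns, runner_capability_urns):
    """Capabilities are MAY-use: only enable what both sides understand."""
    return set(sdk_capability_urns) & set(runner_capability_urns)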


On Thu, Feb 13, 2020, 9:06 AM Luke Cwik <lc...@google.com> wrote:

>
>
> On Wed, Feb 12, 2020 at 2:24 PM Kenneth Knowles <k...@apache.org> wrote:
>
>>
>>
>> On Wed, Feb 12, 2020 at 12:04 PM Robert Bradshaw <rober...@google.com>
>> wrote:
>>
>>> On Wed, Feb 12, 2020 at 11:08 AM Luke Cwik <lc...@google.com> wrote:
>>> >
>>> > We can always detect on the runner/SDK side whether there is an
>>> unknown field[1] within a payload and fail to process it but this is
>>> painful in two situations:
>>> > 1) It doesn't provide for a good error message since you can't say
>>> what the purpose of the field is. With a capability URN, the runner/SDK
>>> could say which URN it doesn't understand.
>>> > 2) It doesn't allow for the addition of fields which don't impact
>>> semantics of execution. For example, if the display data feature was being
>>> developed, a runner could ignore it and still execute the pipeline
>>> correctly.
>>>
>>> Yeah, I don't think proto reflection is a flexible enough tool to do
>>> this well either.
>>>
>>> > If we think this to be common enough, we can add capabilities list to
>>> the PTransform so each PTransform can do this and has a natural way of
>>> being extended for additions which are forwards compatible. The alternative
>>> to having capabilities on PTransform (and other constructs) is that we
>>> would have a new URN when the specification of the transform changes. For
>>> forwards compatible changes, each SDK/runner would map older versions of
>>> the URN onto the latest and internally treat it as the latest version but
>>> always downgrade it to the version the other party expects when
>>> communicating with it. Backwards incompatible changes would always require
>>> a new URN which capabilities at the PTransform level would not help with.
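
For illustration, the "map old urns to the latest internally, downgrade on the
wire" approach could look roughly like this; the v2 urn and the function names
are invented for the example:

# Hypothetical example of forward-compatible urn versioning: internally treat
# any older urn as the newest version we know, but downgrade when
# communicating with a party that only advertises an older one.
UPGRADE = {
    "beam:transform:pardo:v1": "beam:transform:pardo:v2",  # v2 is made up
}

def normalize(urn):
    """Maps older urn versions onto the latest for internal handling."""
    return UPGRADE.get(urn, urn)

def urn_for_peer(urn, peer_understands):
    """Sends the newest form the peer understands, otherwise the older form."""
    latest = normalize(urn)
    if latest in peer_understands:
        return latest
    older = [old for old, new in UPGRADE.items() if new == latest]
    return older[0] if older else latest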
>>>
>>> As you point out, stateful+splittable may not be a particularly useful
>>> combination, but as another example, we have
>>> (backwards-incompatible-when-introduced) markers on DoFn as to whether
>>> it requires finalization, stable inputs, and now time sorting. I don't
>>> think we should have a new URN for each combination.
>>>
>>
>> Agree with this. I don't think stateful, splittable, and "plain" ParDo
>> are comparable to these. Each is an entirely different computational
>> paradigm: per-element independent processing, per-key-and-window linear
>> processing, and per-element-and-restriction splittable processing. Most
>> relevant IMO is the nature of the parallelism. If you added state to
>> splittable processing, it would still be splittable processing. Just as
>> Combine and ParDo can share the SideInput specification, it is easy to
>> share relevant sub-structures like state declarations. But it is a fair
>> point that the ability to split can be ignored and run as a plain-old
>> ParDo. It brings up the question of whether a runner that doesn't know SDFs
>> should have to reject it or should be allowed to run it poorly.
>>
>
> Being splittable means that the SDK could choose to return a continuation
> saying "please process the rest of my element in X amount of time", which
> would require the runner to inspect certain fields on responses. One
> example would be "I don't have many more messages to read from this message
> stream at the moment", and another could be "I detected that this
> filesystem is throttling me or is down and I would like to resume
> processing later."
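
A rough sketch of the shape of such a continuation; this is not the actual
Beam SDF API, and the source methods used here are hypothetical:

from dataclasses import dataclass

@dataclass
class ProcessContinuation:
    should_resume: bool
    resume_delay_seconds: float = 0.0

def process_element(source, output):
    """Reads what is available now and tells the runner whether/when to resume."""
    for record in source.read_available():
        output.append(record)
    if source.is_throttled_or_down():
        # e.g. the filesystem is throttling us or is down: resume much later.
        return ProcessContinuation(should_resume=True, resume_delay_seconds=300.0)
    if not source.exhausted():
        # e.g. a message stream with little to read right now: poll again soon.
        return ProcessContinuation(should_resume=True, resume_delay_seconds=30.0)
    return ProcessContinuation(should_resume=False)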
>
>
>> It isn't a huge deal. Three different top-level URNS versus three
>> different sub-URNs will achieve the same result in the end if we get this
>> "capability" thing in place.
>>
>> Kenn
>>
>>
>>>
>>> >> > I do think that splittable ParDo and stateful ParDo should have
>>> separate PTransform URNs since they are different paradigms than "vanilla"
>>> ParDo.
>>> >>
>>> >> Here I disagree. What about one that is both splittable and stateful?
>>> Would one have a fourth URN for that? If/when another flavor of DoFn comes
>>> out, would we then want 8 distinct URNs? (SplittableParDo in particular can
>>> be executed as a normal ParDo as long as the output is bounded.)
>>> >
>>> > I agree that you could have stateful and splittable dofns where the
>>> element is the key and you share state and timers across restrictions. No
>>> runner is capable of executing this efficiently.
>>> >
>>> >> >> > On the SDK requirements side: the constructing SDK owns the
>>> Environment proto completely, so it is in a position to ensure the involved
>>> docker images support the necessary features.
>>> >> >>
>>> >> >> Yes.
>>> >
>>> >
>>> > I believe capabilities do exist on a Pipeline, and they inform runners
>>> about new types of fields to be aware of, either within Components or on the
>>> Pipeline object itself, but for this discussion it makes sense that an
>>> environment would store most "capabilities" related to execution.
>>> >
>>> >> [snip]
>>> >
>>> > As for the proto clean-ups, the scope is to cover almost all things
>>> needed for execution now and to follow up with optional transforms,
>>> payloads, and coders later, which would exclude job management APIs and
>>> artifact staging. A formal enumeration would be useful here. Also, we
>>> should provide formal guidance about adding new fields, adding new types of
>>> transforms, new types of proto messages, ... (best to describe this on a
>>> case-by-case basis as to how people are trying to modify the protos and
>>> evolve this guidance over time).
>>>
>>> What we need is the ability for (1) runners to reject future pipelines
>>> they cannot faithfully execute and (2) runners to be able to take
>>> advantage of advanced features/protocols when interacting with those
>>> SDKs that understand them while avoiding them for older (or newer)
>>> SDKs that don't. Let's call (1) (hard) requirements and (2) (optional)
>>> capabilities.
>>>
>>> Where possible, I think this is best expressed inherently in the set
>>> of transform (and possibly other component) URNs. For example, when an
>>> SDK uses a combine_per_key composite, that's a signal that it
>>> understands the various related combine_* transforms. Similarly, a
>>> pipeline with a test_stream URN would be rejected by runners not
>>> recognizing/supporting this primitive. However, this is not always
>>> possible, e.g. for (1) we have the aforementioned boolean flags on
>>> ParDo and for (2) we have features like large iterable and progress
>>> support.
>>>
>>> For (1) we have to enumerate now everywhere a runner must look, as far
>>> into the future as we want to remain backwards compatible. This is why
>>> I suggested putting something on the pipeline itself, but we could
>>> (likely in addition) add it to Transform and/or ParDoPayload if we
>>> think that'd be useful now. (Note that a future pipeline-level
>>> requirement could be "inspect (previously non-existent) requirements
>>> field attached to objects of type X.")
>>>
>>> For (2) I think adding a capabilities field to the environment for now
>>> makes the most sense, and as it's optional to inspect them adding it
>>> elsewhere if needed is backwards compatible. (The motivation to do it
>>> now is that there are some capabilities that we'd like to enumerate
>>> now rather than make part of the minimal set of things an SDK must
>>> support.)
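
Concretely, that placement might look something like the following mock-up,
with dataclasses standing in for the protos; these are not the actual
beam_runner_api.proto definitions:

from dataclasses import dataclass, field
from typing import List

@dataclass
class Environment:
    urn: str = ""
    payload: bytes = b""
    # (2) Optional capabilities: protocols the SDK in this environment MAY use,
    # e.g. progress reporting or large-iterable support.
    capabilities: List[str] = field(default_factory=list)

@dataclass
class Pipeline:
    # (1) Hard requirements: urns a runner MUST understand to execute at all,
    # including "inspect the (previously non-existent) requirements field on X".
    requirements: List[str] = field(default_factory=list)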
>>>
>>>
> Agree on the separation of requirements from capabilities where
> requirements is a set of MUST understand while capabilities are a set of
> MAY understand.
>
>
>> > All in all, I think "capabilities" is about informing a runner about
>>> what it should know and what it is allowed to do. If we go with
>>> a list of "capabilities", we could always add a "parameterized
>>> capabilities" urn which would tell runners they need to also look at some
>>> other field.
>>>
>>> Good point. That lets us keep it as a list for now. (The risk is that
>>> it makes possible the bug of populating parameters without adding the
>>> required notification to the list.)
>>>
>>> > I also believe capabilities should NOT be "inherited". For example if
>>> we define capabilities on a ParDoPayload, and on a PTransform and on
>>> Environment, then ParDoPayload capabilities shouldn't be copied to
>>> PTransform and PTransform specific capabilities shouldn't be copied to the
>>> Environment. My reasoning about this is that some "capabilities" can only
>>> be scoped to a single ParDoPayload or a single PTransform and wouldn't
>>> apply generally everywhere. The best example I could think of is that
>>> Environment A supports progress reporting while Environment B doesn't so it
>>> wouldn't have made sense to say the "Pipeline" supports progress reporting.
>>> >
>>> > Are capabilities strictly different from "resources" (transform needs
>>> python package X) or "execution hints" (e.g. deploy on machines that have
>>> GPUs, some generic but mostly runner specific hints)? At first glance I
>>> would say yes.
>>>
>>> Agreed.
>>>
>>
