Re: [BEAM-9322] Python SDK discussion on correct output tag names

Sam Rohde Wed, 01 Apr 2020 13:48:17 -0700

To restate the original issue it is that the current method of setting the
output tags on PCollections from composites drops the tag information of
the returned PCollections. So an expand returning a dict will have its
outputs labeled as None, 1, ..., len(outputs). This is broken because
embedded payloads in composites won't be able to reference the outputs if
they differ from the returned value.


In this case, having the restriction of no nesting decreases technical
complexity substantially and always giving the tag unambiguously informs
the SDK what to name the outputs. We can also allow for arbitrary nesting,
we'll just have to figure out an unambiguous naming scheme for the
PCollections.



On Wed, Apr 1, 2020 at 12:44 PM Robert Bradshaw <rober...@google.com> wrote:

> I'm -1 on this, it adds additional restrictions and I don't see what this
> buys us. (In particular, it doesn't address the original issue.)
>
> On Wed, Apr 1, 2020 at 12:41 PM Sam Rohde <sro...@google.com> wrote:
>
>> So then how does the proposal sound?
>>
>> Writing again here:
>> PTransform.expand: (...) -> Union[PValue, NamedTuple[str, PCollection],
>> Tuple[str, PCollection], Dict[str, PCollection], DoOutputsTuple]
>>
>> i.e. no arbitrary nesting when outputting from an expand
>>
>> On Tue, Mar 31, 2020 at 5:15 PM Robert Bradshaw <rober...@google.com>
>> wrote:
>>
>>> On Tue, Mar 31, 2020 at 4:13 PM Luke Cwik <lc...@google.com> wrote:
>>> >
>>> > It is important that composites know how things are named so that any
>>> embedded payloads in the composite PTransform can reference the outputs
>>> appropriately.
>>>
>>> Very good point. This is part of the cleanup to treat inputs and
>>> outputs of PCollections as maps rather than lists generally across the
>>> Python representations (which also relates to some of the ugliness
>>> that Cham has been running into with cross-language).
>>>
>>> > On Tue, Mar 31, 2020 at 2:51 PM Robert Bradshaw <rober...@google.com>
>>> wrote:
>>> >>
>>> >> On Tue, Mar 31, 2020 at 1:13 PM Sam Rohde <sro...@google.com> wrote:
>>> >> >>>
>>> >> >>> * Don't allow arbitrary nestings returned during expansion, force
>>> composite transforms to always provide an unambiguous name (either a tuple
>>> with PCollections with unique tags or a dictionary with untagged
>>> PCollections or a singular PCollection (Java and Go SDKs do this)).
>>> >> >>
>>> >> >> I believe that aligning with Java and Go would be the right way to
>>> go here. I don't know if this would limit expressiveness.
>>> >> >
>>> >> > Yeah this sounds like a much more elegant way of handling this
>>> situation. I would lean towards this limiting expressiveness because there
>>> would be a limit to nesting, but I think that the trade-off with reducing
>>> complexity is worth it.
>>> >> >
>>> >> > So in summary it could be:
>>> >> > PTransform.expand: (...) -> Union[PValue, NamedTuple[str,
>>> PCollection], Tuple[str, PCollection], Dict[str, PCollection],
>>> DoOutputsTuple]
>>> >> >
>>> >> > With the expectation that (pseudo-code):
>>> >> > a_transform = ATransform()
>>> >> >
>>> ATransform.from_runner_api(a_transform.to_runner_api()).outputs.keys() ==
>>> a_transform.outputs.keys()
>>> >> >
>>> >> > Since this changes the Python SDK composite transform API, what
>>> would be the next steps for the community to come to a consensus on this?
>>> >>
>>> >> It seems here we're conflating the nesting of PValue results with the
>>> >> nesting of composite operations.
>>> >>
>>> >> Both examples in the original post have PTransform nesting (a
>>> >> composite) returning a flat tuple. This is completely orthogonal to
>>> >> the idea of a PTransform returning a nested result (such as (pc1,
>>> >> (pc2, pc3))) and forbidding the latter won't solve the former.
>>> >>
>>> >> Currently, with the exception of explicit names given for multi-output
>>> >> ParDos, we simply label the outputs sequentially with 0, 1, 2, 3, ...
>>> >> (Actually, for historical reasons, it's None, 1, 2, 3, ...), no matter
>>> >> the nesting. We could do better, e.g. for the example above, label
>>> >> them "0", "1.0", "1.1", or use the keys in the returned dict, but this
>>> >> is separate from the idea of trying to relate the output tags of
>>> >> composites to the output tags of their inner transforms.
>>> >>
>>> >> - Robert
>>>
>>

Re: [BEAM-9322] Python SDK discussion on correct output tag names

Reply via email to