The issue seems to be that a PCollection can have a "tag" associated with
it and PTransform expansion can return an arbitrary nested dictionary/tuple
yet we need to figure out what the user wanted as the local name for the
PCollection from all this information.

Will this break people who rely on the generated PCollection output tags?
One big question is whether a composite transform cares about the name that
is used. For primitive transforms such as ParDo, this is very much a yes
because the pickled code likely references that name in some way. Some
composites could have the same need where the payload that is stored as
part of the composite references these local names and hence we have to
tell people how to instruct the SDK during transform expansion about what
name will be used unambiguously (as long as we document and have tests
around this we can choose from many options). Finally, in the XLang world,
we need to preserve the names that were provided to us and not change them;
which is more about making the Python SDK handle XLang transform expansion
carefully.

Am I missing edge cases?
Concatenation of strings leads to collisions if the delimiter character is
used within the tags or map keys. You could use an escaping encoding to
guarantee that the concatenation always generates unique names.

Some alternatives I thought about were:
* Don't allow arbitrary nestings returned during expansion, force composite
transforms to always provide an unambiguous name (either a tuple with
PCollections with unique tags or a dictionary with untagged PCollections or
a singular PCollection (Java and Go SDKs do this)).
* Have a "best" effort naming system (note the example I give can have many
of the "rules" re-ordered) e.g. if all the PCollection tags are unique then
use only them, followed by if a flat dictionary is returned then use only
the keys as names, followed by if a flat tuple is returned then use
indices, and finally fallback to the hierarchical naming scheme.


On Tue, Mar 24, 2020 at 1:07 PM Sam Rohde <sro...@google.com> wrote:

> Hi All,
>
> *Problem*
> I would like to discuss BEAM-9322
> <https://issues.apache.org/jira/projects/BEAM/issues/BEAM-9322> and the
> correct way to set the output tags of a transform with nested PCollections,
> e.g. a dict of PCollections, a tuple of dicts of PCollections. Before the
> fixing of BEAM-1833 <https://issues.apache.org/jira/browse/BEAM-1833>,
> the Python SDK when applying a PTransform would auto-generate the output
> tags for the output PCollections even if they are manually set by the user:
>
> class MyComposite(beam.PTransform):
>   def expand(self, pcoll):
>     a = PCollection.from_(pcoll)
>     a.tag = 'a'
>
>     b = PCollection.from_(pcoll)
>     b.tag = 'b'
>     return (a, b)
>
> would yield a PTransform with two output PCollection and output tags with
> 'None' and '0' instead of 'a' and 'b'. This was corrected for simple cases
> like this. However, this fails when the PCollections share the same output
> tag (of course). This can happen like so:
>
> class MyComposite(beam.PTransform):
>   def expand(self, pcoll):
>     partition_1 = beam.Partition(pcoll, ...)
>     partition_2 = beam.Partition(pcoll, ...)
>     return (partition_1[0], partition_2[0])
>
> With the new code, this leads to an error because both output PCollections
> have an output tag of '0'.
>
> *Proposal*
> When applying PTransforms to a pipeline (pipeline.py:550) we name the
> PCollections according to their position in the tree concatenated with the
> PCollection tag and a delimiter. From the first example, the output
> PCollections of the applied transform will be: '0.a' and '1.b' because it
> is a tuple of PCollections. In the second example, the outputs should be:
> '0.0' and '1.0'. In the case of a dict of PCollections, it should simply be
> the keys of the dict.
>
> What do you think? Am I missing edge cases? Will this be unexpected to
> users? Will this break people who rely on the generated PCollection output
> tags?
>
> Regards,
> Sam
>

Reply via email to