On Thu, Mar 26, 2020 at 10:13 AM Luke Cwik <lc...@google.com> wrote: > The issue seems to be that a PCollection can have a "tag" associated with > it and PTransform expansion can return an arbitrary nested dictionary/tuple > yet we need to figure out what the user wanted as the local name for the > PCollection from all this information. > > Will this break people who rely on the generated PCollection output tags? > One big question is whether a composite transform cares about the name > that is used. For primitive transforms such as ParDo, this is very much a > yes because the pickled code likely references that name in some way. Some > composites could have the same need where the payload that is stored as > part of the composite references these local names and hence we have to > tell people how to instruct the SDK during transform expansion about what > name will be used unambiguously (as long as we document and have tests > around this we can choose from many options). Finally, in the XLang world, > we need to preserve the names that were provided to us and not change them; > which is more about making the Python SDK handle XLang transform expansion > carefully. > > Am I missing edge cases? > Concatenation of strings leads to collisions if the delimiter character is > used within the tags or map keys. You could use an escaping encoding to > guarantee that the concatenation always generates unique names. > > Some alternatives I thought about were: > * Don't allow arbitrary nestings returned during expansion, force > composite transforms to always provide an unambiguous name (either a tuple > with PCollections with unique tags or a dictionary with untagged > PCollections or a singular PCollection (Java and Go SDKs do this)). >
I believe that aligning with Java and Go would be the right way to go here. I don't know if this would limit expressiveness. > * Have a "best" effort naming system (note the example I give can have > many of the "rules" re-ordered) e.g. if all the PCollection tags are unique > then use only them, followed by if a flat dictionary is returned then use > only the keys as names, followed by if a flat tuple is returned then use > indices, and finally fallback to the hierarchical naming scheme. > > > On Tue, Mar 24, 2020 at 1:07 PM Sam Rohde <sro...@google.com> wrote: > >> Hi All, >> >> *Problem* >> I would like to discuss BEAM-9322 >> <https://issues.apache.org/jira/projects/BEAM/issues/BEAM-9322> and the >> correct way to set the output tags of a transform with nested PCollections, >> e.g. a dict of PCollections, a tuple of dicts of PCollections. Before the >> fixing of BEAM-1833 <https://issues.apache.org/jira/browse/BEAM-1833>, >> the Python SDK when applying a PTransform would auto-generate the output >> tags for the output PCollections even if they are manually set by the user: >> >> class MyComposite(beam.PTransform): >> def expand(self, pcoll): >> a = PCollection.from_(pcoll) >> a.tag = 'a' >> >> b = PCollection.from_(pcoll) >> b.tag = 'b' >> return (a, b) >> >> would yield a PTransform with two output PCollection and output tags with >> 'None' and '0' instead of 'a' and 'b'. This was corrected for simple cases >> like this. However, this fails when the PCollections share the same output >> tag (of course). This can happen like so: >> >> class MyComposite(beam.PTransform): >> def expand(self, pcoll): >> partition_1 = beam.Partition(pcoll, ...) >> partition_2 = beam.Partition(pcoll, ...) >> return (partition_1[0], partition_2[0]) >> >> With the new code, this leads to an error because both output >> PCollections have an output tag of '0'. >> >> *Proposal* >> When applying PTransforms to a pipeline (pipeline.py:550) we name the >> PCollections according to their position in the tree concatenated with the >> PCollection tag and a delimiter. From the first example, the output >> PCollections of the applied transform will be: '0.a' and '1.b' because it >> is a tuple of PCollections. In the second example, the outputs should be: >> '0.0' and '1.0'. In the case of a dict of PCollections, it should simply be >> the keys of the dict. >> >> What do you think? Am I missing edge cases? Will this be unexpected to >> users? Will this break people who rely on the generated PCollection output >> tags? >> >> Regards, >> Sam >> >
smime.p7s
Description: S/MIME Cryptographic Signature