Sam Rohde created BEAM-9322:
-------------------------------

             Summary: Python SDK ignores manually set PCollection tags
                 Key: BEAM-9322
                 URL: https://issues.apache.org/jira/browse/BEAM-9322
             Project: Beam
          Issue Type: Bug
          Components: sdk-py-core
            Reporter: Sam Rohde
            Assignee: Sam Rohde


The Python SDK currently ignores tags that users set manually on PCollections. 
This happens when a PTransform is applied and each PCollection is added to the 
PTransform 
[outputs|https://github.com/apache/beam/blob/688a4ea53f315ec2aa2d37602fd78496fca8bb4f/sdks/python/apache_beam/pipeline.py#L595].
 In the 
[add_output|https://github.com/apache/beam/blob/688a4ea53f315ec2aa2d37602fd78496fca8bb4f/sdks/python/apache_beam/pipeline.py#L872]
 method, the tag is set to None for all PValues, so the output tags are 
set to an enumeration index over the PCollection outputs. The user-set tags are 
not propagated, which is a problem for any code that relies on the output 
PCollection tags matching the user-set values.
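
The behavior described above can be illustrated with a small stand-alone model. 
This is only a sketch and not Beam's actual code: ToyPCollection, 
ToyAppliedTransform, and the two register_* helpers are hypothetical names, and 
the fallback to an enumeration index is simplified.

```python
class ToyPCollection:
    """Hypothetical stand-in for a PCollection carrying a user-set tag."""

    def __init__(self, tag=None):
        self.tag = tag


class ToyAppliedTransform:
    """Hypothetical stand-in for an applied PTransform with an outputs dict."""

    def __init__(self):
        self.outputs = {}

    def add_output(self, output, tag=None):
        # Simplified assumption: with no tag, fall back to an enumeration index.
        if tag is None:
            tag = len(self.outputs)
        assert tag not in self.outputs
        self.outputs[tag] = output


def register_ignoring_tags(transform, pcolls):
    # Models the reported bug: the user-set tag is never passed in.
    for pcoll in pcolls:
        transform.add_output(pcoll)


def register_with_tags(transform, pcolls):
    # Models the proposed fix: always pass the PCollection's own tag.
    for pcoll in pcolls:
        transform.add_output(pcoll, tag=pcoll.tag)


pcolls = [ToyPCollection("evens"), ToyPCollection("odds")]

buggy = ToyAppliedTransform()
register_ignoring_tags(buggy, pcolls)

fixed = ToyAppliedTransform()
register_with_tags(fixed, pcolls)
```

In this model, the keys of buggy.outputs are the indices 0 and 1, while the keys 
of fixed.outputs are the user-set tags "evens" and "odds".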

The fix is to resolve BEAM-1833 and always pass in the tags. However, that 
alone doesn't fix the problem for nested PCollections. If you have a dict of 
lists of PCollections, what should their tags be? To address this, the plan is 
to first propagate the correct tags, then discuss the best auto-generated tags 
with the community.
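
To make the nested case concrete, here is a sketch of one possible way to 
derive tags from a nested structure. The path-based naming scheme and the 
leaf_paths helper are purely hypothetical and are not a decision or an existing 
Beam API; strings stand in for PCollections.

```python
def leaf_paths(value, path=()):
    """Yield (path, leaf) pairs for every leaf in nested dicts and lists."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from leaf_paths(child, path + (key,))
    elif isinstance(value, (list, tuple)):
        for index, child in enumerate(value):
            yield from leaf_paths(child, path + (index,))
    else:
        yield path, value


# A dict of lists of (stand-in) PCollections, as in the question above.
nested = {"a": ["pc1", "pc2"], "b": ["pc3"]}

# Hypothetical scheme: join the path components with dots to form a tag.
candidate_tags = {
    ".".join(str(part) for part in path): leaf
    for path, leaf in leaf_paths(nested)
}
```

Under this assumed scheme the three leaves would be tagged "a.0", "a.1", and 
"b.0". Other schemes are equally plausible, which is why the choice is left to 
discussion.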

Some users may rely on the old implementation, so a flag named 
"force_generated_pcollection_output_ids" will be added, defaulting to False. If 
True, the SDK will fall back to the old implementation and generate tags for 
PCollections.
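
The intended effect of the flag can be sketched as follows. This is a toy 
model, not the SDK's implementation: ToyPCollection and output_tags are 
hypothetical names, and falling back to the index when no tag was set is an 
assumption.

```python
class ToyPCollection:
    """Hypothetical stand-in for a PCollection carrying a user-set tag."""

    def __init__(self, tag=None):
        self.tag = tag


def output_tags(pcolls, force_generated_pcollection_output_ids=False):
    """Return the tag each output would be registered under."""
    tags = []
    for index, pcoll in enumerate(pcolls):
        if force_generated_pcollection_output_ids or pcoll.tag is None:
            # Old behavior (or no user tag, by assumption): generated index.
            tags.append(index)
        else:
            # New default behavior: keep the user-set tag.
            tags.append(pcoll.tag)
    return tags


pcolls = [ToyPCollection("evens"), ToyPCollection("odds")]
default_tags = output_tags(pcolls)
legacy_tags = output_tags(pcolls, force_generated_pcollection_output_ids=True)
```

With the default of False, the sketch keeps "evens" and "odds"; with the flag 
set to True, it returns the generated indices 0 and 1.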



