[GitHub] [beam] damccorm opened a new issue, #20065: Python SDK ignores manually set PCollection tags

GitBox Sat, 04 Jun 2022 08:22:36 -0700


damccorm opened a new issue, #20065:
URL: https://github.com/apache/beam/issues/20065

The Python SDK currently ignores tags that are set manually on PCollections.
When a PTransform is applied, each resulting PCollection is added to the
PTransform's
[outputs](https://github.com/apache/beam/blob/688a4ea53f315ec2aa2d37602fd78496fca8bb4f/sdks/python/apache_beam/pipeline.py#L595).
In the
[add_output](https://github.com/apache/beam/blob/688a4ea53f315ec2aa2d37602fd78496fca8bb4f/sdks/python/apache_beam/pipeline.py#L872)
method, the tag is set to None for all PValues, so the output tags become an
enumeration index over the PCollection outputs. The user-set tags are not
propagated, which is a problem for any code that relies on the output
PCollection tags matching the user-set values.
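
The behavior described above can be illustrated with a toy model. This is only a sketch in plain Python: `ToyPCollection`, `register_outputs_ignoring_tags`, and `register_outputs_with_tags` are hypothetical names invented for illustration and are not Beam's classes or the actual `add_output` code.

```python
# Toy model only: these names are hypothetical and are not part of Beam.
class ToyPCollection:
    def __init__(self, name, tag=None):
        self.name = name
        self.tag = tag  # tag set manually by the user


def register_outputs_ignoring_tags(outputs):
    """Mimics the reported behavior: every output is keyed by its position."""
    registered = {}
    for pcoll in outputs:
        tag = None  # the user-set tag is discarded here
        if tag is None:
            tag = len(registered)  # fall back to an enumeration index
        registered[tag] = pcoll
    return registered


def register_outputs_with_tags(outputs):
    """Mimics the proposed behavior: a user-set tag is kept when present."""
    registered = {}
    for pcoll in outputs:
        tag = pcoll.tag if pcoll.tag is not None else len(registered)
        registered[tag] = pcoll
    return registered


outputs = [ToyPCollection("evens", tag="even"), ToyPCollection("odds", tag="odd")]
print(list(register_outputs_ignoring_tags(outputs)))
print(list(register_outputs_with_tags(outputs)))
```

In the first function the keys are positional indices even though both outputs carry a tag; in the second the user-set tags survive as keys.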

The fix is to resolve BEAM-1833 and always pass in the tags. However, that
does not fix the problem for nested PCollections: if you have a dict of lists
of PCollections, what should their tags be? To address this, first propagate
the correct tags, then discuss the best auto-generated tags with the
community.

Some users may rely on the old implementation, so a flag will be created,
"force_generated_pcollection_output_ids", which defaults to False. If set to
True, the SDK falls back to the old implementation and generates tags for
PCollections.
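
As a rough sketch of how such a flag could select between the two behaviors, the plain-Python function below takes the user-set tags (with None meaning "unset") and returns the tag that would be recorded for each output. Only the flag name comes from this issue; the function `output_tags`, its signature, and the use of integer indices as generated tags are assumptions made for illustration, not Beam's implementation.

```python
def output_tags(user_tags, force_generated_pcollection_output_ids=False):
    """Return the tag recorded for each output (hypothetical helper).

    user_tags: list of manually set tags, with None where no tag was set.
    """
    if force_generated_pcollection_output_ids:
        # Old behavior: ignore user-set tags and generate an index per output.
        return list(range(len(user_tags)))
    # New behavior: keep a user-set tag, generate an index only when unset.
    return [
        tag if tag is not None else index
        for index, tag in enumerate(user_tags)
    ]


print(output_tags(["even", None, "odd"]))
print(output_tags(["even", None, "odd"], force_generated_pcollection_output_ids=True))
```

With the default of False, manually set tags are preserved; passing True reproduces the generated tags that existing users may depend on.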

Imported from Jira
[BEAM-9322](https://issues.apache.org/jira/browse/BEAM-9322). Original Jira may
contain additional context.
Reported by: rohdesam.

