[ https://issues.apache.org/jira/browse/BEAM-9322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Work on BEAM-9322 started by null.
----------------------------------
> Python SDK ignores manually set PCollection tags
> ------------------------------------------------
>
> Key: BEAM-9322
> URL: https://issues.apache.org/jira/browse/BEAM-9322
> Project: Beam
> Issue Type: Bug
> Components: sdk-py-core
> Reporter: Sam Rohde
> Priority: P3
> Time Spent: 10h 40m
> Remaining Estimate: 0h
>
> The Python SDK currently ignores any tags set manually on PCollections when
> a PTransform is applied and its output PCollections are added to the PTransform
> [outputs|https://github.com/apache/beam/blob/688a4ea53f315ec2aa2d37602fd78496fca8bb4f/sdks/python/apache_beam/pipeline.py#L595].
> In the
> [add_output|https://github.com/apache/beam/blob/688a4ea53f315ec2aa2d37602fd78496fca8bb4f/sdks/python/apache_beam/pipeline.py#L872]
> method, the tag is set to None for all PValues, so the output tags end up as
> an enumeration index over the PCollection outputs. The user-set tags are not
> propagated, which is a problem for anyone relying on the output
> PCollection tags to match the values they set.
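
A minimal sketch of the behavior described above, using plain Python stand-ins rather than the real apache_beam classes. The names ToyPCollection, outputs_ignoring_tags, and outputs_keeping_tags are hypothetical, and falling back to the index when no tag was set is an assumption made for illustration only.

```python
# Toy model only: plain Python stand-ins, not the apache_beam classes.
class ToyPCollection:
    def __init__(self, name, tag=None):
        self.name = name
        self.tag = tag  # tag set manually by the user, if any


def outputs_ignoring_tags(pcolls):
    """Reported behavior: user tags are dropped, keys are enumeration indexes."""
    return {index: pcoll.name for index, pcoll in enumerate(pcolls)}


def outputs_keeping_tags(pcolls):
    """Desired behavior: keep a user-set tag (index fallback is an assumption)."""
    outputs = {}
    for index, pcoll in enumerate(pcolls):
        tag = pcoll.tag if pcoll.tag is not None else index
        if tag in outputs:
            raise ValueError(f"duplicate output tag: {tag!r}")
        outputs[tag] = pcoll.name
    return outputs


pcolls = [ToyPCollection("evens", tag="even"), ToyPCollection("odds", tag="odd")]
ignored = outputs_ignoring_tags(pcolls)
kept = outputs_keeping_tags(pcolls)
print(ignored)  # {0: 'evens', 1: 'odds'}
print(kept)     # {'even': 'evens', 'odd': 'odds'}
```

Code that looks up an output by "even" or "odd" only works against the second mapping, which is the mismatch the description points at.
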
> The fix is to resolve BEAM-1833 and always pass in the tags. However, that
> doesn't fix the problem for nested PCollections. If you have a dict of lists
> of PCollections, what should their tags be? To address
> this, first propagate the correct tags, then discuss the
> best auto-generated tags with the community.
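
To make the open question concrete, here is one candidate scheme for nested outputs, sketched in plain Python. It is purely hypothetical and not something the issue proposes or Beam implements: each tag is the dotted path to the leaf, and the function name flatten_with_path_tags is invented for this example.

```python
def flatten_with_path_tags(value, path=()):
    """Yield (tag, leaf) pairs; each tag is the dotted path to its leaf."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from flatten_with_path_tags(child, path + (str(key),))
    elif isinstance(value, (list, tuple)):
        for index, child in enumerate(value):
            yield from flatten_with_path_tags(child, path + (str(index),))
    else:
        # Anything that is not a dict, list, or tuple is treated as a leaf.
        yield ".".join(path), value


# Strings stand in for PCollections in this toy example.
nested = {"a": ["pc0", "pc1"], "b": ["pc2"]}
path_tags = dict(flatten_with_path_tags(nested))
print(path_tags)  # {'a.0': 'pc0', 'a.1': 'pc1', 'b.0': 'pc2'}
```

Other schemes, such as a flat running index, are equally possible, which is why the description defers the choice to a community discussion.
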
> Some users may rely on the old implementation, so a flag,
> "force_generated_pcollection_output_ids", will be created and set to False
> by default. If True, it restores the old implementation and generates tags
> for PCollections.
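
The flag's intended effect could look roughly like the sketch below. Only the flag name and its False default come from the description. Parsing it with a bare argparse parser and the helper names parse_force_flag and choose_tags are assumptions for illustration, since the description does not say where the flag is registered in the SDK.

```python
import argparse


def parse_force_flag(argv):
    """Read the flag from command-line style arguments; it defaults to False."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--force_generated_pcollection_output_ids",
        action="store_true",
        default=False,
    )
    # Unknown arguments are ignored so other options can pass through.
    known, _ = parser.parse_known_args(argv)
    return known.force_generated_pcollection_output_ids


def choose_tags(user_tags, force_generated):
    """Old behavior (generated indexes) when forced, else the user-set tags."""
    if force_generated:
        return list(range(len(user_tags)))
    return list(user_tags)


user_tags = ["even", "odd"]
default_tags = choose_tags(user_tags, parse_force_flag([]))
forced_tags = choose_tags(
    user_tags, parse_force_flag(["--force_generated_pcollection_output_ids"])
)
print(default_tags)  # ['even', 'odd']
print(forced_tags)   # [0, 1]
```
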