ihji commented on pull request #12086:
URL: https://github.com/apache/beam/pull/12086#issuecomment-653377848
Here's a concrete example of a problem that this PR fixes:
```
Stage A:
(Input PCollection i - PTransform 1 - Output PCollection j)
Stage B:
(Input PCollection j - PTransform 2 - Output PCollection k)
Stage C:
(Input PCollection j for side input - PTransform 3)
```
We want `Stage A` to be identified as the stage that emits the side input
for `Stage C`. However, some synthetic PTransforms are inserted during the
pipeline optimization phase:
```
Stage A:
(Input PCollection i - PTransform 1 - Output PCollection j)
(Input PCollection j - Synthetic Write - Data Sink)
Stage B:
(Data source - Synthetic Read - Output PCollection j)
(Input PCollection j - PTransform 2 - Output PCollection k)
Stage C:
(Input PCollection j for side input - PTransform 3)
```
If we allow multiple assignments to `producing_stages_by_pcoll`, the entry
for PCollection j gets reassigned and `Stage B` will emit the side input for
`Stage C` (topologically, `Stage B` comes after `Stage A`). Since there is no
dependency between `Stage B` and `Stage C`, the pipeline will succeed when
`Stage B` happens to be executed first and fail when `Stage C` is executed
before `Stage B`.
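
To make the overwrite concrete, here is a minimal sketch of the two behaviours. It is not the runner's actual code: the `Stage` class, the `map_producers` helper and the `first_wins` flag are hypothetical names made up for illustration, and keeping the first producer is only one assumed way of disallowing reassignment. Only `producing_stages_by_pcoll` and the stage layout come from the example above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Stage:
    # Hypothetical, simplified stand-in for a fused stage: just a name and
    # the PCollections that appear as outputs of its transforms.
    name: str
    outputs: List[str] = field(default_factory=list)


def map_producers(stages: List[Stage], first_wins: bool) -> Dict[str, str]:
    """Map each PCollection to the stage recorded as its producer.

    `stages` is assumed to be in topological order.
    """
    producing_stages_by_pcoll: Dict[str, str] = {}
    for stage in stages:
        for pcoll in stage.outputs:
            if first_wins:
                # Keep the first producer; later assignments are ignored.
                producing_stages_by_pcoll.setdefault(pcoll, stage.name)
            else:
                # Multiple assignments allowed; the last stage seen wins.
                producing_stages_by_pcoll[pcoll] = stage.name
    return producing_stages_by_pcoll


# After optimization, the synthetic read in Stage B also lists j as an output.
stages = [
    Stage("Stage A", outputs=["j"]),
    Stage("Stage B", outputs=["j", "k"]),
    Stage("Stage C"),
]

overwritten = map_producers(stages, first_wins=False)
first_only = map_producers(stages, first_wins=True)

print(overwritten["j"])  # Stage B
print(first_only["j"])   # Stage A
```

With reassignment allowed, the side input for `Stage C` is attributed to `Stage B`, which is the order-dependent case described above. With the first assignment kept, it is attributed to `Stage A`.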