Abacn commented on issue #28219: URL: https://github.com/apache/beam/issues/28219#issuecomment-1701863781
The problem is that the downstream PCollection loses the pane info of the first GBK after a ReShuffle. The pipeline involved is: Trigger->GBK->ReShuffle->Downstream

The ReShuffle happens here: https://github.com/apache/beam/blob/3ff66d38c41fde475f71254d889ba46440904238/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java#L806

After discussion, we think the pane index behavior is either working as intended or undefined. While GBK will re-fire a pane, ReShuffle is implemented differently across runners. Current behavior is inconsistent among runners:

Is pane info preserved in Downstream for a pipeline with Trigger->GBK->ReShuffle->Downstream?

| runner | ReShuffle.of() | ReShuffle.viaRandomKey |
|----|----|----|
| Java Direct Runner | Yes | No |
| Java Dataflow Legacy Runner | Yes | Yes |
| Java Dataflow Runner V2 | No | No |

Opened #28272 for a fix, which adds randomness to the job id. The implication of randomness in the job id is that if the bundle gets retried, there will be duplicates.

Note that the Python SDK implementation does not consider pane info but also uses a random job id: https://github.com/apache/beam/blob/3ff66d38c41fde475f71254d889ba46440904238/sdks/python/apache_beam/io/gcp/bigquery_file_loads.py#L89
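To make the retry trade-off concrete, here is a minimal stdlib-only sketch. It does not use Beam or BigQuery: `FakeLoadService`, `deterministic_job_id`, `random_job_id`, and `run_bundle_with_retry` are hypothetical names for illustration. It assumes a load service that ignores a job id it has already accepted, and it assumes the deterministic id is derived from the pane index.

```python
import uuid


class FakeLoadService:
    """Hypothetical stand-in for a load-job API (an assumption, not Beam code).

    A job id that was already accepted is ignored on resubmission.
    """

    def __init__(self):
        self.jobs = {}   # job id -> rows loaded by that job
        self.rows = []   # every row that ended up in the "table"

    def submit(self, job_id, rows):
        if job_id in self.jobs:
            return False  # same id seen before: nothing is loaded again
        self.jobs[job_id] = list(rows)
        self.rows.extend(rows)
        return True


def deterministic_job_id(prefix, pane_index):
    # Assumption: the id depends only on stable inputs such as the pane index.
    return f"{prefix}_{pane_index:05d}"


def random_job_id(prefix, pane_index):
    # A random suffix keeps ids unique even if the pane index is lost,
    # but a retry of the same bundle now gets a different id.
    return f"{prefix}_{pane_index:05d}_{uuid.uuid4().hex}"


def run_bundle_with_retry(service, make_job_id, rows, pane_index=0):
    """Submit the same bundle twice to simulate one runner retry."""
    for _attempt in range(2):
        service.submit(make_job_id("load", pane_index), rows)
    return service.rows
```

With the deterministic id, the retry maps to the same job and the rows are loaded once. With the random id, each attempt creates a new job, so the retried bundle loads its rows twice.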
