Abacn commented on issue #28219:
URL: https://github.com/apache/beam/issues/28219#issuecomment-1701863781

   The problem is Downstream PCollection lost the pane info of first GBK after 
a ReShuffle. The pipeline involved is: Trigger->GBK->ReShuffle->Downstream
   
   ReShuffle happens here: 
https://github.com/apache/beam/blob/3ff66d38c41fde475f71254d889ba46440904238/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java#L806
   
   After discussion we think the pane index is either work as intended or 
undefined behavior. While GBK will re-fire a pane, ReShuffle is implemented 
differently in runners. Current behavior is inconsistent among runner:
   
   Is pane info preserved in Downstream for a pipeline with 
Trigger->GBK->ReShuffle->Downstream?
   
   | runner | ReShuffle.of() | ReShuffle.viaRandomKey |
   |----|----|----|
   | Java Direct Runner | Yes | No |
   | Java Dataflow Legacy Runner | Yes | Yes |
   | Java Dataflow Runner V2 | No | No |
   
   Opened #28272 for a fix. Added randomness to job id.
   
   The implication of randomness in job id is that if the bundle gets retried 
it will have duplicates.
   
   Note that Python SDK implementation did not consider pane info but also uses 
random job id: 
https://github.com/apache/beam/blob/3ff66d38c41fde475f71254d889ba46440904238/sdks/python/apache_beam/io/gcp/bigquery_file_loads.py#L89


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to