ahmedabu98 commented on PR #28272:
URL: https://github.com/apache/beam/pull/28272#issuecomment-1714176376

   After a Reshuffle (we have one [before single 
loads](https://github.com/ahmedabu98/beam/blob/67e88d9fec9b4ac39a6d42a4bc1e64e1236bc309/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java#L806)
 and [temp 
loads](https://github.com/ahmedabu98/beam/blob/67e88d9fec9b4ac39a6d42a4bc1e64e1236bc309/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java#L763)),
 RunnerV2 doesn't propagate pane indices like legacy runner does. Instead of an 
increasing pane index, RunnerV2 keeps it at 0. 
   
   After our reshuffles, we directly rely on pane index to [construct a job 
ID](https://github.com/ahmedabu98/beam/blob/67e88d9fec9b4ac39a6d42a4bc1e64e1236bc309/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteTables.java#L252-L254).
 In the RunnerV2 case the index stays at 0 so in some cases (e.g. single loads) 
it produces the same job ID for each trigger. The resulting behavior is that 
only the first load is successful and the rest are seen as duplicates by BQ and 
ignored. 
   
   Yi's solution here is to save the pane index from WritePartitions (a stage 
that is before Reshuffle) and use that information instead when creating the 
Job ID in WriteTables.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to