ahmedabu98 commented on PR #28272: URL: https://github.com/apache/beam/pull/28272#issuecomment-1714176376
After a Reshuffle (we have one [before single loads](https://github.com/ahmedabu98/beam/blob/67e88d9fec9b4ac39a6d42a4bc1e64e1236bc309/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java#L806) and [temp loads](https://github.com/ahmedabu98/beam/blob/67e88d9fec9b4ac39a6d42a4bc1e64e1236bc309/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java#L763)), RunnerV2 doesn't propagate pane indices like legacy runner does. Instead of an increasing pane index, RunnerV2 keeps it at 0. After our reshuffles, we directly rely on pane index to [construct a job ID](https://github.com/ahmedabu98/beam/blob/67e88d9fec9b4ac39a6d42a4bc1e64e1236bc309/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteTables.java#L252-L254). In the RunnerV2 case the index stays at 0 so in some cases (e.g. single loads) it produces the same job ID for each trigger. The resulting behavior is that only the first load is successful and the rest are seen as duplicates by BQ and ignored. Yi's solution here is to save the pane index from WritePartitions (a stage that is before Reshuffle) and use that information instead when creating the Job ID in WriteTables. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
