claudevdm commented on code in PR #34657: URL: https://github.com/apache/beam/pull/34657#discussion_r2049689721
########## sdks/python/apache_beam/io/gcp/bigquery_file_loads.py: ########## @@ -1101,6 +1101,18 @@ def _load_data( of the load jobs would fail but not other. If any of them fails, then copy jobs are not triggered. """ + self.reshuffle_before_load = not util.is_compat_version_prior_to( + p.options, "2.65.0") + if self.reshuffle_before_load: + # Ensure that TriggerLoadJob retry inputs are deterministic by breaking Review Comment: Good question, I am not sure exactly where the non-determinism currently come from, but we have seen cases of number of files being uploaded being different between retries during autoscaling and this is the only plausible explanation I could come up with. > that should be deterministic since its operating per-element, but it is possible I'm missing something Can you elaborate on this? Does GroupByKey guarantee determinism for the inputs to PartitionFiles? Without a Reshuffle it looks like part of the GroupFilesByTableDestinations (it lists being part of 3 stages?), PartitionFiles and the TriggerLoadJobs are fused into a single stage. Adding a reshuffle puts TriggerLoadJobs* in their own stages, but it is less obvious what is happening with just the GBK. Java has this precaution https://github.com/apache/beam/blob/38192def9fa842d7365f57fc7bdc03d9035db64c/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java#L822 https://github.com/apache/beam/blob/38192def9fa842d7365f57fc7bdc03d9035db64c/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java#L865 https://cloud.google.com/dataflow/docs/concepts/exactly-once#output-delivery mentions best practice for IO's is to add a reshuffle before doing a write with side effects. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@beam.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org