claudevdm commented on code in PR #34657:
URL: https://github.com/apache/beam/pull/34657#discussion_r2049689721


##########
sdks/python/apache_beam/io/gcp/bigquery_file_loads.py:
##########
@@ -1101,6 +1101,18 @@ def _load_data(
          of the load jobs would fail but not other. If any of them fails, then
          copy jobs are not triggered.
     """
+    self.reshuffle_before_load = not util.is_compat_version_prior_to(
+        p.options, "2.65.0")
+    if self.reshuffle_before_load:
+      # Ensure that TriggerLoadJob retry inputs are deterministic by breaking

Review Comment:
   Good question. I am not sure exactly where the non-determinism currently 
comes from, but we have seen cases where the number of files uploaded 
differs between retries during autoscaling, and this is the only plausible 
explanation I could come up with.
   
   > that should be deterministic since its operating per-element, but it is 
possible I'm missing something
   
   Can you elaborate on this? 
   
   Does GroupByKey guarantee determinism for the inputs to PartitionFiles? 
Without a Reshuffle, it looks like part of GroupFilesByTableDestinations (it 
is listed as being part of 3 stages?), PartitionFiles, and the TriggerLoadJobs 
are fused into a single stage. 
   
   Adding a reshuffle puts TriggerLoadJobs* in their own stages, but it is less 
obvious what is happening with just the GBK.
   
   
   The Java SDK already has this precaution:
   
https://github.com/apache/beam/blob/38192def9fa842d7365f57fc7bdc03d9035db64c/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java#L822
   
https://github.com/apache/beam/blob/38192def9fa842d7365f57fc7bdc03d9035db64c/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java#L865
   
   https://cloud.google.com/dataflow/docs/concepts/exactly-once#output-delivery 
mentions that the best practice for IOs is to add a reshuffle before doing a 
write with side effects.
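   
   To make the concern concrete, here is a toy model in plain Python. It is 
not Beam's or Dataflow's actual execution model, and every name in it 
(`upstream_files`, `run_fused`, `run_with_checkpoint`) is hypothetical. It 
assumes an upstream step whose output order can change between attempts, and 
compares a fused stage that recomputes upstream on each retry with a stage 
that re-reads a materialized (checkpointed) input, which is roughly what a 
reshuffle provides.
   
```python
import random


def partition_files(files, max_files_per_partition):
    """Split file names into fixed-size partitions, in the order given."""
    return [
        files[i:i + max_files_per_partition]
        for i in range(0, len(files), max_files_per_partition)
    ]


def upstream_files(rng):
    """Hypothetical upstream step whose output order may differ per attempt."""
    files = [f"file-{i}" for i in range(7)]
    rng.shuffle(files)
    return files


def run_fused(attempts, seed):
    """Fused stage: every retry recomputes upstream before partitioning,
    so the partitions handed to the side-effecting step may differ."""
    rng = random.Random(seed)
    return [partition_files(upstream_files(rng), 3) for _ in range(attempts)]


def run_with_checkpoint(attempts, seed):
    """Checkpointed stage: upstream output is materialized once and every
    retry re-reads that same materialized input."""
    rng = random.Random(seed)
    materialized = upstream_files(rng)
    return [partition_files(materialized, 3) for _ in range(attempts)]
```
   
   In this model every attempt in `run_with_checkpoint` sees identical 
partitions, while `run_fused` gives no such guarantee. Whether GBK alone gives 
the same stability in the real pipeline is exactly the open question above.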



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
