damccorm commented on code in PR #34657:
URL: https://github.com/apache/beam/pull/34657#discussion_r2051196813


##########
sdks/python/apache_beam/io/gcp/bigquery_file_loads.py:
##########
@@ -1101,6 +1101,18 @@ def _load_data(
          of the load jobs would fail but not other. If any of them fails, then
          copy jobs are not triggered.
     """
+    self.reshuffle_before_load = not util.is_compat_version_prior_to(
+        p.options, "2.65.0")
+    if self.reshuffle_before_load:
+      # Ensure that TriggerLoadJob retry inputs are deterministic by breaking

Review Comment:
   OK, after chatting a bit offline, I'm fine with this change as long as we 
can be confident that it fixes the issue. The performance cost is vanishingly 
small, since only file names are being shuffled, not records. That confidence 
should come from either empirical evidence or a theoretical argument, though.
   
   @kennknowles may have ideas on what is going on here, since it may be related 
to triggering semantics: 
https://github.com/apache/beam/blob/56b286cefabcaefe551785a048ff4413e79722a8/sdks/python/apache_beam/io/gcp/bigquery_file_loads.py#L961
   
   I will be AFK for the next 2 weeks, so don't block on me going forward :)
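
   For readers following the diff above, here is a minimal sketch of how a 
compatibility-version gate like the one on `reshuffle_before_load` can behave. 
It is not Beam's implementation of `util.is_compat_version_prior_to`: the 
helpers `parse_version`, `is_prior_to`, and `should_reshuffle_before_load` are 
hypothetical, and the rule that an unset compatibility version means "use the 
current behavior" is an assumption.

```python
from typing import Optional, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '2.65.0' into (2, 65, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def is_prior_to(compat_version: Optional[str], breaking_version: str) -> bool:
    """Hypothetical stand-in for the gate in the diff, not Beam's code.

    Assumption: an unset compatibility version means the pipeline is not
    pinned to older behavior, so it is never treated as "prior to".
    """
    if compat_version is None:
        return False
    return parse_version(compat_version) < parse_version(breaking_version)


def should_reshuffle_before_load(compat_version: Optional[str]) -> bool:
    # Mirrors the shape of
    # `not util.is_compat_version_prior_to(p.options, "2.65.0")`.
    return not is_prior_to(compat_version, "2.65.0")
```

   Comparing parsed tuples rather than raw strings matters here: as strings, 
"2.9.0" sorts after "2.65.0", which would wrongly enable the new reshuffle for 
a pipeline pinned to an older release.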


