claudevdm commented on code in PR #34657:
URL: https://github.com/apache/beam/pull/34657#discussion_r2049689721
##########
sdks/python/apache_beam/io/gcp/bigquery_file_loads.py:
##########
@@ -1101,6 +1101,18 @@ def _load_data(
of the load jobs would fail but not other. If any of them fails, then
copy jobs are not triggered.
"""
+ self.reshuffle_before_load = not util.is_compat_version_prior_to(
+ p.options, "2.65.0")
+ if self.reshuffle_before_load:
+ # Ensure that TriggerLoadJob retry inputs are deterministic by breaking
Review Comment:
Good question. I am not sure exactly where the non-determinism currently
comes from, but we have seen the number of uploaded files differ between
retries during autoscaling, and this is the only plausible explanation I
could come up with.
> that should be deterministic since its operating per-element, but it is
> possible I'm missing something
Can you elaborate on this?
Does GroupByKey guarantee determinism for the inputs to PartitionFiles?
Without a Reshuffle, it looks like part of GroupFilesByTableDestinations (it
is listed as being part of 3 stages?), PartitionFiles, and the TriggerLoadJobs
steps are fused into a single stage.
Adding a Reshuffle puts the TriggerLoadJobs* steps in their own stages, but it
is less obvious what happens with just the GBK.
Java has this precaution:
https://github.com/apache/beam/blob/38192def9fa842d7365f57fc7bdc03d9035db64c/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java#L822
https://github.com/apache/beam/blob/38192def9fa842d7365f57fc7bdc03d9035db64c/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BatchLoads.java#L865
Also, https://cloud.google.com/dataflow/docs/concepts/exactly-once#output-delivery
mentions that the best practice for IOs is to add a Reshuffle before doing a
write with side effects.
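
To make the concern concrete, here is a toy model in plain Python (no Beam) of why replaying a stable input matters for a step with side effects. Everything in it is an assumption for illustration: `upstream_order`, `fused_attempt`, and `checkpointed_attempts` are hypothetical names, and the rotation is only a deterministic stand-in for whatever actually varies between retries, which I have not pinned down. It is not how the runner or `bigquery_file_loads.py` is implemented.

```python
from typing import List, Sequence, Tuple

Partition = Tuple[str, ...]


def partition_files(files: Sequence[str], max_per_job: int) -> List[Partition]:
    """Toy PartitionFiles: chunk files into load jobs in arrival order."""
    return [
        tuple(files[i:i + max_per_job])
        for i in range(0, len(files), max_per_job)
    ]


def upstream_order(files: Sequence[str], attempt: int) -> List[str]:
    """Hypothetical upstream step whose output order depends on the attempt.

    The rotation is an assumed stand-in for real retry-to-retry variation.
    """
    if not files:
        return []
    shift = attempt % len(files)
    return list(files[shift:]) + list(files[:shift])


def fused_attempt(files: Sequence[str], attempt: int,
                  max_per_job: int = 2) -> List[Partition]:
    # Fused stage: a retry re-runs the upstream step, so the input to the
    # side-effecting step may differ from the previous attempt.
    return partition_files(upstream_order(files, attempt), max_per_job)


def checkpointed_attempts(files: Sequence[str], attempts: int,
                          max_per_job: int = 2) -> List[List[Partition]]:
    # Stage boundary: the upstream output is stored once, and every retry of
    # the downstream step replays that same stored input.
    stored = upstream_order(files, attempt=0)
    return [partition_files(stored, max_per_job) for _ in range(attempts)]


files = ["f0", "f1", "f2", "f3", "f4"]
fused = [fused_attempt(files, attempt) for attempt in range(2)]
stable = checkpointed_attempts(files, attempts=2)

print(fused[0] == fused[1])    # False: the two attempts build different jobs
print(stable[0] == stable[1])  # True: both attempts see identical input
```

In this model the fused retry would submit load jobs with different file groupings than the first attempt, while the checkpointed retry repeats exactly the same jobs. Whether GBK alone gives the same guarantee is the open question above.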
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.