ahmedabu98 opened a new pull request, #27384:
URL: https://github.com/apache/beam/pull/27384

   Users were running into an issue where a worker would fail (for whatever 
reason) after a load job was submitted to BigQuery but before the job could 
finish and report success. From Beam's perspective, this is a bundle failure, 
so the bundle is retried. The same files are collected into a load 
configuration and a job is submitted to BigQuery again, **with a different job 
ID**. BigQuery now receives two essentially identical load jobs under different 
names and accepts both, which duplicates data in the table.
   
   This PR makes load job names deterministic so that, when Beam retries a 
bundle and loads the same files, it uses the same job ID. When BigQuery 
receives a job ID it has already seen, it returns a 409 "already exists" 
error, which we already handle.
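
To illustrate the idea only (this is not the code in this PR), here is a minimal sketch of deriving a job ID from the load inputs and treating a duplicate submission as a no-op. Everything in it is hypothetical: `FakeBigQuery`, `AlreadyExistsError`, `deterministic_job_id`, and `submit_load` are stand-ins invented for the example, and hashing the destination plus the sorted file list is an assumed scheme, not necessarily the one Beam uses.

```python
import hashlib


class AlreadyExistsError(Exception):
    """Hypothetical stand-in for BigQuery's HTTP 409 "already exists" error."""


class FakeBigQuery:
    """Hypothetical in-memory stand-in for the load-job API (not a real client)."""

    def __init__(self):
        self.jobs = {}
        self.files_loaded = 0

    def insert_load_job(self, job_id, files):
        # Reject a job ID that was already accepted, as described above.
        if job_id in self.jobs:
            raise AlreadyExistsError(job_id)
        self.jobs[job_id] = list(files)
        self.files_loaded += len(files)


def deterministic_job_id(prefix, destination, files):
    # Assumed scheme: hash the destination and the sorted file list so that a
    # retried bundle with the same inputs derives the same job ID.
    digest = hashlib.sha256()
    digest.update(destination.encode("utf-8"))
    for path in sorted(files):
        digest.update(b"\0")
        digest.update(path.encode("utf-8"))
    return f"{prefix}_{digest.hexdigest()[:32]}"


def submit_load(client, prefix, destination, files):
    job_id = deterministic_job_id(prefix, destination, files)
    try:
        client.insert_load_job(job_id, files)
    except AlreadyExistsError:
        # An earlier attempt already submitted this job, so nothing is resubmitted.
        pass
    return job_id
```

With this sketch, calling `submit_load` twice with the same destination and files registers a single job in `FakeBigQuery`, whereas a random suffix in the ID would register two.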
   
   Other steps in this write method (copy jobs, schema update, table deletion) 
are already deterministic.

