jaketf commented on issue #10382:
URL: https://github.com/apache/airflow/issues/10382#issuecomment-675661153


   I think this is straightforward for import / query / copy jobs, as they are all internal to BigQuery and committed atomically.
   
   There may be a corner case with extract (to GCS) jobs. I do not believe export jobs > 1GB are atomic, because BigQuery writes the export as sharded GCS files. If the job is killed at just the right time, only some of those shards will have been committed to GCS.
   Would our expected `on_kill` behavior be to clean up those files?
   If we were to rerun the same export (with the same destination URIs in the config), those files would likely just be overwritten, UNLESS the table has become much smaller or larger between the original (killed) try and the second try, causing the number of shards to change.
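   One possible `on_kill` behavior is sketched below: delete everything already committed under the export's destination prefix. This is a hypothetical sketch, not Airflow's actual hook API; the `list_blobs` / `delete_blob` callables and `destination_prefix` name are assumptions, injected so the cleanup logic can be exercised without a real GCS client.

   ```python
   # Hypothetical on_kill cleanup for a killed BigQuery extract job.
   # The storage operations are injected as callables (assumptions, not a
   # real client API) so the deletion logic is testable without network access.

   def cleanup_partial_export(list_blobs, delete_blob, destination_prefix):
       """Delete any GCS objects already committed under the export prefix.

       list_blobs(prefix) -> iterable of object names under that prefix
       delete_blob(name)  -> deletes one object
       Returns the list of deleted object names.
       """
       deleted = []
       for name in list(list_blobs(destination_prefix)):
           delete_blob(name)
           deleted.append(name)
       return deleted
   ```

   With a real client these callables would wrap bucket list/delete calls; the point is only that cleanup is a prefix-scoped delete, which sidesteps the stale-shard problem entirely.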
   
   For example:
   Original extract commits these files to GCS 
   shard-00-of-5
   shard-01-of-5
   [original extract job killed]
   [we delete a few partitions from the source table]
   [submit a new extract w/ same config]
   shard-00-of-3
   shard-01-of-3
   shard-02-of-3
   
   This will leave the GCS prefix looking like this:
   shard-00-of-3
   shard-00-of-5
   shard-01-of-3
   shard-01-of-5
   shard-02-of-3
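   
   The stale files from the killed attempt can be spotted mechanically, since their shard-count suffix no longer matches the latest export. A toy sketch (using the illustrative `shard-NN-of-M` naming from the example above, which is an assumption; real BigQuery wildcard exports use a different numbering scheme):
   
   ```python
   import re

   # Matches the illustrative "shard-NN-of-M" naming used in the example.
   SHARD_RE = re.compile(r"shard-(\d+)-of-(\d+)$")

   def stale_shards(listing, current_total):
       """Return files whose shard count doesn't match the latest export."""
       stale = []
       for name in listing:
           m = SHARD_RE.search(name)
           if m and int(m.group(2)) != current_total:
               stale.append(name)
       return stale
   ```
   
   Run against the prefix listing above with `current_total=3`, this flags the two leftover `-of-5` files, which is exactly the garbage a retry without cleanup would leave behind.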
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

