[GitHub] [beam] damccorm opened a new issue, #20841: BigQuery source creates new dataset IDs during split

GitBox Sat, 04 Jun 2022 12:54:16 -0700


damccorm opened a new issue, #20841:
URL: https://github.com/apache/beam/issues/20841


   If the user does not specify one, the source creates the temp dataset ID here: 
[https://github.com/apache/beam/blob/release-2.27.0/sdks/python/apache_beam/io/gcp/bigquery.py#L786](https://github.com/apache/beam/blob/release-2.27.0/sdks/python/apache_beam/io/gcp/bigquery.py#L786)
   
   This means that re-runs of the same source split work item will have different 
temporary dataset IDs.
   
   If a split() call fails before cleaning up its dataset, that dataset is 
never cleaned up, even if the job eventually succeeds after work item retries.
   
   Instead, we should create the temp dataset ID at source creation so that it 
is shared between re-runs of the same work item. This might be incompatible 
with templates, so we might have to wait until we have an SDF-based BigQuery 
source.
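
   To illustrate the difference, here is a minimal, self-contained sketch. It 
does not use the Beam or BigQuery APIs: `FakeBigQuery`, `PerSplitIdSource`, 
`SharedIdSource`, and the helper functions are hypothetical names made up for 
this example, and the "failure" is simulated. It only shows why an ID generated 
inside split() can leak a dataset on retry, while an ID fixed at source 
creation does not.

```python
import uuid


class FakeBigQuery:
    """In-memory stand-in for the BigQuery service (hypothetical)."""

    def __init__(self):
        self.datasets = set()

    def create_dataset(self, dataset_id):
        self.datasets.add(dataset_id)

    def delete_dataset(self, dataset_id):
        self.datasets.discard(dataset_id)


class PerSplitIdSource:
    """Mimics the reported behavior: a fresh temp dataset ID per split() call."""

    def __init__(self, client):
        self.client = client

    def split(self, fail=False):
        dataset_id = "temp_dataset_" + uuid.uuid4().hex
        self.client.create_dataset(dataset_id)
        if fail:
            raise RuntimeError("simulated failure before cleanup")
        self.client.delete_dataset(dataset_id)


class SharedIdSource:
    """Proposed behavior: the ID is chosen once, at source creation."""

    def __init__(self, client):
        self.client = client
        self.dataset_id = "temp_dataset_" + uuid.uuid4().hex

    def split(self, fail=False):
        self.client.create_dataset(self.dataset_id)
        if fail:
            raise RuntimeError("simulated failure before cleanup")
        self.client.delete_dataset(self.dataset_id)


def run_with_one_retry(source):
    """Fails the first split() attempt, then succeeds on the retry."""
    try:
        source.split(fail=True)
    except RuntimeError:
        source.split(fail=False)


def leaked_datasets(source_cls):
    """Returns how many temp datasets remain after a failed-then-retried split."""
    client = FakeBigQuery()
    run_with_one_retry(source_cls(client))
    return len(client.datasets)


if __name__ == "__main__":
    print("per-split ID leaks:", leaked_datasets(PerSplitIdSource))
    print("shared ID leaks:", leaked_datasets(SharedIdSource))
```

   In the sketch, the retry of `SharedIdSource` reuses the dataset left behind 
by the failed attempt and then deletes it, so nothing is orphaned. The sketch 
ignores the template concern mentioned above.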
   
    
   
   (Java potentially has a similar bug in BigQuerySource, but Java has a 
withTemplateCompatibility() option that runs the BigQuery read using DoFns, 
which should perform the cleanup correctly.)
   
    
   
   Imported from Jira 
[BEAM-12002](https://issues.apache.org/jira/browse/BEAM-12002). Original Jira 
may contain additional context.
   Reported by: chamikara.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
