[jira] [Created] (BEAM-12002) BigQuery source creates new dataset IDs during split

Chamikara Madhusanka Jayalath (Jira) Tue, 16 Mar 2021 19:29:04 -0700

Chamikara Madhusanka Jayalath created BEAM-12002:
----------------------------------------------------


             Summary: BigQuery source creates new dataset IDs during split
                 Key: BEAM-12002
                 URL: https://issues.apache.org/jira/browse/BEAM-12002
             Project: Beam
          Issue Type: Bug
          Components: io-java-gcp, io-py-gcp
            Reporter: Chamikara Madhusanka Jayalath


If the user does not specify one, the source generates the temp dataset ID 
during split, here: 
[https://github.com/apache/beam/blob/release-2.27.0/sdks/python/apache_beam/io/gcp/bigquery.py#L786]

This means that re-runs of the same source split work item will use different 
temporary dataset IDs.

If a split() call fails before cleaning up its dataset, that dataset is never 
cleaned up, even if the job eventually succeeds after work item retries, 
because each retry generates (and cleans up) a different dataset ID.

Instead, we should create the temp dataset ID at source creation so that it is 
shared between re-runs of the same work item. This might be incompatible with 
templates, so we might have to wait until we have an SDF-based BigQuery source.
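
A rough illustration of the difference, in plain Python with no Beam 
dependency. The class names, the "temp_dataset_" prefix and the use of 
copy.deepcopy to stand in for a retried work item are all assumptions made for 
this sketch; this is not the actual code in bigquery.py.

```python
import copy
import uuid


class SplitTimeIdSource:
    """Hypothetical source that picks the temp dataset ID inside split()."""

    def split(self):
        # A fresh ID on every call, so a retried work item sees a new dataset
        # and never learns about the one left behind by the failed attempt.
        return "temp_dataset_" + uuid.uuid4().hex


class ConstructionTimeIdSource:
    """Hypothetical source that picks the temp dataset ID once, at creation."""

    def __init__(self, temp_dataset_id=None):
        # Use the caller's ID if given, otherwise generate one exactly once.
        self.temp_dataset_id = temp_dataset_id or "temp_dataset_" + uuid.uuid4().hex

    def split(self):
        # Every attempt reuses the ID fixed at construction time.
        return self.temp_dataset_id


current = SplitTimeIdSource()
first_attempt = current.split()
retry_attempt = current.split()  # differs from first_attempt

proposed = ConstructionTimeIdSource()
retried_copy = copy.deepcopy(proposed)  # stand-in for a re-run of the work item
same_id = proposed.split() == retried_copy.split()  # True
```

With the second shape, a retry that runs cleanup targets the same dataset the 
failed attempt created.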

 

(Java potentially has a similar bug in BigQuerySource, but Java has a 
withTemplateCompatibility() option, which runs the BigQuery read using DoFns 
and should perform the cleanup correctly.)

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
