ahmedabu98 commented on code in PR #26002:
URL: https://github.com/apache/beam/pull/26002#discussion_r1150741857
##########
sdks/python/apache_beam/io/gcp/bigquery.py:
##########
@@ -2512,7 +2511,7 @@ def file_path_to_remove(unused_elm):
method=self.method,
job_name=job_name,
step_name=step_name,
- unique_id=unique_id,
+ unique_id=str(uuid.uuid4())[0:10],
Review Comment:
It's critical that the `unique_id` here is equal to `unique_id` on line
2492. BigQuery Export read generally works in three steps:
1. [output to files in
GCS](https://github.com/apache/beam/blob/91bfbac7a90af3b2de3f432f5f0589c527025270/sdks/python/apache_beam/io/gcp/bigquery.py#L851)
2. read from those files
3. [delete GCS
files](https://github.com/apache/beam/blob/91bfbac7a90af3b2de3f432f5f0589c527025270/sdks/python/apache_beam/io/gcp/bigquery.py#L2671)
The unique id is used when creating those file names, and the same unique id
is used to generate the name of the directory to delete after the read
operation. Unique ID should be the same for both so that we can clean up
properly.
##########
sdks/python/apache_beam/io/gcp/bigquery.py:
##########
@@ -2674,7 +2672,7 @@ def expand(self, pcoll):
bigquery_job_labels=self.bigquery_job_labels,
job_name=job_name,
step_name=step_name,
- unique_id=unique_id,
+ unique_id=str(uuid.uuid4())[0:10],
Review Comment:
Does doing it this way make a difference? I'm not so familiar, would refer
to a Python SDK expert for this
##########
sdks/python/apache_beam/io/gcp/bigquery.py:
##########
@@ -2512,7 +2511,7 @@ def file_path_to_remove(unused_elm):
method=self.method,
job_name=job_name,
step_name=step_name,
- unique_id=unique_id,
+ unique_id=str(uuid.uuid4())[0:10],
Review Comment:
What you could do here is create the `unique_id` beforehand in a DoFn then
wrap it in a `pvalue.AsSingleton` and pass it to both the Read and Cleanup
operations as a side-input. See an example
[here](https://github.com/apache/beam/blob/a4f55afa052b83237387698649b127570e1e5a63/sdks/python/apache_beam/io/gcp/bigquery_file_loads.py#L1180)
that does this.
In the relevant issue, it mentions this is the way Java does it. It [creates
the value in a
DoFn](https://github.com/apache/beam/blob/b3b184361100492f92fcc51ff82a0fcd962d5ee0/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L1094-L1106)
then uses it as a sideinput for
[read](https://github.com/apache/beam/blob/b3b184361100492f92fcc51ff82a0fcd962d5ee0/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L1150)
and
[cleanup](https://github.com/apache/beam/blob/b3b184361100492f92fcc51ff82a0fcd962d5ee0/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L1206).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]