[GitHub] [beam] ahmedabu98 commented on a diff in pull request #26002: Moving UUID definition in Python BQ IO

via GitHub Tue, 28 Mar 2023 08:18:22 -0700


ahmedabu98 commented on code in PR #26002:
URL: https://github.com/apache/beam/pull/26002#discussion_r1150741857



##########
sdks/python/apache_beam/io/gcp/bigquery.py:
##########
@@ -2512,7 +2511,7 @@ def file_path_to_remove(unused_elm):
                 method=self.method,
                 job_name=job_name,
                 step_name=step_name,
-                unique_id=unique_id,
+                unique_id=str(uuid.uuid4())[0:10],

Review Comment:
   It's critical that the `unique_id` here is equal to `unique_id` on line 
2492. BigQuery Export read generally works in three steps: 
   1. [output to files in 
GCS](https://github.com/apache/beam/blob/91bfbac7a90af3b2de3f432f5f0589c527025270/sdks/python/apache_beam/io/gcp/bigquery.py#L851)
   2. read from those files
   3. [delete GCS 
files](https://github.com/apache/beam/blob/91bfbac7a90af3b2de3f432f5f0589c527025270/sdks/python/apache_beam/io/gcp/bigquery.py#L2671)
   
   The unique id is used when creating those file names, and the same unique id 
is used to generate the name of the directory to delete after the read 
operation. Unique ID should be the same for both so that we can clean up 
properly.



##########
sdks/python/apache_beam/io/gcp/bigquery.py:
##########
@@ -2674,7 +2672,7 @@ def expand(self, pcoll):
             bigquery_job_labels=self.bigquery_job_labels,
             job_name=job_name,
             step_name=step_name,
-            unique_id=unique_id,
+            unique_id=str(uuid.uuid4())[0:10],

Review Comment:
   Does doing it this way make a difference? I'm not so familiar, would refer 
to a Python SDK expert for this



##########
sdks/python/apache_beam/io/gcp/bigquery.py:
##########
@@ -2512,7 +2511,7 @@ def file_path_to_remove(unused_elm):
                 method=self.method,
                 job_name=job_name,
                 step_name=step_name,
-                unique_id=unique_id,
+                unique_id=str(uuid.uuid4())[0:10],

Review Comment:
   What you could do here is create the `unique_id` beforehand in a DoFn then 
wrap it in a `pvalue.AsSingleton` and pass it to both the Read and Cleanup 
operations as a side-input. See an example 
[here](https://github.com/apache/beam/blob/a4f55afa052b83237387698649b127570e1e5a63/sdks/python/apache_beam/io/gcp/bigquery_file_loads.py#L1180)
 that does this.
   
   In the relevant issue, it mentions this is the way Java does it. It [creates 
the value in a 
DoFn](https://github.com/apache/beam/blob/b3b184361100492f92fcc51ff82a0fcd962d5ee0/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L1094-L1106)
 then uses it as a sideinput for 
[read](https://github.com/apache/beam/blob/b3b184361100492f92fcc51ff82a0fcd962d5ee0/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L1150)
 and 
[cleanup](https://github.com/apache/beam/blob/b3b184361100492f92fcc51ff82a0fcd962d5ee0/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.java#L1206).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [beam] ahmedabu98 commented on a diff in pull request #26002: Moving UUID definition in Python BQ IO

Reply via email to