ahmedabu98 commented on code in PR #26270:
URL: https://github.com/apache/beam/pull/26270#discussion_r1171221658
##########
sdks/python/apache_beam/io/gcp/bigquery.py:
##########
@@ -2666,11 +2664,10 @@ def _expand_export(self, pcoll):
GoogleCloudOptions).temp_location
job_name = pcoll.pipeline.options.view_as(GoogleCloudOptions).job_name
gcs_location_vp = self.gcs_location
- unique_id = str(uuid.uuid4())[0:10]
def file_path_to_remove(unused_elm):
gcs_location = bigquery_export_destination_uri(
- gcs_location_vp, temp_location, unique_id, True)
+ gcs_location_vp, temp_location, str(uuid.uuid4())[0:10], True)
Review Comment:
This line needs more thought. As written, the path computed inside
`file_path_to_remove` is randomly generated on every call, so it does not
actually point to any exported files.
Prior to this change, the delete step used the same `unique_id`/`source_uuid`
that produced the filepath generated by the export read here:
https://github.com/apache/beam/blob/fc7a240eb4e5f2653225c594c582749047664202/sdks/python/apache_beam/io/gcp/bigquery.py#L865-L866
Now each step creates its own unique id, i.e. its own filepath. We end up
exporting to and reading from one filepath, then attempting to delete a
different one.
With large reads, this leaves a lot of undeleted temp files on GCS. Because
this introduces a serious regression, I recommend rolling back this PR until
a better solution is found.
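For concreteness, here is a minimal runnable sketch of the mismatch, using a
hypothetical `destination_uri` stand-in for `bigquery_export_destination_uri`
and a made-up bucket path:

```python
import uuid


def destination_uri(base, unique_id):
    # Hypothetical stand-in for bigquery_export_destination_uri; the real
    # helper also takes gcs_location_vp / temp_location and a directory flag.
    return f"{base}/{unique_id}"


BASE = "gs://my-temp-bucket/bq_export"  # made-up temp location

# Pre-change behavior: one id shared by the export/read and the cleanup,
# so the cleanup resolves to the path the export actually wrote to.
shared_id = str(uuid.uuid4())[0:10]
assert destination_uri(BASE, shared_id) == destination_uri(BASE, shared_id)

# Post-change behavior: each step draws a fresh id, so the cleanup path
# (almost surely) never matches the export path, and the files are orphaned.
export_path = destination_uri(BASE, str(uuid.uuid4())[0:10])
cleanup_path = destination_uri(BASE, str(uuid.uuid4())[0:10])
assert export_path != cleanup_path
```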