ahmedabu98 commented on code in PR #26270:
URL: https://github.com/apache/beam/pull/26270#discussion_r1171221658


##########
sdks/python/apache_beam/io/gcp/bigquery.py:
##########
@@ -2666,11 +2664,10 @@ def _expand_export(self, pcoll):
         GoogleCloudOptions).temp_location
     job_name = pcoll.pipeline.options.view_as(GoogleCloudOptions).job_name
     gcs_location_vp = self.gcs_location
-    unique_id = str(uuid.uuid4())[0:10]
 
     def file_path_to_remove(unused_elm):
       gcs_location = bigquery_export_destination_uri(
-          gcs_location_vp, temp_location, unique_id, True)
+          gcs_location_vp, temp_location, str(uuid.uuid4())[0:10], True)

Review Comment:
   Some more thought needs to be put into this line. What ends up happening is that the path produced by `file_path_to_remove` is randomly generated and does not actually point to any files.
   
   Prior to this change, the same `unique_id`/`source_uuid` was used to build the filepath generated in the export read here: https://github.com/apache/beam/blob/fc7a240eb4e5f2653225c594c582749047664202/sdks/python/apache_beam/io/gcp/bigquery.py#L865-L866
   
   But now each step creates its own unique id, i.e. its own filepath. We end up exporting to and reading from one filepath, then attempting to delete a different one.
   
   With large reads, this will leave a lot of temp files on GCS undeleted. Since this introduces a serious regression, I recommend rolling back this PR until a better solution is found.
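   
   To make the mismatch concrete, here is a minimal, self-contained sketch. `export_destination_uri` and the bucket path are hypothetical stand-ins for `bigquery_export_destination_uri` and the real temp location, used only to show why a shared `unique_id` matters:

```python
import uuid

# Hypothetical stand-in for bigquery_export_destination_uri: derives the
# export directory from a temp location and a unique id, mirroring the
# call shape in the diff above.
def export_destination_uri(temp_location, unique_id):
    return '%s/bq_export/%s/' % (temp_location, unique_id)

temp_location = 'gs://some-bucket/tmp'  # hypothetical temp location

# After this PR: each step draws its own uuid, so the paths diverge
# (a collision in the first 10 hex chars is astronomically unlikely).
export_path = export_destination_uri(temp_location, str(uuid.uuid4())[0:10])
remove_path = export_destination_uri(temp_location, str(uuid.uuid4())[0:10])
assert export_path != remove_path  # cleanup targets nothing; files leak

# Before this PR: one shared uuid, so cleanup targets the exported files.
unique_id = str(uuid.uuid4())[0:10]
export_path = export_destination_uri(temp_location, unique_id)
remove_path = export_destination_uri(temp_location, unique_id)
assert export_path == remove_path
```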


