ahmedabu98 commented on code in PR #26270:
URL: https://github.com/apache/beam/pull/26270#discussion_r1171221658
##########
sdks/python/apache_beam/io/gcp/bigquery.py:
##########
@@ -2666,11 +2664,10 @@ def _expand_export(self, pcoll):
GoogleCloudOptions).temp_location
job_name = pcoll.pipeline.options.view_as(GoogleCloudOptions).job_name
gcs_location_vp = self.gcs_location
- unique_id = str(uuid.uuid4())[0:10]
def file_path_to_remove(unused_elm):
gcs_location = bigquery_export_destination_uri(
- gcs_location_vp, temp_location, unique_id, True)
+ gcs_location_vp, temp_location, str(uuid.uuid4())[0:10], True)
Review Comment:
This line needs more thought. As written, the path computed inside
`file_path_to_remove` is randomly generated on every call, so it does not
actually point to any exported files.
Prior to this change, the delete step used the same `unique_id`/`source_uuid`
that produced the filepath generated by the export read here:
https://github.com/apache/beam/blob/fc7a240eb4e5f2653225c594c582749047664202/sdks/python/apache_beam/io/gcp/bigquery.py#L865-L866
Now each step creates its own unique id, i.e. its own filepath. We end up
exporting to and reading from one filepath, then attempting to delete a
different one.
With large reads, this leaves a lot of undeleted temp files on GCS. Because
this introduces a serious regression, I recommend rolling back this PR until
a better solution is found.
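For concreteness, here is a minimal runnable sketch of the mismatch, using a
hypothetical `destination_uri` stand-in for `bigquery_export_destination_uri`
and a made-up bucket path:

```python
import uuid


def destination_uri(base, unique_id):
    # Hypothetical stand-in for bigquery_export_destination_uri; the real
    # helper also takes gcs_location_vp / temp_location and a directory flag.
    return f"{base}/{unique_id}"


BASE = "gs://my-temp-bucket/bq_export"  # made-up temp location

# Pre-change behavior: one id shared by the export/read and the cleanup,
# so the cleanup resolves to the path the export actually wrote to.
shared_id = str(uuid.uuid4())[0:10]
assert destination_uri(BASE, shared_id) == destination_uri(BASE, shared_id)

# Post-change behavior: each step draws a fresh id, so the cleanup path
# (almost surely) never matches the export path, and the files are orphaned.
export_path = destination_uri(BASE, str(uuid.uuid4())[0:10])
cleanup_path = destination_uri(BASE, str(uuid.uuid4())[0:10])
assert export_path != cleanup_path
```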