shunping commented on code in PR #33611:
URL: https://github.com/apache/beam/pull/33611#discussion_r1927178003
##########
sdks/python/apache_beam/io/gcp/gcsio.py:
##########
@@ -247,13 +247,35 @@ def open(
def delete(self, path):
"""Deletes the object at the given GCS path.
+  If the path is a directory (prefix), it deletes all blobs under that
+  prefix.
+
Args:
path: GCS file path pattern in the form gs://<bucket>/<name>.
"""
bucket_name, blob_name = parse_gcs_path(path)
bucket = self.client.bucket(bucket_name)
+
+ # Check if the blob is a directory (prefix) by listing objects
+ # under that prefix.
+ blobs = list(bucket.list_blobs(prefix=blob_name))
Review Comment:
I am afraid this line could impact the performance of existing pipelines. In
particular, we now issue an extra HTTP request to GCS to list blobs in the
bucket on every delete, no matter what. If an existing pipeline uses
gcsio.delete() to delete a directory with a large number of files (one
delete call per file), this change will double the HTTP requests to GCS,
which may lead to a request-quota-exceeded error.

I think a safer approach is to add a new API to gcsio specifically for this
purpose, and then change delete() in gcsfilesystem.py to use that API. Note
that the original issue #27605 concerns the behavior of delete() under
gcsfilesystem.py.
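
To make the suggestion concrete, here is a minimal sketch of the shape the
reviewer proposes. All names (`SketchGcsIO`, `delete_prefix`, `FakeBucket`) are
illustrative, not the actual Beam or google-cloud-storage API; an in-memory
fake bucket stands in for the real GCS client. The point is the separation:
`delete()` stays a single-request operation, and only the new prefix-deletion
path pays for the listing request.

```python
class FakeBucket:
  """In-memory stand-in for a GCS bucket (illustration only, not the real client)."""
  def __init__(self, blob_names):
    self.blobs = set(blob_names)

  def delete_blob(self, name):
    self.blobs.discard(name)

  def list_blobs(self, prefix):
    # The real client call is an extra HTTP request; here it is just a scan.
    return [name for name in self.blobs if name.startswith(prefix)]


class SketchGcsIO:
  """Hypothetical gcsio-like wrapper showing the proposed API split."""
  def __init__(self, bucket):
    self.bucket = bucket

  def delete(self, blob_name):
    # Unchanged fast path: exactly one delete request, no listing,
    # so existing per-file deletes see no extra quota usage.
    self.bucket.delete_blob(blob_name)

  def delete_prefix(self, prefix):
    # New API for gcsfilesystem.delete() to call when removing a
    # "directory": only this path incurs the list request.
    for name in list(self.bucket.list_blobs(prefix=prefix)):
      self.bucket.delete_blob(name)
```

Under this split, `gcsfilesystem.delete()` would route directory paths to
`delete_prefix()` and leave single-object deletes on the original one-request
`delete()` path.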
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]