mohamedawnallah commented on code in PR #33611:
URL: https://github.com/apache/beam/pull/33611#discussion_r1927380720


##########
sdks/python/apache_beam/io/gcp/gcsio.py:
##########
@@ -247,13 +247,35 @@ def open(
   def delete(self, path):
     """Deletes the object at the given GCS path.
 
+    If the path is a directory (prefix), it deletes all blobs under that prefix.
+
     Args:
       path: GCS file path pattern in the form gs://<bucket>/<name>.
     """
     bucket_name, blob_name = parse_gcs_path(path)
     bucket = self.client.bucket(bucket_name)
+
+    # Check if the blob is a directory (prefix) by listing objects
+    # under that prefix.
+    blobs = list(bucket.list_blobs(prefix=blob_name))

Review Comment:
   > If an existing pipeline is using gcsio.delete() to delete a directory with a large number of files, then the change will double the HTTP requests to GCS, which may lead to a request quota exceeded error.
   
   Makes sense 👍
   
   > Notice that the original issue https://github.com/apache/beam/issues/27605 is related to the behavior of delete() under gcsfilesystem.py.
   
   Oops, missed that. 🙏
   
   > I think a safer approach is to add a new api in gcsio particularly for this function, then change delete() in gcsfilesystem.py to use this api. We can then recommend users to use this api while deleting a directory.
   
   How about adding an optional parameter, such as `recursive`, with a default value of `False`? That keeps backward compatibility and avoids any performance impact on existing pipelines that use the `delete` function in `gcsfilesystem.py`.
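   
   For illustration, here is a minimal sketch of that idea, assuming the surrounding `GcsIO` context from the diff above (`parse_gcs_path`, `self.client`). The `recursive` name and control flow are just a proposal, not existing Beam API:
   
   ```python
   def delete(self, path, recursive=False):
     """Deletes the object at the given GCS path.
   
     Args:
       path: GCS file path pattern in the form gs://<bucket>/<name>.
       recursive: If True, treat the path as a prefix and delete every
         blob under it. Defaults to False for backward compatibility.
     """
     bucket_name, blob_name = parse_gcs_path(path)
     bucket = self.client.bucket(bucket_name)
     if recursive:
       # The extra list request is only issued when the caller opts in,
       # so existing single-object deletes keep one request per call.
       for blob in bucket.list_blobs(prefix=blob_name):
         blob.delete()
     else:
       bucket.blob(blob_name).delete()
   ```
   
   Defaulting `recursive` to `False` keeps the single-object path (and its request count) unchanged for existing callers, while `gcsfilesystem.py` can opt in explicitly when deleting a directory.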


