Abacn commented on code in PR #25965:
URL: https://github.com/apache/beam/pull/25965#discussion_r1215138142


##########
sdks/python/apache_beam/io/gcp/gcsio.py:
##########
@@ -296,160 +234,87 @@ def delete_batch(self, paths):
     """
     if not paths:
       return []
-
-    paths = iter(paths)
+    if len(paths) > MAX_BATCH_OPERATION_SIZE:
+      raise TooManyRequests("Batch larger than %s", MAX_BATCH_OPERATION_SIZE)
     result_statuses = []
-    while True:
-      paths_chunk = list(islice(paths, MAX_BATCH_OPERATION_SIZE))

Review Comment:
   I see you are write for the previous behavior, and it should be considered a 
bug (after 100th element ignored). This might be a source of leftover temp test 
files in our test system. How about divide the incoming files into slices of 
100, and sending batch request sequentially?
   
   ```
   for i in range(0, len(paths), MAX_BATCH_OPERATION_SIZE):
     slice = paths[i:i + MAX_BATCH_OPERATION_SIZE]
     try:
       with self.client.batch():
         for path in paths::
         bucket_name, blob_path = parse_gcs_path(path)
         bucket = self.client.get_bucket(bucket_name)
         blob = storage.Blob(blob_path, bucket)
         blob.delete()
     except Exception as err:
       ...
   ```
   
   btw max batch size is 1000 for storage client:  
https://github.com/googleapis/python-storage/blob/2b449cd289cb573ca1653bba37ecb84d35a025ad/google/cloud/storage/batch.py#LL138C5-L138C21)
   
   Looking into the client library source code, it only tracks the lastly saw 
error (and drops previous one): 
https://github.com/googleapis/python-storage/blob/2b449cd289cb573ca1653bba37ecb84d35a025ad/google/cloud/storage/batch.py#LL240C19-L240C19
 So we would not able to get the status of single delete requests as it is for 
now.
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to