jackye1995 commented on a change in pull request #4052:
URL: https://github.com/apache/iceberg/pull/4052#discussion_r800906636



##########
File path: aws/src/main/java/org/apache/iceberg/aws/s3/S3FileIO.java
##########
@@ -94,6 +106,53 @@ public void deleteFile(String path) {
     client().deleteObject(deleteRequest);
   }
 
+  /**
+   * Performs a batch deletion of the given paths. This implementation does 
not require
+   * that the passed in paths refer to objects which are in the same bucket.
+   * @param paths poths to delete
+   */
+  @Override
+  public void deleteFiles(Iterable<String> paths) {
+    SetMultimap<String, String> bucketToObjects = 
Multimaps.newSetMultimap(Maps.newHashMap(), Sets::newHashSet);
+    for (String path : paths) {
+      S3URI location = new S3URI(path);
+      String bucket = location.bucket();
+      String objectKey = location.key();
+      Set<String> objectsInBucket = bucketToObjects.get(bucket);
+      if (objectsInBucket.size() == awsProperties.s3DeletionBatchSize()) {
+        deleteObjectsInBucket(bucket, objectsInBucket);
+        bucketToObjects.removeAll(bucket);
+      }
+      bucketToObjects.get(bucket).add(objectKey);
+    }
+    // Delete the remainder
+    bucketToObjects
+        .asMap()
+        .forEach((bucket, objectsToDelete) -> deleteObjectsInBucket(bucket, 
objectsToDelete));
+  }
+
+  private void deleteObjectsInBucket(String bucket, Collection<String> 
objects) {

Review comment:
       wonder if we should return exceptions and then throw as a whole in the 
parent method

##########
File path: aws/src/main/java/org/apache/iceberg/aws/AwsProperties.java
##########
@@ -213,13 +213,20 @@
    * Enables eTag checks for S3 PUT and MULTIPART upload requests.
    */
   public static final String S3_CHECKSUM_ENABLED = "s3.checksum-enabled";
+
+  /**
+   * Sets the size for batch deletion.
+   */

Review comment:
       I think we also need a config for this

##########
File path: aws/src/main/java/org/apache/iceberg/aws/AwsProperties.java
##########
@@ -213,13 +213,20 @@
    * Enables eTag checks for S3 PUT and MULTIPART upload requests.
    */
   public static final String S3_CHECKSUM_ENABLED = "s3.checksum-enabled";
+
+  /**
+   * Sets the size for batch deletion.
+   */
+  public static final int S3_DELETION_BATCH_SIZE_DEFAULT = 10000;

Review comment:
       1000 is the upper limit, we need to be cautious for setting that as 
default... because the request payload size might be too large, especially 
given the fact that S3 file path generated by Iceberg is not short. We need 
some tests for this to see if 1000 is actually a good default.

##########
File path: api/src/main/java/org/apache/iceberg/io/FileIO.java
##########
@@ -68,6 +68,15 @@ default void deleteFile(OutputFile file) {
   default void initialize(Map<String, String> properties) {
   }
 
+  /**
+   * Delete the files at the given path.
+   */
+  default void deleteFiles(Iterable<String> pathsToDelete) {

Review comment:
       Would S3 API fail us if there is any duplicated path in the input 
request? Have you tried out?

##########
File path: aws/src/main/java/org/apache/iceberg/aws/AwsProperties.java
##########
@@ -213,13 +213,20 @@
    * Enables eTag checks for S3 PUT and MULTIPART upload requests.
    */
   public static final String S3_CHECKSUM_ENABLED = "s3.checksum-enabled";
+
+  /**
+   * Sets the size for batch deletion.
+   */
+  public static final int S3_DELETION_BATCH_SIZE_DEFAULT = 10000;
+
   public static final boolean CLIENT_ENABLE_ETAG_CHECK_DEFAULT = false;
 
   private String s3FileIoSseType;
   private String s3FileIoSseKey;
   private String s3FileIoSseMd5;
   private int s3FileIoMultipartUploadThreads;
   private int s3FileIoMultiPartSize;
+  private int s3DeletionBatchSize;

Review comment:
       `s3FileIoDeleteBatchSize`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to