amogh-jahagirdar commented on a change in pull request #4052:
URL: https://github.com/apache/iceberg/pull/4052#discussion_r817166562
##########
File path: aws/src/main/java/org/apache/iceberg/aws/s3/S3FileIO.java
##########
@@ -100,6 +115,67 @@ public void deleteFile(String path) {
client().deleteObject(deleteRequest);
}
+  /**
+   * Deletes the given paths in a batched manner.
+   * <p>
+   * The paths are grouped by bucket, and deletion is triggered when we either reach the configured batch size
+   * or have a final remainder batch for each bucket.
+   *
+   * @param paths paths to delete
+   */
+  @Override
+  public void deleteFiles(Iterable<String> paths) {
+    SetMultimap<String, String> bucketToObjects = Multimaps.newSetMultimap(Maps.newHashMap(), Sets::newHashSet);
+    List<String> failedDeletions = Lists.newArrayList();
+    for (String path : paths) {
+      S3URI location = new S3URI(path);
+      String bucket = location.bucket();
+      String objectKey = location.key();
+      Set<String> objectsInBucket = bucketToObjects.get(bucket);
+      if (objectsInBucket.size() == awsProperties.s3FileIoDeleteBatchSize()) {
+        List<String> failedDeletionsForBatch = deleteObjectsInBucket(bucket, objectsInBucket);
Review comment:
Actually, I realized the current default implementation would be inconsistent with what I just mentioned. In the default, we simply loop over the files and delete them, so failures are not collected and surfaced at the end; if a delete fails, the failure is surfaced immediately. I'm still leaning towards the deleteFiles semantic being a best-effort deletion attempt on all files in the list (surfacing failures at the end), so I'm also leaning towards changing the default implementation.
Let me know your thoughts if you agree on this semantic for deleteFiles
@rdblue @jackye1995 @danielcweeks
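
For illustration, here's a rough sketch of the best-effort semantic I have in mind for the default deleteFiles. This is only a sketch: the interface name, the plain RuntimeException, and the error message are placeholders for this example, not the actual FileIO code.

import java.util.ArrayList;
import java.util.List;

// Sketch of a best-effort bulk delete default (placeholder names, not the real API).
public interface BulkDeleteSketch {
  void deleteFile(String path);

  default void deleteFiles(Iterable<String> paths) {
    List<String> failedDeletions = new ArrayList<>();
    for (String path : paths) {
      try {
        // attempt every path, even if earlier deletes failed
        deleteFile(path);
      } catch (RuntimeException e) {
        // collect the failure instead of surfacing it immediately
        failedDeletions.add(path);
      }
    }
    if (!failedDeletions.isEmpty()) {
      // surface all failures at the end, once every path has been attempted
      throw new RuntimeException("Failed to delete paths: " + failedDeletions);
    }
  }
}

The key design point is that every path gets a deletion attempt regardless of earlier failures, and the caller sees a single aggregated error at the end rather than failing fast on the first problem.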
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]