amogh-jahagirdar commented on a change in pull request #4052:
URL: https://github.com/apache/iceberg/pull/4052#discussion_r814142157
##########
File path: aws/src/main/java/org/apache/iceberg/aws/s3/S3FileIO.java
##########
@@ -100,6 +115,67 @@ public void deleteFile(String path) {
     client().deleteObject(deleteRequest);
   }
+  /**
+   * Deletes the given paths in a batched manner.
+   * <p>
+   * The paths are grouped by bucket, and deletion is triggered when we either reach the configured batch size
+   * or have a final remainder batch for each bucket.
+   *
+   * @param paths paths to delete
+   */
+  @Override
+  public void deleteFiles(Iterable<String> paths) {
+    SetMultimap<String, String> bucketToObjects = Multimaps.newSetMultimap(Maps.newHashMap(), Sets::newHashSet);
+    List<String> failedDeletions = Lists.newArrayList();
+    for (String path : paths) {
+      S3URI location = new S3URI(path);
+      String bucket = location.bucket();
+      String objectKey = location.key();
+      Set<String> objectsInBucket = bucketToObjects.get(bucket);
+      if (objectsInBucket.size() == awsProperties.s3FileIoDeleteBatchSize()) {
+        List<String> failedDeletionsForBatch = deleteObjectsInBucket(bucket, objectsInBucket);
Review comment:
I was thinking the retry policy would be up to whoever provides the S3 client,
since they are the one configuring it on the client. Is retrying something
within the scope of FileIO? If so, I think we could tackle it in a follow-on.
Someone could use a custom AwsClientFactory. The DefaultAwsClientFactory
creates an S3 client with the default retry policy, which retries the
failures listed in
https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/retry/PredefinedRetryPolicies.SDKDefaultRetryCondition.html.
So 5xx errors such as service unavailable would be retried, as would
throttling and clock-skew errors. Failures such as a nonexistent bucket or
unauthorized (4xx) errors would not be retried by default.
@jackye1995 @rdblue thoughts?
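To make the distinction concrete, here is a rough sketch of how I picture the default behavior. It is a simplified model, not the SDK's implementation: the class name `RetryClassifierSketch`, the method `isRetryable`, and the specific status and error code lists are assumptions chosen for illustration.

```java
import java.util.Set;

/**
 * Simplified, hypothetical model of the retry behavior described above.
 * Not the AWS SDK's implementation; the code lists are illustrative assumptions.
 */
final class RetryClassifierSketch {
  // Status codes treated as transient server-side failures (assumption).
  private static final Set<Integer> RETRYABLE_STATUS_CODES = Set.of(500, 502, 503, 504);

  // Illustrative subset of throttling error codes (assumption).
  private static final Set<String> THROTTLING_ERROR_CODES =
      Set.of("SlowDown", "Throttling", "ThrottlingException");

  // Illustrative subset of clock-skew error codes (assumption).
  private static final Set<String> CLOCK_SKEW_ERROR_CODES =
      Set.of("RequestTimeTooSkewed", "RequestExpired");

  private RetryClassifierSketch() {
  }

  /** Returns true if a failure with this status and error code would be retried in this model. */
  static boolean isRetryable(int statusCode, String errorCode) {
    if (RETRYABLE_STATUS_CODES.contains(statusCode)) {
      return true;
    }
    // Immutable sets reject null lookups, so guard before checking error codes.
    return errorCode != null
        && (THROTTLING_ERROR_CODES.contains(errorCode) || CLOCK_SKEW_ERROR_CODES.contains(errorCode));
  }
}
```

In this model a 503 or a `SlowDown` response is retried, while `NoSuchBucket` (404) and `AccessDenied` (403) fail immediately, which matches the split described above. A custom AwsClientFactory could of course supply a client with a different policy.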