André F. created HADOOP-18568:
---------------------------------

             Summary: Magic Committer optional clean up
                 Key: HADOOP-18568
                 URL: https://issues.apache.org/jira/browse/HADOOP-18568
             Project: Hadoop Common
          Issue Type: Wish
          Components: fs/s3
    Affects Versions: 3.3.3
            Reporter: André F.
It seems that deleting the `__magic` folder can take a really long time, depending on the number of tasks/partitions used in a given Spark job. I'm seeing the following behavior on a Spark job (processing ~30TB, with ~420k tasks) using the magic committer:

{code:java}
2022-12-10T21:25:19.629Z pool-3-thread-32 INFO MagicS3GuardCommitter: Starting: Deleting magic directory s3a://my-bucket/random_hash/__magic
2022-12-10T21:52:03.250Z pool-3-thread-32 INFO MagicS3GuardCommitter: Deleting magic directory s3a://my-bucket/random_hash/__magic: duration 26:43.620s
{code}

I don't see a way around it, since deleting S3 objects requires listing every object under the prefix, and that listing is likely what takes so long. Could we somehow make this cleanup optional? The idea would be to delegate the cleanup to S3 lifecycle policies, so the commit phase doesn't carry this overhead.
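For reference, here is a rough sketch of what the lifecycle-based cleanup could look like, using the AWS SDK for Java v2 (not part of the committer itself). The bucket name and output prefix are placeholders from the log above; since lifecycle prefix filters only match from the start of the key, the rule has to target a concrete output location rather than every `__magic` directory in the bucket:

{code:java}
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.BucketLifecycleConfiguration;
import software.amazon.awssdk.services.s3.model.ExpirationStatus;
import software.amazon.awssdk.services.s3.model.LifecycleExpiration;
import software.amazon.awssdk.services.s3.model.LifecycleRule;
import software.amazon.awssdk.services.s3.model.LifecycleRuleFilter;
import software.amazon.awssdk.services.s3.model.PutBucketLifecycleConfigurationRequest;

public class MagicDirLifecycle {
  public static void main(String[] args) {
    // Placeholder bucket and prefix matching the paths in the log above.
    String bucket = "my-bucket";
    String magicPrefix = "random_hash/__magic/";

    // Expire leftover magic-committer objects a day after they were written.
    LifecycleRule rule = LifecycleRule.builder()
        .id("expire-magic-committer-leftovers")
        .filter(LifecycleRuleFilter.builder().prefix(magicPrefix).build())
        .expiration(LifecycleExpiration.builder().days(1).build())
        .status(ExpirationStatus.ENABLED)
        .build();

    try (S3Client s3 = S3Client.create()) {
      // Note: this replaces the bucket's whole lifecycle configuration, so any
      // existing rules would need to be merged in rather than overwritten.
      s3.putBucketLifecycleConfiguration(PutBucketLifecycleConfigurationRequest.builder()
          .bucket(bucket)
          .lifecycleConfiguration(BucketLifecycleConfiguration.builder().rules(rule).build())
          .build());
    }
  }
}
{code}

A companion rule with `abortIncompleteMultipartUpload` could also cover any pending multipart uploads left behind by tasks that never committed.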