André F. created HADOOP-18568:
---------------------------------

             Summary: Magic Committer optional clean up 
                 Key: HADOOP-18568
                 URL: https://issues.apache.org/jira/browse/HADOOP-18568
             Project: Hadoop Common
          Issue Type: Wish
          Components: fs/s3
    Affects Versions: 3.3.3
            Reporter: André F.


Deleting the `__magic` folder can take a very long time, depending on the 
number of tasks/partitions in a given Spark job. I'm seeing the following 
behavior on a Spark job (processing ~30TB, with ~420k tasks) using the magic 
committer:
{code:java}
2022-12-10T21:25:19.629Z pool-3-thread-32 INFO MagicS3GuardCommitter: Starting: Deleting magic directory s3a://my-bucket/random_hash/__magic
2022-12-10T21:52:03.250Z pool-3-thread-32 INFO MagicS3GuardCommitter: Deleting magic directory s3a://my-bucket/random_hash/__magic: duration 26:43.620s
{code}
I don't see a way around this, since deleting S3 objects requires listing all 
objects under the prefix, and that listing may be what takes so long. Could we 
make this cleanup optional? The idea would be to delegate it to S3 lifecycle 
policies so that it adds no overhead to the commit phase.
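
To illustrate the lifecycle idea, here is a minimal sketch of an expiration 
rule scoped to the `__magic` prefix from the log above. The helper name 
`magic_cleanup_lifecycle` and the one-day default are hypothetical choices for 
illustration, not part of the committer. S3 lifecycle filters match a literal 
key prefix only, so this assumes the job output path is known in advance.

```python
import json


def magic_cleanup_lifecycle(output_path: str, days: int = 1) -> dict:
    """Build an S3 lifecycle configuration that expires every object
    under <output_path>/__magic/ once it is `days` days old.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    if days < 1:
        # S3 lifecycle expiration is expressed in whole days.
        raise ValueError("days must be a positive integer")
    base = output_path.strip("/")
    return {
        "Rules": [
            {
                "ID": "expire-magic-" + base.replace("/", "-"),
                # Lifecycle filters match a literal key prefix, no wildcards.
                "Filter": {"Prefix": base + "/__magic/"},
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


if __name__ == "__main__":
    # Print the configuration for the output path seen in the log.
    print(json.dumps(magic_cleanup_lifecycle("random_hash"), indent=2))
```

A dict of this shape could then be passed as `LifecycleConfiguration` to 
boto3's `put_bucket_lifecycle_configuration`. That call replaces the bucket's 
existing lifecycle configuration, so any existing rules would need to be 
merged in first. Expiration runs asynchronously with day granularity, so the 
`__magic` objects would linger for a while after the job commits.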



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

