[ https://issues.apache.org/jira/browse/HADOOP-18568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17952468#comment-17952468 ]
ASF GitHub Bot commented on HADOOP-18568:
-----------------------------------------

blcksrx opened a new pull request, #7693:
URL: https://github.com/apache/hadoop/pull/7693

### Description of PR

Makes cleanup in the magic committer optional: the user decides whether the committer deletes the `__magic` path after job commit, which reduces cleanup overhead.

### How was this patch tested?

Tested by checking for the existence of the job commit path under each setting of the option:
- cleanup enabled -> the commit path does not exist after job commit
- cleanup disabled -> the commit path still exists after job commit

### For code changes:

- [x] Does the title or this PR starts with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
- [x] Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
- [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, `NOTICE-binary` files?

> Magic Committer optional clean up
> ----------------------------------
>
> Key: HADOOP-18568
> URL: https://issues.apache.org/jira/browse/HADOOP-18568
> Project: Hadoop Common
> Issue Type: Wish
> Components: fs/s3
> Affects Versions: 3.3.3
> Reporter: André F.
> Priority: Minor
>
> It seems that deleting the `__magic` folder, depending on the number of
> tasks/partitions used on a given Spark job, can take a really long time.
> I'm seeing the following behavior on a Spark job (processing ~30 TB, with
> ~420k tasks) using the magic committer:
> {code:java}
> 2022-12-10T21:25:19.629Z pool-3-thread-32 INFO MagicS3GuardCommitter:
> Starting: Deleting magic directory s3a://my-bucket/random_hash/__magic
> 2022-12-10T21:52:03.250Z pool-3-thread-32 INFO MagicS3GuardCommitter:
> Deleting magic directory s3a://my-bucket/random_hash/__magic: duration
> 26:43.620s
> {code}
> I don't see a way around it, since deleting the S3 objects requires listing
> all objects under the prefix, and that listing is likely what takes so long.
> Could we somehow make this cleanup optional? (The idea would be to delegate
> it to S3 lifecycle policies so that this overhead is not incurred during the
> commit phase.)

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
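The on/off switch the PR describes would live in the Hadoop configuration like other S3A committer options. A minimal sketch, assuming a hypothetical property name `fs.s3a.committer.magic.cleanup.enabled` (the actual key and default are whatever the PR defines):

```xml
<!-- Hypothetical property name for illustration; check the PR for the real key. -->
<property>
  <name>fs.s3a.committer.magic.cleanup.enabled</name>
  <value>false</value>
  <description>
    When false, the magic committer skips deleting the __magic directory
    during job commit, leaving cleanup to an external mechanism such as
    an S3 lifecycle rule.
  </description>
</property>
```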
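The lifecycle-policy delegation suggested in the issue could look like the rule below, which expires leftover staging objects a day after creation and also aborts their incomplete multipart uploads. Note that S3 lifecycle prefixes match from the start of the object key, so the rule must name the actual job output prefix; `random_hash/__magic/` here is purely illustrative, taken from the log sample above:

```json
{
  "Rules": [
    {
      "ID": "expire-magic-committer-staging",
      "Status": "Enabled",
      "Filter": { "Prefix": "random_hash/__magic/" },
      "Expiration": { "Days": 1 },
      "AbortIncompleteMultipartUpload": { "DaysAfterInitiation": 1 }
    }
  ]
}
```

The multipart-abort clause matters because the magic committer stages data as pending multipart uploads; expiring only the visible marker objects would still leave billable upload parts behind.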