cnauroth commented on code in PR #7693: URL: https://github.com/apache/hadoop/pull/7693#discussion_r2159520024
########## hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/committers.md:
##########
@@ -1058,3 +1059,20 @@ one of the following conditions are met
 1. The committer is being used in spark, and the version of spark being used
    does not set the `spark.sql.sources.writeJobUUID` property.
    Either upgrade to a new spark release, or set `fs.s3a.committer.generate.uuid` to true.
+
+### Long Job Completion Time Due to Magic Committer Cleanup
+When using the S3A Magic Committer in large Spark or MapReduce jobs, job completion can be significantly delayed
+due to the cleanup of temporary files (such as those under the `__magic` directory).
+This happens because deleting many small files in S3 is a slow and expensive operation, especially at scale.
+In some cases, the cleanup phase alone can take several minutes or more, even after all data has already been written.
+
+To reduce this overhead, Hadoop 3.4.1+ introduced a configuration option in

Review Comment:
   I'm going to commit this with a small change here, "3.4.2+", as we're currently making release candidates for 3.4.2.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
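To make the cleanup cost described in the quoted doc concrete, here is a rough back-of-envelope sketch. The file count, per-request latency, and the use of bulk deletes are illustrative assumptions, not measured S3 figures; S3's bulk-delete API accepts at most 1000 keys per request, which bounds the request count.

```python
import math

# Illustrative assumptions (not measured values):
files = 1_000_000    # assumed number of temporary files under __magic
batch_size = 1_000   # S3 bulk delete: up to 1000 keys per request
latency_s = 0.2      # assumed average latency per bulk-delete request

# Each request removes at most batch_size keys, so round up.
requests = math.ceil(files / batch_size)
total_minutes = requests * latency_s / 60

print(f"{requests} delete requests, ~{total_minutes:.1f} minutes")
# → 1000 delete requests, ~3.3 minutes
```

Even with bulk deletes, a million temporary objects costs on the order of a thousand sequential requests, which is why skipping or reducing this cleanup phase matters for large jobs.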