[ https://issues.apache.org/jira/browse/MAPREDUCE-7470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17811557#comment-17811557 ]
Steve Loughran commented on MAPREDUCE-7470:
-------------------------------------------

This is a duplicate of MAPREDUCE-7465, and I don't want it in for the same reasons: scale and correctness. Read https://github.com/steveloughran/zero-rename-committer for background.

h3. V1

* Relies on atomic directory rename for task commit.
* Slow and broken task commit on S3: O(data).
* O(files) and broken task commit on Google Cloud Storage.
* Awfully slow job commit (S3 again: O(data)).

h3. V2

Broken everywhere: unable to cope with a failure of task commit partway through its non-atomic commit process.

S3A ships with a zero-rename committer which uploads directly to the destination but doesn't complete the uploads until job commit (an O(files/threads) job commit). We didn't implement this directly in FileOutputCommitter for fear of breaking things. That's a critical piece of code: it has two different intermingled algorithms and is about the only place I've ever come across co-recursion in production. It's complex and brittle. Instead we added the ability to define a committer factory for different filesystems.

h3. Manifest Committer

MAPREDUCE-7341 added an "intermediate manifest committer", based on:
* the s3a committer experience
* benchmarking abfs, including how it fails under load
* the need to make Google Cloud Storage jobs safe

Task commit: treewalk the task attempt dir and save a manifest of files to rename and directories to create. This moves a lot of the IO to task commit, which really matters for abfs, and is safe on GCS.

Job commit: parallel load of the manifests, a directory creation phase, then parallel rename.
* fast parallel load of manifests
* parallel rename: O(files/threads)
* uses etag tracking for recovery from throttle-initiated rename failures on abfs (which is why multithreaded job commit doesn't work reliably there: the committers can do so much IO that throttling kicks in). It also lets you rate-limit.
* collects and saves statistics to the JSON file; same format as the s3a committers

This is in Hadoop 3.3.5+.
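The job-commit phase described above (parallel manifest load, directory creation, then O(files/threads) parallel rename) can be sketched roughly as below. This is a minimal illustration only: the class, record, and method names are hypothetical, not the actual ManifestCommitter API in MAPREDUCE-7341, and it uses local-filesystem renames in place of store-specific rename calls.

```java
import java.nio.file.*;
import java.util.*;
import java.util.concurrent.*;

// Illustrative sketch of a manifest-style job commit: each task's manifest
// lists (source, dest) pairs; job commit creates the destination directories,
// then renames every file in parallel via a thread pool.
public class ManifestJobCommitSketch {
    // Hypothetical manifest entry: a file in the task attempt dir and its
    // final destination under the job output directory.
    record RenameEntry(Path source, Path dest) {}

    public static void commitJob(List<List<RenameEntry>> manifests, int threads)
            throws java.io.IOException, InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Phase 1: directory creation (idempotent, so done serially here).
            for (List<RenameEntry> manifest : manifests)
                for (RenameEntry e : manifest)
                    Files.createDirectories(e.dest().getParent());

            // Phase 2: rename every file in parallel -> O(files / threads).
            List<Future<Object>> futures = new ArrayList<>();
            for (List<RenameEntry> manifest : manifests)
                for (RenameEntry e : manifest)
                    futures.add(pool.submit(() -> {
                        Files.move(e.source(), e.dest(),
                                StandardCopyOption.ATOMIC_MOVE);
                        return null;
                    }));
            for (Future<Object> f : futures) f.get(); // surface any rename failure
        } finally {
            pool.shutdown();
        }
    }
}
```

The real committer additionally tracks etags so a rename interrupted by abfs throttling can be retried safely, and rate-limits the rename pool; neither is shown here.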
Yes, it's more complex than parallel threads, but we wrote it *because we tried the parallel design and it didn't work*.

FWIW, here's our PR for parallel renames: https://github.com/apache/hadoop/pull/6399. Arnaud's is pretty similar: https://github.com/apache/hadoop/pull/6378. Both of these (like yours) will overload Azure storage when big jobs, or multiple jobs in the same storage account, are committing; and none of these PRs is safe for GCS. I don't know about Aliyun's correctness here.

Can you try the manifest committer, see how it goes, and report any issues? Sharing the _SUCCESS file would be nice too.

> multi-thread mapreduce committer
> --------------------------------
>
>                 Key: MAPREDUCE-7470
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7470
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: mrv2
>            Reporter: TianyiMa
>            Priority: Major
>              Labels: mapreduce, pull-request-available
>         Attachments: MAPREDUCE-7470.0.patch
>
> In a cloud environment such as AWS, Aliyun etc., the network delay is non-trivial when we commit thousands of files.
> In our situation, the ping delay is about 0.03ms in the IDC, but after moving to the cloud the ping delay is about 3ms, which is roughly 100x slower. We found that committing tens of thousands of files can cost a few tens of minutes; the more files there are, the longer it takes.
> So we propose a new committer algorithm, a variant of committer algorithm version 1, called version 3. In this new algorithm 3, in order to decrease the commit time, we use a thread pool to commit the job's final output.
> Our tests in cloud production show that the new algorithm 3 decreases the commit time by several tens of times.
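For comparison, the proposed "algorithm 3" is essentially v1's job-commit merge of the committed output tree into the destination, driven by a thread pool. A rough sketch of that idea follows; it is hypothetical and not the attached MAPREDUCE-7470.0.patch, and (as the comment above explains) unbounded parallel renames like this can overload object stores and are unsafe on GCS, where directory rename is non-atomic.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.List;
import java.util.concurrent.*;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Rough sketch of a "v1 plus thread pool" job commit: walk the committed
// job attempt directory and submit each per-file move into the final
// destination to a fixed-size pool. Illustrative only.
public class ParallelV1CommitSketch {
    public static void mergePaths(Path jobAttemptDir, Path destDir, int threads)
            throws IOException, InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try (Stream<Path> files = Files.walk(jobAttemptDir)) {
            List<Future<Object>> futures = files
                .filter(Files::isRegularFile)
                .map(src -> {
                    // Preserve the file's path relative to the attempt dir.
                    Path dest = destDir.resolve(jobAttemptDir.relativize(src));
                    return pool.submit(() -> {
                        Files.createDirectories(dest.getParent());
                        Files.move(src, dest, StandardCopyOption.REPLACE_EXISTING);
                        return (Object) null;
                    });
                })
                .collect(Collectors.toList());
            for (Future<Object> f : futures) f.get(); // propagate failures
        } finally {
            pool.shutdown();
        }
    }
}
```

Against a POSIX filesystem this works and is faster than a serial loop; the objections above are about object stores, where each "rename" is a copy-plus-delete (S3) or a throttleable metadata operation (abfs), and a task-commit failure mid-merge leaves a partially committed destination.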