[ https://issues.apache.org/jira/browse/MAPREDUCE-7470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17811557#comment-17811557 ]
Steve Loughran commented on MAPREDUCE-7470:
-------------------------------------------

This is a duplicate of MAPREDUCE-7465, and I don't want it in for the same reasons: scale and correctness. Read https://github.com/steveloughran/zero-rename-committer for background.

h3. V1

* Relies on atomic directory rename for task commit.
* Slow and broken task commit on S3: O(data).
* O(files) and broken task commit on Google Cloud Storage.
* Awfully slow job commit (S3 again: O(data)).

h3. V2

Broken everywhere: unable to cope with a failure of task commit partway through its non-atomic commit process.

S3A ships with a zero-rename committer which uploads directly to the destination but doesn't complete the uploads until job commit (an O(files/threads) job commit). We didn't implement this directly in FileOutputCommitter for fear of breaking things. That's a critical piece of code: it has two different intermingled algorithms and is about the only place I've ever come across co-recursion in production. It's complex and brittle. Instead we added the ability to define a committer factory for different filesystems.

h3. Manifest Committer

MAPREDUCE-7341 added an "intermediate manifest committer", based on:
* the s3a committer experience
* benchmarking abfs, including how it fails under load
* the need to make Google Cloud Storage jobs safe

Task commit: treewalk the task attempt dir and save a manifest of files to rename and directories to create. This moves a lot of the IO to task commit, which really matters for abfs, and is safe on GCS.

Job commit: parallel load of the manifests, a directory creation phase, then parallel rename.
* fast parallel load of manifests
* parallel rename: O(files/threads)
* uses etag tracking for recovery from throttle-initiated rename failures on abfs (which is why multithreaded job commit doesn't work reliably there: the committers can do so much IO that throttling kicks in). It also lets you rate-limit.
* collects and saves statistics to the JSON file; same format as the s3a committers

This is in Hadoop 3.3.5+.
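The job-commit phase described above (parallel manifest load, directory creation, then O(files/threads) parallel rename) can be sketched roughly as below. This is a minimal illustration only: the class, record, and method names are hypothetical, not the actual ManifestCommitter API in MAPREDUCE-7341, and it uses local-filesystem renames in place of store-specific rename calls.

```java
import java.nio.file.*;
import java.util.*;
import java.util.concurrent.*;

// Illustrative sketch of a manifest-style job commit: each task's manifest
// lists (source, dest) pairs; job commit creates the destination directories,
// then renames every file in parallel via a thread pool.
public class ManifestJobCommitSketch {
    // Hypothetical manifest entry: a file in the task attempt dir and its
    // final destination under the job output directory.
    record RenameEntry(Path source, Path dest) {}

    public static void commitJob(List<List<RenameEntry>> manifests, int threads)
            throws java.io.IOException, InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Phase 1: directory creation (idempotent, so done serially here).
            for (List<RenameEntry> manifest : manifests)
                for (RenameEntry e : manifest)
                    Files.createDirectories(e.dest().getParent());

            // Phase 2: rename every file in parallel -> O(files / threads).
            List<Future<Object>> futures = new ArrayList<>();
            for (List<RenameEntry> manifest : manifests)
                for (RenameEntry e : manifest)
                    futures.add(pool.submit(() -> {
                        Files.move(e.source(), e.dest(),
                                StandardCopyOption.ATOMIC_MOVE);
                        return null;
                    }));
            for (Future<Object> f : futures) f.get(); // surface any rename failure
        } finally {
            pool.shutdown();
        }
    }
}
```

The real committer additionally tracks etags so a rename interrupted by abfs throttling can be retried safely, and rate-limits the rename pool; neither is shown here.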
Yes, it's more complex than parallel threads, but we wrote it *because we tried the parallel design and it didn't work*.

FWIW, here's our PR for parallel renames: https://github.com/apache/hadoop/pull/6399. Arnaud's is pretty similar: https://github.com/apache/hadoop/pull/6378. Both of these (like yours) will overload Azure storage when big jobs, or multiple jobs in the same storage account, are committing; and none of these PRs is safe for GCS. I don't know about Aliyun's correctness here.

Can you try the manifest committer, see how it goes, and report any issues? Sharing the _SUCCESS file would be nice too.

> multi-thread mapreduce committer
> --------------------------------
>
>                 Key: MAPREDUCE-7470
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7470
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: mrv2
>            Reporter: TianyiMa
>            Priority: Major
>              Labels: mapreduce, pull-request-available
>         Attachments: MAPREDUCE-7470.0.patch
>
> In a cloud environment such as AWS, Aliyun etc., the network delay is non-trivial when we commit thousands of files.
> In our situation, the ping delay is about 0.03ms in the IDC, but after moving to the cloud the ping delay is about 3ms, which is roughly 100x slower. We found that committing tens of thousands of files can cost a few tens of minutes; the more files there are, the longer it takes.
> So we propose a new committer algorithm, a variant of committer algorithm version 1, called version 3. In this new algorithm 3, in order to decrease the commit time, we use a thread pool to commit the job's final output.
> Our tests in cloud production show that the new algorithm 3 decreases the commit time by several tens of times.
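For comparison, the proposed "algorithm 3" is essentially v1's job-commit merge of the committed output tree into the destination, driven by a thread pool. A rough sketch of that idea follows; it is hypothetical and not the attached MAPREDUCE-7470.0.patch, and (as the comment above explains) unbounded parallel renames like this can overload object stores and are unsafe on GCS, where directory rename is non-atomic.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.List;
import java.util.concurrent.*;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Rough sketch of a "v1 plus thread pool" job commit: walk the committed
// job attempt directory and submit each per-file move into the final
// destination to a fixed-size pool. Illustrative only.
public class ParallelV1CommitSketch {
    public static void mergePaths(Path jobAttemptDir, Path destDir, int threads)
            throws IOException, InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try (Stream<Path> files = Files.walk(jobAttemptDir)) {
            List<Future<Object>> futures = files
                .filter(Files::isRegularFile)
                .map(src -> {
                    // Preserve the file's path relative to the attempt dir.
                    Path dest = destDir.resolve(jobAttemptDir.relativize(src));
                    return pool.submit(() -> {
                        Files.createDirectories(dest.getParent());
                        Files.move(src, dest, StandardCopyOption.REPLACE_EXISTING);
                        return (Object) null;
                    });
                })
                .collect(Collectors.toList());
            for (Future<Object> f : futures) f.get(); // propagate failures
        } finally {
            pool.shutdown();
        }
    }
}
```

Against a POSIX filesystem this works and is faster than a serial loop; the objections above are about object stores, where each "rename" is a copy-plus-delete (S3) or a throttleable metadata operation (abfs), and a task-commit failure mid-merge leaves a partially committed destination.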