[jira] [Commented] (MAPREDUCE-7470) multi-thread mapreduce committer

ASF GitHub Bot (Jira) Fri, 19 Jan 2024 06:48:05 -0800


    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-7470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17808692#comment-17808692
 ]


ASF GitHub Bot commented on MAPREDUCE-7470:
-------------------------------------------

lastbus opened a new pull request, #6469:
URL: https://github.com/apache/hadoop/pull/6469

   
   ### Description of PR
   In cloud environments such as AWS and Aliyun, network latency is 
non-trivial when we commit thousands of files.
   
   In our case, the ping latency is about 0.03 ms in our IDC, but after moving 
to the cloud it is about 3 ms, which is roughly 100x higher. We found that 
committing tens of thousands of files takes a few tens of minutes. The more 
files there are, the longer it takes.
   
   So we propose a new committer algorithm, version 3, which is a variant of 
committer algorithm version 1. To reduce the commit time, algorithm 3 uses a 
thread pool to commit the job's final output.
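
   To illustrate the idea, here is a minimal sketch of committing files through a thread pool. It is not the code from this patch: it uses plain `java.nio.file` moves on a local filesystem as a stand-in for the renames a committer issues, and the names `ParallelCommitSketch` and `commitAll` are hypothetical. The point is only that each per-file operation pays one round trip, so issuing them concurrently hides most of the latency.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration only; not part of Hadoop or of this pull request.
public class ParallelCommitSketch {

    /**
     * Moves every regular file in pendingDir into finalDir, one task per file,
     * using a fixed-size thread pool. Returns the number of files moved.
     */
    static int commitAll(Path pendingDir, Path finalDir, int threads)
            throws IOException, InterruptedException, ExecutionException {
        Files.createDirectories(finalDir);

        // Collect the files to commit before starting any moves.
        List<Path> files;
        try (Stream<Path> listing = Files.list(pendingDir)) {
            files = listing.filter(Files::isRegularFile).collect(Collectors.toList());
        }

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Path>> futures = new ArrayList<>();
            for (Path src : files) {
                Callable<Path> task = () -> Files.move(src, finalDir.resolve(src.getFileName()));
                futures.add(pool.submit(task));
            }
            // Wait for every move; get() rethrows the first failure it encounters.
            for (Future<Path> f : futures) {
                f.get();
            }
            return futures.size();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        Path pending = Files.createTempDirectory("pending");
        Path out = Files.createTempDirectory("final");
        for (int i = 0; i < 100; i++) {
            Files.write(pending.resolve("part-" + i), new byte[0]);
        }
        int moved = commitAll(pending, out, 8);
        System.out.println("committed " + moved + " files");
    }
}
```

   A real committer would also need to decide what happens when some moves succeed and others fail, which this sketch does not address.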
   
   Our test results in cloud production show that the new algorithm 3 
reduces the commit time by a factor of several tens.
   
   ### How was this patch tested?
   
   
   ### For code changes:
   
   - [ ] Does the title of this PR start with the corresponding JIRA issue id 
(e.g. 'HADOOP-17799. Your PR title ...')?
   - [ ] Object storage: have the integration tests been executed and the 
endpoint declared according to the connector-specific documentation?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, 
`NOTICE-binary` files?
   
   




> multi-thread mapreduce committer
> --------------------------------
>
>                 Key: MAPREDUCE-7470
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-7470
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: mrv2
>            Reporter: TianyiMa
>            Priority: Major
>              Labels: mapreduce
>         Attachments: MAPREDUCE-7470.0.patch
>
>
> In cloud environments such as AWS and Aliyun, network latency is 
> non-trivial when we commit thousands of files.
> In our case, the ping latency is about 0.03 ms in our IDC, but after moving 
> to the cloud it is about 3 ms, which is roughly 100x higher. We found that 
> committing tens of thousands of files takes a few tens of minutes. The more 
> files there are, the longer it takes.
> So we propose a new committer algorithm, version 3, which is a variant of 
> committer algorithm version 1. To reduce the commit time, algorithm 3 uses a 
> thread pool to commit the job's final output.
> Our test results in cloud production show that the new algorithm 3 
> reduces the commit time by a factor of several tens.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
