[ https://issues.apache.org/jira/browse/HADOOP-18739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17721553#comment-17721553 ]

ASF GitHub Bot commented on HADOOP-18739:
-----------------------------------------

developersarm opened a new pull request, #5640:
URL: https://github.com/apache/hadoop/pull/5640

   <!--
     Thanks for sending a pull request!
       1. If this is your first time, please read our contributor guidelines: 
https://cwiki.apache.org/confluence/display/HADOOP/How+To+Contribute
       2. Make sure your PR title starts with JIRA issue id, e.g., 
'HADOOP-17799. Your PR title ...'.
   -->
   
   ### Description of PR
   While copying a folder containing large files that are split into multiple 
distcp chunks, the CopyCommitter picks up the chunks of each file and 
concatenates them sequentially, which is very slow.
   As part of this PR, we parallelize the concatenation of the distcp chunks of 
separate files using a thread pool with a default size of 10.
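
   A minimal, illustrative sketch of the approach is below (not the actual 
CopyCommitter change): the chunks of each target file are concatenated by a 
task submitted to a fixed pool of 10 threads via `FileSystem.concat`. The class 
`ParallelChunkConcat`, the method `concatAllFiles`, and the `chunksPerFile` map 
are hypothetical names used only for illustration.

   ```java
   import java.io.IOException;
   import java.util.ArrayList;
   import java.util.List;
   import java.util.Map;
   import java.util.concurrent.ExecutionException;
   import java.util.concurrent.ExecutorService;
   import java.util.concurrent.Executors;
   import java.util.concurrent.Future;

   import org.apache.hadoop.fs.FileSystem;
   import org.apache.hadoop.fs.Path;

   /**
    * Hypothetical sketch only: concatenate the distcp chunks of each target
    * file in parallel instead of one file at a time. Names and structure are
    * illustrative, not the actual CopyCommitter code.
    */
   public class ParallelChunkConcat {

     /** chunksPerFile maps each final target path to its ordered chunk paths. */
     static void concatAllFiles(FileSystem fs, Map<Path, Path[]> chunksPerFile)
         throws IOException {
       // Default pool size of 10, as described above.
       ExecutorService pool = Executors.newFixedThreadPool(10);
       List<Future<?>> futures = new ArrayList<>();
       try {
         for (Map.Entry<Path, Path[]> e : chunksPerFile.entrySet()) {
           // One task per file: its chunks are concatenated in order by
           // FileSystem#concat, while different files proceed in parallel.
           futures.add(pool.submit(() -> {
             fs.concat(e.getKey(), e.getValue());
             return null;
           }));
         }
         // Wait for every concat and surface the first failure as an IOException.
         for (Future<?> f : futures) {
           try {
             f.get();
           } catch (InterruptedException ie) {
             Thread.currentThread().interrupt();
             throw new IOException("Interrupted while concatenating chunks", ie);
           } catch (ExecutionException ee) {
             throw new IOException("Chunk concatenation failed", ee.getCause());
           }
         }
       } finally {
         pool.shutdown();
       }
     }
   }
   ```

   Note that the order of chunks within a single file is preserved, since each 
file's chunks are handled by one task; only the work across different files is 
parallelized.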
   
   ### How was this patch tested?
   * Existing unit tests, which cover both failure and success scenarios, are 
passing.
   * Tested the patch on a distributed cloud setup by running a distcp job that 
copied a 100 GB folder containing 20 files of 5 GB each from one cluster to 
another. Observed a latency improvement of about 2 minutes.
   
   ### For code changes:
   
   - [x] Does the title of this PR start with the corresponding JIRA issue id 
(e.g. 'HADOOP-17799. Your PR title ...')?
   - [ ] Object storage: have the integration tests been executed and the 
endpoint declared according to the connector-specific documentation?
   - [ ] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, 
`NOTICE-binary` files?
   
> Parallelize concatenation of distcp chunks of separate files in CopyCommitter
> -----------------------------------------------------------------------------
>
>                 Key: HADOOP-18739
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18739
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: tools/distcp
>            Reporter: Abhay Yadav
>            Priority: Trivial
>
> While copying a folder containing large files that are split into multiple 
> distcp chunks, the CopyCommitter picks up the chunks of each file and 
> concatenates them sequentially. This can be improved by parallelizing the 
> concatenation of the distcp chunks of separate files. With this improvement 
> we are able to save 2-3 minutes while copying a 100 GB folder containing 20 
> files of 5 GB each.
> Contributing a patch for this.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
