[
https://issues.apache.org/jira/browse/HADOOP-13169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rajesh Balamohan updated HADOOP-13169:
--------------------------------------
Attachment: HADOOP-13169-branch-2-004.patch
I have run distcp against S3. Results show a good improvement with the patch
(overall runtime dropped by more than half, i.e. a >2x speedup, in the distcp
run I tried out). An example result is given below.
{noformat}
tools.SimpleCopyListing: Paths (files+dirs) cnt = 24277; dirCnt = 12031
Without patch:
===========
time hadoop distcp -Ddistcp.simplelisting.file.status.size=50000
-Ddistcp.liststatus.threads=15 s3a://<bucket>/tpcds_bin_partitioned_orc_200.db
s3a://<bucket>/distcp/12/
real 73m32.806s
user 0m59.452s
sys 0m3.904s
With patch:
===========
time hadoop distcp -Ddistcp.simplelisting.file.status.size=50000
-Ddistcp.liststatus.threads=15 s3a://<bucket>/tpcds_bin_partitioned_orc_200.db
s3a://<bucket>/distcp/15/
real 33m18.606s
user 0m53.720s
sys 0m3.320s
{noformat}
From a unit-test perspective, {{TestS3AContractDistCp}} passed against S3; it is
not executed as part of the normal test suite. There were errors in
{{TestS3AContractRootDir}}, but those are unrelated to this change.
Regarding the concurrency concern, {{statusList}} would not be modified by the
worker threads in {{writeToFileListing}}. However, I wrapped it in a synchronized
block in the latest patch so that it remains safe against any changes made later.
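For reference, a minimal sketch of that kind of guard (hypothetical class and
method names, not the exact patch code), assuming a shared list that listing
workers append to and the writer drains:
{code:java}
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;

class ListingBuffer {
  // Shared buffer appended to by listing worker threads.
  private final List<FileStatus> statusList = new ArrayList<>();

  void add(FileStatus status) {
    // Guard mutation so concurrent workers cannot corrupt the list.
    synchronized (statusList) {
      statusList.add(status);
    }
  }

  List<FileStatus> drain() {
    // Copy-and-clear under the same lock so the writer sees a stable snapshot.
    synchronized (statusList) {
      List<FileStatus> snapshot = new ArrayList<>(statusList);
      statusList.clear();
      return snapshot;
    }
  }
}
{code}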
Also, I noticed that distcp does not have documentation for some of the
parameters listed in {{DistCpConstants}}, e.g. {{distcp.work.path}},
{{distcp.log.path}}, {{distcp.listing.file.path}}, etc. I will create a
separate ticket to address the documentation fixes.
> Randomize file list in SimpleCopyListing
> ----------------------------------------
>
> Key: HADOOP-13169
> URL: https://issues.apache.org/jira/browse/HADOOP-13169
> Project: Hadoop Common
> Issue Type: Improvement
> Components: tools/distcp
> Reporter: Rajesh Balamohan
> Priority: Minor
> Attachments: HADOOP-13169-branch-2-001.patch,
> HADOOP-13169-branch-2-002.patch, HADOOP-13169-branch-2-003.patch,
> HADOOP-13169-branch-2-004.patch
>
>
> When copying files to S3, depending on the file listing, some mappers can run
> into S3 partition hotspots. This is more visible when data is copied from a Hive
> warehouse with lots of partitions (e.g. date partitions). In such cases, some
> of the tasks tend to be a lot slower than others. It would be good to randomize
> the file paths which are written out in SimpleCopyListing to avoid this issue.
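To illustrate the core idea (a sketch only, not the actual patch; the class name
and seed handling are assumptions), the collected listing can be shuffled before
it is written out, so that lexicographically adjacent S3 keys such as consecutive
date partitions end up spread across mappers:
{code:java}
import java.util.Collections;
import java.util.List;
import java.util.Random;

import org.apache.hadoop.fs.FileStatus;

class ListingRandomizer {
  // Shuffle the collected file statuses so that adjacent paths
  // (e.g. consecutive date partitions) do not all land on the same mapper
  // and hit the same S3 key-prefix partition.
  static void randomize(List<FileStatus> statuses, long seed) {
    Collections.shuffle(statuses, new Random(seed));
  }
}
{code}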