[
https://issues.apache.org/jira/browse/HADOOP-13169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rajesh Balamohan updated HADOOP-13169:
--------------------------------------
Attachment: HADOOP-13169-branch-2-004.patch
I have run distcp against S3. Results show a good improvement with the patch
(overall runtime dropped by more than half, i.e. a >2x speedup, in the distcp
run I tried out). An example result is given below.
{noformat}
tools.SimpleCopyListing: Paths (files+dirs) cnt = 24277; dirCnt = 12031
Without patch:
===========
time hadoop distcp -Ddistcp.simplelisting.file.status.size=50000
-Ddistcp.liststatus.threads=15 s3a://<bucket>/tpcds_bin_partitioned_orc_200.db
s3a://<bucket>/distcp/12/
real 73m32.806s
user 0m59.452s
sys 0m3.904s
With patch:
===========
time hadoop distcp -Ddistcp.simplelisting.file.status.size=50000
-Ddistcp.liststatus.threads=15 s3a://<bucket>/tpcds_bin_partitioned_orc_200.db
s3a://<bucket>/distcp/15/
real 33m18.606s
user 0m53.720s
sys 0m3.320s
{noformat}
From a unit-test perspective, {{TestS3AContractDistCp}} passed against S3; it is
not executed as part of the normal test suite. There were errors in
{{TestS3AContractRootDir}}, but those are unrelated to this change.
Regarding the concurrency concern, {{statusList}} would not be modified by the
worker threads in {{writeToFileListing}}. However, I wrapped it in a synchronized
block in the latest patch so that it remains safe against any changes made later.
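For reference, a minimal sketch of that kind of guard (hypothetical class and
method names, not the exact patch code), assuming a shared list that listing
workers append to and the writer drains:
{code:java}
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;

class ListingBuffer {
  // Shared buffer appended to by listing worker threads.
  private final List<FileStatus> statusList = new ArrayList<>();

  void add(FileStatus status) {
    // Guard mutation so concurrent workers cannot corrupt the list.
    synchronized (statusList) {
      statusList.add(status);
    }
  }

  List<FileStatus> drain() {
    // Copy-and-clear under the same lock so the writer sees a stable snapshot.
    synchronized (statusList) {
      List<FileStatus> snapshot = new ArrayList<>(statusList);
      statusList.clear();
      return snapshot;
    }
  }
}
{code}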
Also, I noticed that distcp does not have documentation for some of the
parameters listed in {{DistCpConstants}}, e.g. {{distcp.work.path}},
{{distcp.log.path}}, {{distcp.listing.file.path}}, etc. I will create a
separate ticket to address the documentation fixes.
> Randomize file list in SimpleCopyListing
> ----------------------------------------
>
> Key: HADOOP-13169
> URL: https://issues.apache.org/jira/browse/HADOOP-13169
> Project: Hadoop Common
> Issue Type: Improvement
> Components: tools/distcp
> Reporter: Rajesh Balamohan
> Priority: Minor
> Attachments: HADOOP-13169-branch-2-001.patch,
> HADOOP-13169-branch-2-002.patch, HADOOP-13169-branch-2-003.patch,
> HADOOP-13169-branch-2-004.patch
>
>
> When copying files to S3, depending on the file listing, some mappers can run
> into S3 partition hotspots. This is more visible when data is copied from a Hive
> warehouse with lots of partitions (e.g. date partitions). In such cases, some
> of the tasks tend to be a lot slower than others. It would be good to randomize
> the file paths which are written out in SimpleCopyListing to avoid this issue.
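To illustrate the core idea (a sketch only, not the actual patch; the class name
and seed handling are assumptions), the collected listing can be shuffled before
it is written out, so that lexicographically adjacent S3 keys such as consecutive
date partitions end up spread across mappers:
{code:java}
import java.util.Collections;
import java.util.List;
import java.util.Random;

import org.apache.hadoop.fs.FileStatus;

class ListingRandomizer {
  // Shuffle the collected file statuses so that adjacent paths
  // (e.g. consecutive date partitions) do not all land on the same mapper
  // and hit the same S3 key-prefix partition.
  static void randomize(List<FileStatus> statuses, long seed) {
    Collections.shuffle(statuses, new Random(seed));
  }
}
{code}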