[
https://issues.apache.org/jira/browse/HADOOP-13169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rajesh Balamohan updated HADOOP-13169:
--------------------------------------
Attachment: HADOOP-13169-branch-2-005.patch
Sharing HDFS-based numbers for the patch:
{noformat}
Source data size
448782412981 /apps/hive/warehouse/tpcds_bin_partitioned_orc.db
Without Patch:
==========
real 21m46.126s
user 0m42.566s
sys 0m3.282s
With Patch:
==========
real 12m23.091s
user 0m38.096s
sys 0m2.686s
{noformat}
This was on a 20-node cluster. Wall-clock time dropped from 21m46s to 12m23s
(about 43% less), so the patch shows a good improvement with HDFS as well.
With randomization, CopyMapper can get better locality when copying data
in HDFS. Without the patch, CopyMapper can end up reading data remotely when
copying files (e.g. file paths from the listing
"web_returns/wr_returned_date_sk=2452626/000135_0,
web_returns/wr_returned_date_sk=2452624/000121_0 ...").
The latest patch also makes this feature optional:
{{distcp.simplelisting.randomize.files=false}},
{{distcp.simplelisting.file.status.size=1000}}. It also includes a test case
in {{TestCopyListing}}.
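
To illustrate the idea of randomizing the listing, here is a minimal sketch in
plain Java. It is not the code from the patch. It assumes the listing is
shuffled in fixed-size chunks, which is only a guess at how a setting like
{{distcp.simplelisting.file.status.size}} might be used. The class name,
the {{shuffleInChunks}} helper, the seed parameter, and the sample paths are
all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class ChunkedShuffleSketch {

    // Hypothetical helper (not part of DistCp): shuffles the listing one
    // chunk at a time, so at most chunkSize entries are buffered and
    // reordered together. Entries never move across chunk boundaries.
    static List<String> shuffleInChunks(List<String> paths, int chunkSize, long seed) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        Random random = new Random(seed);
        List<String> out = new ArrayList<>(paths.size());
        for (int start = 0; start < paths.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, paths.size());
            // Copy the chunk so the caller's list is left untouched.
            List<String> chunk = new ArrayList<>(paths.subList(start, end));
            Collections.shuffle(chunk, random);
            out.addAll(chunk);
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up listing in partition order, as a sorted walk would produce it.
        List<String> listing = new ArrayList<>();
        for (int partition = 2452624; partition <= 2452626; partition++) {
            for (int file = 0; file < 3; file++) {
                listing.add("web_returns/wr_returned_date_sk=" + partition
                        + "/00000" + file + "_0");
            }
        }
        // With a chunk larger than the listing, the whole listing is shuffled,
        // so neighbouring entries no longer come from the same partition in order.
        for (String path : shuffleInChunks(listing, 1000, 42L)) {
            System.out.println(path);
        }
    }
}
```

Shuffling per chunk keeps memory bounded for very large listings, at the cost
of only mixing entries that fall within the same chunk.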
> Randomize file list in SimpleCopyListing
> ----------------------------------------
>
> Key: HADOOP-13169
> URL: https://issues.apache.org/jira/browse/HADOOP-13169
> Project: Hadoop Common
> Issue Type: Improvement
> Components: tools/distcp
> Reporter: Rajesh Balamohan
> Priority: Minor
> Attachments: HADOOP-13169-branch-2-001.patch,
> HADOOP-13169-branch-2-002.patch, HADOOP-13169-branch-2-003.patch,
> HADOOP-13169-branch-2-004.patch, HADOOP-13169-branch-2-005.patch
>
>
> When copying files to S3, some mappers can hit S3 partition hotspots,
> depending on the file listing. This is more visible when data is copied from
> a Hive warehouse with many partitions (e.g. date partitions). In such cases,
> some tasks tend to be a lot slower than others. It would be good to randomize
> the file paths written out in SimpleCopyListing to avoid this issue.