Zoran Dimitrijevic created HADOOP-11785:
-------------------------------------------
Summary: Reduce number of listStatus operation in distcp buildListing()
Key: HADOOP-11785
URL: https://issues.apache.org/jira/browse/HADOOP-11785
Project: Hadoop Common
Issue Type: Improvement
Components: tools/distcp
Affects Versions: 3.0.0
Reporter: Zoran Dimitrijevic
Assignee: Zoran Dimitrijevic
Priority: Minor
Fix For: 3.0.0
Attachments: distcp-liststatus.patch
Distcp was taking a long time in copyListing.buildListing() for large source
trees (my source had 1.5M files in a tree of about 50K directories).
For input on S3, buildListing() was taking more than one hour. I noticed a
performance bug in the current code: it calls listStatus twice for each
directory, which doubles the number of RPCs in some cases (if most
directories do not contain >1000 files).
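
To illustrate the idea of listing each directory only once, here is a minimal,
self-contained sketch. It is not the attached patch and does not use the real
Hadoop FileSystem API: the Lister interface, the ListingSketch class and the
convention that directory names end in "/" are all assumptions made for this
example. Each call to list() stands in for one listStatus RPC, and the result
is reused both to record the entries and to find subdirectories to descend
into.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class ListingSketch {
    /**
     * Hypothetical stand-in for a file system. One call to list() models
     * one listStatus RPC. Names ending in "/" are treated as directories.
     */
    interface Lister {
        List<String> list(String dir);
    }

    /**
     * Walks the tree under root and returns every entry found.
     * Each directory is listed exactly once; the same result is used
     * to record the entries and to queue subdirectories for traversal.
     */
    static List<String> buildListing(Lister fs, String root) {
        List<String> listing = new ArrayList<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.push(root);
        while (!pending.isEmpty()) {
            String dir = pending.pop();
            // The only list call issued for this directory.
            List<String> children = fs.list(dir);
            for (String child : children) {
                listing.add(child);
                if (child.endsWith("/")) {
                    // Defer the child's listing until it is popped,
                    // rather than listing it here and again later.
                    pending.push(child);
                }
            }
        }
        return listing;
    }
}
```

With this shape, the number of list calls equals the number of directories
visited, so for a tree of about 50K directories the walk would issue about
50K calls rather than twice that.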