[ 
https://issues.apache.org/jira/browse/HADOOP-14086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15886349#comment-15886349
 ] 

Steve Loughran commented on HADOOP-14086:
-----------------------------------------

Nothing yet; I'm just scared about what could be done.

If you look at HADOOP-11694 you can see what will be coming your way. The big 
one is HADOOP-13208, as it can go from treewalking a mocked directory tree to 
direct object store API calls.

If you can use the listFiles calls here, then again: significant speedup, 
especially at scale.
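
To make the treewalk-versus-flat-listing point concrete, here is a minimal sketch. It assumes a toy in-memory store: {{ListingSketch}}, {{listChildren}}, {{listFlat}} and {{treeWalk}} are hypothetical names invented for illustration, not Hadoop or S3 APIs, and each method call stands in for one round trip to the store.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Hypothetical in-memory stand-in for an object store; not a Hadoop or S3 API. */
class ListingSketch {
    private final TreeSet<String> keys = new TreeSet<>();
    int listRequests = 0; // counts simulated round trips to the store

    ListingSketch(List<String> objectKeys) {
        keys.addAll(objectKeys);
    }

    /** One simulated request: immediate children of a "directory" prefix ending in '/'. */
    List<String> listChildren(String dir) {
        listRequests++;
        TreeSet<String> children = new TreeSet<>();
        for (String k : keys.tailSet(dir)) {
            if (!k.startsWith(dir)) break;       // left the prefix range
            String rest = k.substring(dir.length());
            if (rest.isEmpty()) continue;        // skip a marker key equal to the prefix
            int slash = rest.indexOf('/');
            // No slash: a file. Otherwise: the mocked subdirectory name.
            children.add(slash < 0 ? k : dir + rest.substring(0, slash + 1));
        }
        return new ArrayList<>(children);
    }

    /** One simulated request: every key under the prefix, however deep.
     *  (A real store would page large results; that is ignored here.) */
    List<String> listFlat(String prefix) {
        listRequests++;
        List<String> out = new ArrayList<>();
        for (String k : keys.tailSet(prefix)) {
            if (!k.startsWith(prefix)) break;
            out.add(k);
        }
        return out;
    }

    /** Treewalk: one request per mocked directory visited. */
    List<String> treeWalk(String dir) {
        List<String> files = new ArrayList<>();
        for (String child : listChildren(dir)) {
            if (child.endsWith("/")) files.addAll(treeWalk(child));
            else files.add(child);
        }
        return files;
    }
}
```

In this toy model the treewalk costs one request per directory while the flat listing costs one per prefix, so the gap grows with the number of directories rather than the number of files.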

That listFiles call also returns a remote iterator of {{LocatedFileStatus}} 
instances; it is up to the implementation to see if it can optimise that. 
Maybe HDFS could do some stuff here too, e.g. async refresh of the next batch 
of entries while the first lot is being copied.
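
The "fetch the next batch while the first lot is being copied" idea could look roughly like the sketch below. It is only an illustration under assumptions: {{PrefetchingLister}} and its page-numbered {{fetchPage}} function are hypothetical, it uses plain {{java.util.Iterator}} rather than Hadoop's remote iterator, and it treats an empty batch as the end of the listing.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.concurrent.CompletableFuture;
import java.util.function.IntFunction;

/**
 * Hypothetical prefetching iterator: batches are fetched by page number,
 * and an empty batch marks the end of the listing.
 */
class PrefetchingLister<T> implements Iterator<T> {
    private final IntFunction<List<T>> fetchPage; // assumed blocking call to the store
    private Iterator<T> current;
    private CompletableFuture<List<T>> next;      // null once the listing is exhausted
    private int nextPage = 0;

    PrefetchingLister(IntFunction<List<T>> fetchPage) {
        this.fetchPage = fetchPage;
        List<T> first = fetchPage.apply(nextPage++);      // first page: fetched synchronously
        this.current = first.iterator();
        this.next = first.isEmpty() ? null : fetchAsync(); // second page: in the background
    }

    private CompletableFuture<List<T>> fetchAsync() {
        final int page = nextPage++;
        return CompletableFuture.supplyAsync(() -> fetchPage.apply(page));
    }

    @Override
    public boolean hasNext() {
        while (!current.hasNext() && next != null) {
            List<T> batch = next.join(); // blocks only if the prefetch has not finished
            if (batch.isEmpty()) {
                next = null;             // end of listing
            } else {
                current = batch.iterator();
                next = fetchAsync();     // start the following page right away
            }
        }
        return current.hasNext();
    }

    @Override
    public T next() {
        if (!hasNext()) throw new NoSuchElementException();
        return current.next();
    }
}
```

The caller just iterates; while it works through one batch, the request for the following one is already in flight, so listing latency is hidden behind the copy work whenever the copy is the slower of the two.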

Note also HADOOP-13169: randomizing the file listing to spread load across 
shards in S3, boosting both read and write performance.
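
As a rough illustration of the randomizing idea: a listing normally comes back in sorted order, so consecutive copy tasks tend to share a key prefix and land on the same shard, and shuffling the list before handing out work breaks up those runs. The helper below is a hypothetical sketch, not the HADOOP-13169 patch; the {{seed}} parameter is an assumption added only to make runs repeatable.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical helper; not taken from the DistCp code. */
class ShuffledListing {
    /** Returns a shuffled copy; the input stays in its original listing order. */
    static List<String> shuffled(List<String> listing, long seed) {
        List<String> copy = new ArrayList<>(listing);
        Collections.shuffle(copy, new Random(seed)); // seed only for repeatable runs
        return copy;
    }
}
```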

> Improve DistCp Speed for small files
> ------------------------------------
>
>                 Key: HADOOP-14086
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14086
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: tools/distcp
>    Affects Versions: 2.6.5
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>            Priority: Minor
>
> When using distcp to copy lots of small files,  NameNode naturally becomes a 
> bottleneck.
> The current distcp code did *not* optimize to reduce the NameNode calls.  We 
> should restructure the code to reduce the number of NameNode calls as much as 
> possible to speed up the copy of small files.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
