[
https://issues.apache.org/jira/browse/HADOOP-14086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15886349#comment-15886349
]
Steve Loughran commented on HADOOP-14086:
-----------------------------------------
Nothing yet; I'm just scared about what could be done.
If you look at HADOOP-11694 you can see what will be coming your way. The big
one is HADOOP-13208, as it replaces treewalking a mocked directory tree with
direct object store API calls.
If you can use the listFiles calls here then, again, there is a significant
speedup, especially at scale.
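To make the treewalk-versus-flat-listing difference concrete, here is a toy sketch. It is not S3A or DistCp code: the class {{ListingSketch}}, its methods and the page size are all hypothetical, and the "store" is just a sorted set of keys that counts how many list requests it receives.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Toy object store (hypothetical): sorted keys plus a counter of list requests. */
public class ListingSketch {
    private final TreeSet<String> keys = new TreeSet<>();
    public int listCalls = 0; // number of simulated store requests

    public void put(String key) { keys.add(key); }

    /** One request: direct children (files and "dir/" prefixes) of a prefix. */
    public List<String> listChildren(String prefix) {
        listCalls++;
        TreeSet<String> children = new TreeSet<>();
        for (String k : keys.tailSet(prefix)) {
            if (!k.startsWith(prefix)) break;   // left the prefix range
            if (k.equals(prefix)) continue;     // skip a directory-marker key
            int slash = k.indexOf('/', prefix.length());
            children.add(slash < 0 ? k : k.substring(0, slash + 1));
        }
        return new ArrayList<>(children);
    }

    /** One request: up to pageSize keys under prefix, after the given marker. */
    public List<String> listPage(String prefix, String marker, int pageSize) {
        listCalls++;
        String from = marker.isEmpty() ? prefix : marker;
        List<String> page = new ArrayList<>();
        for (String k : keys.tailSet(from, marker.isEmpty())) {
            if (!k.startsWith(prefix) || page.size() == pageSize) break;
            page.add(k);
        }
        return page;
    }

    /** Treewalk: one list request per directory visited. */
    public List<String> treeWalk(String dir) {
        List<String> files = new ArrayList<>();
        for (String child : listChildren(dir)) {
            if (child.endsWith("/")) files.addAll(treeWalk(child));
            else files.add(child);
        }
        return files;
    }

    /** Flat listing: one list request per page of keys, whatever the depth. */
    public List<String> flatList(String prefix, int pageSize) {
        List<String> files = new ArrayList<>();
        String marker = "";
        while (true) {
            List<String> page = listPage(prefix, marker, pageSize);
            files.addAll(page);
            if (page.size() < pageSize) return files; // short page: listing done
            marker = page.get(page.size() - 1);
        }
    }

    public static void main(String[] args) {
        ListingSketch store = new ListingSketch();
        for (String d : new String[] {"a", "b", "c"})
            for (int i = 1; i <= 4; i++) store.put("data/" + d + "/f" + i);
        store.treeWalk("data/");
        int walkCalls = store.listCalls;
        store.listCalls = 0;
        store.flatList("data/", 1000);
        System.out.println("treewalk calls: " + walkCalls + ", flat calls: " + store.listCalls);
    }
}
```

With three subdirectories of four files each, the treewalk issues four list requests and the flat listing one. The gap grows with the number of directories, which is why small-file trees benefit most.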
That listFiles call also returns a remote iterator of {{LocatedFileStatus()}}
instances; it is up to each implementation to see if it can optimise that.
Maybe HDFS could do some stuff here too, e.g. an async refresh of the next
batch of entries while the first lot is being copied.
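A rough sketch of that async-refresh idea, assuming a hypothetical {{fetchBatch}} function that maps a batch index to a list of entries and returns an empty list once the listing is exhausted. This is plain {{java.util}} code, not the HDFS client or Hadoop's remote iterator, and it ignores the IOException handling a real implementation would need.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.concurrent.CompletableFuture;
import java.util.function.IntFunction;

/** Iterator that fetches batch n+1 in the background while batch n is consumed. */
public class PrefetchingIterator<T> implements Iterator<T> {
    // Hypothetical source: batch index -> entries; an empty list ends the listing.
    private final IntFunction<List<T>> fetchBatch;
    private Iterator<T> current = Collections.emptyIterator();
    private CompletableFuture<List<T>> pending;
    private int nextIndex = 0;

    public PrefetchingIterator(IntFunction<List<T>> fetchBatch) {
        this.fetchBatch = fetchBatch;
        this.pending = startFetch(); // first batch is requested immediately
    }

    private CompletableFuture<List<T>> startFetch() {
        final int index = nextIndex++;
        return CompletableFuture.supplyAsync(() -> fetchBatch.apply(index));
    }

    @Override
    public boolean hasNext() {
        while (!current.hasNext() && pending != null) {
            List<T> batch = pending.join(); // waits only if the prefetch is unfinished
            if (batch.isEmpty()) {
                pending = null;             // empty batch: nothing more to fetch
            } else {
                current = batch.iterator();
                pending = startFetch();     // start the following batch right away
            }
        }
        return current.hasNext();
    }

    @Override
    public T next() {
        if (!hasNext()) throw new NoSuchElementException();
        return current.next();
    }

    public static void main(String[] args) {
        // Three batches of two names each, then an empty batch.
        IntFunction<List<String>> source = i -> i < 3
            ? Arrays.asList("f" + (2 * i), "f" + (2 * i + 1))
            : Collections.<String>emptyList();
        Iterator<String> it = new PrefetchingIterator<>(source);
        while (it.hasNext()) System.out.println(it.next());
    }
}
```

The caller sees an ordinary iterator; the only change is that the wait for the next batch overlaps with whatever work (here, the copy) is done on the current one.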
Note also HADOOP-13169: randomizing the file listing to spread load across
shards in S3, boosting both read and write performance.
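A minimal illustration of the randomization idea, not the actual HADOOP-13169 change: the class {{ShuffledListing}}, the method name and the seed parameter are made up for this sketch. A sorted listing keeps consecutive copies on neighbouring keys; shuffling the list before it is split among workers breaks up that ordering.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Hypothetical helper: shuffle a sorted copy listing before handing it out. */
public class ShuffledListing {

    /** Returns a shuffled copy of the listing; the input list is left untouched. */
    public static List<String> shuffled(List<String> sortedPaths, long seed) {
        List<String> copy = new ArrayList<>(sortedPaths);
        // A fixed seed keeps the order reproducible between runs.
        Collections.shuffle(copy, new Random(seed));
        return copy;
    }

    public static void main(String[] args) {
        List<String> listing = Arrays.asList(
            "logs/2017-01/a", "logs/2017-01/b", "logs/2017-02/a", "logs/2017-02/b");
        System.out.println(shuffled(listing, 42L));
    }
}
```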
> Improve DistCp Speed for small files
> ------------------------------------
>
> Key: HADOOP-14086
> URL: https://issues.apache.org/jira/browse/HADOOP-14086
> Project: Hadoop Common
> Issue Type: Improvement
> Components: tools/distcp
> Affects Versions: 2.6.5
> Reporter: Zheng Shao
> Assignee: Zheng Shao
> Priority: Minor
>
> When using distcp to copy lots of small files, NameNode naturally becomes a
> bottleneck.
> The current distcp code did *not* optimize to reduce the NameNode calls. We
> should restructure the code to reduce the number of NameNode calls as much as
> possible to speed up the copy of small files.