[jira] [Commented] (HADOOP-14086) Improve DistCp Speed for small files

2017-02-27 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-14086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15886349#comment-15886349
 ] 

Steve Loughran commented on HADOOP-14086:
-

Nothing yet; I'm just scared about what could be done.

If you look at HADOOP-11694 you can see what will be coming your way. The big 
one is HADOOP-13208, as it can go from treewalking a mocked directory tree to 
direct object store API calls.

If you can use the listFiles calls here, then again: significant speedup, 
especially at scale.

That listFiles call also returns a remote iterator of {{LocatedFileStatus}} 
instances; it is up to the implementation to see whether it can optimise it. 
Maybe HDFS could do something here too, e.g. async refresh of the next batch 
of entries while the first lot is being copied.
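
A minimal sketch of that "fetch the next batch while the first lot is being 
copied" idea, in plain Java rather than against the Hadoop classes. 
Everything named here ({{PrefetchingBatches}}, {{fetchBatch}}, {{drain}}) is 
hypothetical, and the batch source is assumed to return an empty list once 
the listing is exhausted; it is not how HDFS or any object store client 
actually implements its iterator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.concurrent.CompletableFuture;
import java.util.function.IntFunction;

/** Hands out batches while the following batch is fetched in the background. */
final class PrefetchingBatches<T> implements Iterator<List<T>> {
    // Hypothetical batch source: returns batch number i, or an empty list at the end.
    private final IntFunction<List<T>> fetchBatch;
    private int nextIndex = 0;
    private CompletableFuture<List<T>> pending; // fetch currently in flight
    private List<T> ready;                      // fetched but not yet handed out

    PrefetchingBatches(IntFunction<List<T>> fetchBatch) {
        this.fetchBatch = fetchBatch;
        this.pending = fetchAsync();            // start fetching batch 0 immediately
    }

    private CompletableFuture<List<T>> fetchAsync() {
        final int index = nextIndex++;
        return CompletableFuture.supplyAsync(() -> fetchBatch.apply(index));
    }

    @Override
    public boolean hasNext() {
        if (ready == null) {
            ready = pending.join();             // wait for the in-flight fetch
            if (!ready.isEmpty()) {
                pending = fetchAsync();         // overlap the next fetch with the caller's work
            }
        }
        return !ready.isEmpty();
    }

    @Override
    public List<T> next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        List<T> batch = ready;
        ready = null;
        return batch;
    }

    /** Drains all batches into one flat list (stand-in for "copy each file"). */
    static <T> List<T> drain(Iterator<List<T>> batches) {
        List<T> all = new ArrayList<>();
        while (batches.hasNext()) {
            all.addAll(batches.next());
        }
        return all;
    }

    public static void main(String[] args) {
        List<List<String>> pages = Arrays.asList(
                Arrays.asList("f1", "f2"), Arrays.asList("f3"));
        PrefetchingBatches<String> it = new PrefetchingBatches<String>(
                i -> i < pages.size() ? pages.get(i) : Collections.<String>emptyList());
        System.out.println(drain(it));
    }
}
```

The caller only ever blocks in {{hasNext()}}, and by then the next fetch has 
usually been running for as long as the previous batch took to process.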

Note also HADOOP-13169: randomizing the file listing to spread load across 
shards in S3, boosting both read and write performance.
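
To illustrate the randomizing idea only (this is not the HADOOP-13169 patch), 
here is a small plain-Java sketch. It assumes the listing arrives sorted, so 
that neighbouring entries share a key prefix and would otherwise be copied 
back to back; the class name, method name and sample paths are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class ShuffledListing {
    /** Returns a shuffled copy of the listing; the input list is left untouched. */
    static List<String> shuffled(List<String> paths, long seed) {
        List<String> copy = new ArrayList<>(paths);
        // A fixed seed keeps the order reproducible, which helps when debugging.
        Collections.shuffle(copy, new Random(seed));
        return copy;
    }

    public static void main(String[] args) {
        // Hypothetical sorted listing: entries with the same prefix sit together.
        List<String> sorted = Arrays.asList(
                "a/part-0000", "a/part-0001", "b/part-0000", "b/part-0001");
        System.out.println(shuffled(sorted, 42L));
    }
}
```

Shuffling changes only the order in which entries are handed out, not which 
entries are copied.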

> Improve DistCp Speed for small files
> 
>
> Key: HADOOP-14086
> URL: https://issues.apache.org/jira/browse/HADOOP-14086
> Project: Hadoop Common
>  Issue Type: Improvement
>  Components: tools/distcp
>Affects Versions: 2.6.5
>Reporter: Zheng Shao
>Assignee: Zheng Shao
>Priority: Minor
>
> When using distcp to copy lots of small files,  NameNode naturally becomes a 
> bottleneck.
> The current distcp code did *not* optimize to reduce the NameNode calls.  We 
> should restructure the code to reduce the number of NameNode calls as much as 
> possible to speed up the copy of small files.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-14086) Improve DistCp Speed for small files

2017-02-27 Thread Erik Krogen (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-14086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15886105#comment-15886105
 ] 

Erik Krogen commented on HADOOP-14086:
--

[~zhz] Currently, multiple calls are made for each file; even reducing a 
distcp of 1M files to 1M {{getFileInfo}} calls would be a big improvement over 
the current implementation.

[~ste...@apache.org], what about this JIRA makes you worry that object store 
performance will be worse? Nothing stands out to me, so I am curious. Also, are 
you saying that the listFiles performance work is already done, or in 
progress? Do you have a JIRA link?




[jira] [Commented] (HADOOP-14086) Improve DistCp Speed for small files

2017-02-16 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-14086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15869708#comment-15869708
 ] 

Steve Loughran commented on HADOOP-14086:
-

# Target version will have to be branch-2+, with backports as people feel 
appropriate.
# Please don't make things worse for object stores. One thing we've started 
doing there is massively boosting the performance of listFiles(path, 
recursive=true), which we can take from a slow emulation of a recursive 
treewalk to O(1 + files/5000) calls. If you could use that to iterate over the 
LocatedFileStatus entries and then hand that status data directly to the 
workers, it'd be great for object stores while still delivering good NN 
perf.
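
To put rough numbers on the O(1 + files/5000) figure, here is a minimal 
plain-Java sketch. It assumes 5000 is the number of entries returned per list 
request and that a tree walk costs at least one list request per directory. 
The class and method names, the 1M file count and the 50,000 directory count 
are all hypothetical, and a real client may issue extra calls per directory.

```java
// Back-of-the-envelope request counts; every input except the 5000-entry
// page size is an assumption made for illustration.
final class ListingCost {
    /** Requests for a flat, paged listing: one initial call plus one per full page. */
    static long pagedListCalls(long files, long pageSize) {
        return 1 + files / pageSize;
    }

    /** Lower bound for a tree walk: one list request per directory visited. */
    static long treeWalkCalls(long directories) {
        return directories;
    }

    public static void main(String[] args) {
        long files = 1_000_000L;
        long directories = 50_000L; // assumed: about 20 files per directory
        System.out.println("paged listing: " + pagedListCalls(files, 5000L));
        System.out.println("tree walk:     " + treeWalkCalls(directories));
    }
}
```

With these assumed inputs the paged listing needs 201 requests, against at 
least 50,000 for the walk.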




[jira] [Commented] (HADOOP-14086) Improve DistCp Speed for small files

2017-02-15 Thread Zhe Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-14086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15869273#comment-15869273
 ] 

Zhe Zhang commented on HADOOP-14086:


Thanks Zheng. This will be a very useful improvement. Any idea how to reduce 
the NN workload? At the end of the day, if we distcp 1M files we need to make 
1M {{getFileInfo}} calls. We thought about querying the standby NameNode 
(SbNN) but haven't investigated very far.
