[jira] [Commented] (HADOOP-15384) distcp numListstatusThreads option doesn't get to -delete scan

2018-07-10 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-15384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16538339#comment-16538339
 ] 

Hudson commented on HADOOP-15384:
-

FAILURE: Integrated in Jenkins build Hadoop-trunk-Commit #14548 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/14548/])
HADOOP-15384. distcp numListstatusThreads option doesn't get to -delete 
(stevel: rev ca8b80bf59c0570bb9172208d3a6c993a6854514)
* (edit) hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/contract/AbstractContractDistCpTest.java
* (edit) hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpOptions.java
* (edit) hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/CopyCommitter.java


> distcp numListstatusThreads option doesn't get to -delete scan
> --
>
> Key: HADOOP-15384
> URL: https://issues.apache.org/jira/browse/HADOOP-15384
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: tools/distcp
> Affects Versions: 3.1.0
> Reporter: Steve Loughran
> Assignee: Steve Loughran
> Priority: Major
> Fix For: 3.1.1
>
> Attachments: HADOOP-15384-001.patch
>
>
> The distcp {{numListstatusThreads}} option isn't used when configuring the
> GlobbedCopyListing used in {{CopyCommitter.deleteMissing()}}.
> This means that for large scans of object stores, performance is
> significantly worse than it would be with the requested thread count.
> Fix: pass the option down from the task conf.
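
To illustrate the description above, here is a minimal, self-contained sketch of the pattern, not the actual patch. Everything in it is a stand-in: the {{Map}} plays the role of the task conf, {{ListingOptions}} is a hypothetical replacement for the options handed to the listing, and the key name and default of 1 are assumptions that should be checked against {{DistCpConstants}}.

```java
import java.util.HashMap;
import java.util.Map;

class ThreadCountPropagation {
    // Assumed key and default; check DistCpConstants for the real values.
    static final String LISTSTATUS_THREADS_KEY = "distcp.liststatus.threads";
    static final int DEFAULT_LISTSTATUS_THREADS = 1;

    /** Hypothetical stand-in for the options given to the -delete listing. */
    static final class ListingOptions {
        final int numListstatusThreads;

        ListingOptions(int numListstatusThreads) {
            this.numListstatusThreads = numListstatusThreads;
        }
    }

    /** The reported behaviour: options are rebuilt without consulting the conf. */
    static ListingOptions withoutThreadCount(Map<String, String> taskConf) {
        return new ListingOptions(DEFAULT_LISTSTATUS_THREADS);
    }

    /** The described fix: read the thread count from the task conf and pass it down. */
    static ListingOptions fromTaskConf(Map<String, String> taskConf) {
        String raw = taskConf.get(LISTSTATUS_THREADS_KEY);
        int threads = (raw == null) ? DEFAULT_LISTSTATUS_THREADS : Integer.parseInt(raw);
        return new ListingOptions(threads);
    }

    public static void main(String[] args) {
        Map<String, String> taskConf = new HashMap<>();
        taskConf.put(LISTSTATUS_THREADS_KEY, "20");
        System.out.println("ignored:   " + withoutThreadCount(taskConf).numListstatusThreads);
        System.out.println("passed on: " + fromTaskConf(taskConf).numListstatusThreads);
    }
}
```

The point of the sketch is only that the value the user set survives in the conf, so the committer can recover it when it rebuilds its options for the scan.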



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-15384) distcp numListstatusThreads option doesn't get to -delete scan

2018-07-09 Thread Ewan Higgs (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-15384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16536897#comment-16536897
 ] 

Ewan Higgs commented on HADOOP-15384:
-

I've tested this using 1, 5, and 20 threads and we get the expected performance 
improvement when constructing the source list.

+1




[jira] [Commented] (HADOOP-15384) distcp numListstatusThreads option doesn't get to -delete scan

2018-07-02 Thread Ewan Higgs (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-15384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16529875#comment-16529875
 ] 

Ewan Higgs commented on HADOOP-15384:
-

The code LGTM. Trying to test this to see if there is a significant performance 
impact on S3.




[jira] [Commented] (HADOOP-15384) distcp numListstatusThreads option doesn't get to -delete scan

2018-06-25 Thread genericqa (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-15384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16523025#comment-16523025
 ] 

genericqa commented on HADOOP-15384:


| (/) *{color:green}+1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 23s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 27m 21s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 24s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 13s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 26s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 16s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 30s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 18s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 25s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 21s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 21s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 10s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 23s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 11s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 37s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 16s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 12m 36s{color} | {color:green} hadoop-distcp in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 21s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 68m 30s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:abb62dd |
| JIRA Issue | HADOOP-15384 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12929072/HADOOP-15384-001.patch |
| Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux 06f7a1a1fde3 3.13.0-137-generic #186-Ubuntu SMP Mon Dec 4 19:09:19 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 7a3c6e9 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_171 |
| findbugs | v3.1.0-RC1 |
| Test Results | https://builds.apache.org/job/PreCommit-HADOOP-Build/14818/testReport/ |
| Max. process+thread count | 335 (vs. ulimit of 1) |
| modules | C: hadoop-tools/hadoop-distcp U: hadoop-tools/hadoop-distcp |
| Console output | https://builds.apache.org/job/PreCommit-HADOOP-Build/14818/console |
| Powered by | Apache Yetus 0.8.0-SNAPSHOT   http://yetus.apache.org |


This message was automatically generated.




[jira] [Commented] (HADOOP-15384) distcp numListstatusThreads option doesn't get to -delete scan

2018-06-25 Thread Steve Loughran (JIRA)


[ 
https://issues.apache.org/jira/browse/HADOOP-15384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16522708#comment-16522708
 ] 

Steve Loughran commented on HADOOP-15384:
-

Patch 001: passes the thread count down, logs it, and splits the reported 
listing time into source and destination.

This gives tangible improvements when an object store is the destination, 
though the mimicking of directory trees can still make distcp to some remote 
stores (s3, swift) tangibly awful. This is all that can be done short of a 
complete rewrite, which I don't intend to propose: distcp is a complex and 
critical part of too many people's workflows.
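
As a rough illustration of why the thread count matters when every directory listing is a remote round trip, here is a small sketch that walks an in-memory tree with a fixed-size pool and times the walk. It is not how distcp implements its listing: the tree, the {{listAll}} helper, the level-by-level traversal and the simulated latency are all assumptions made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelListingSketch {
    /** Hypothetical store: directory path -> children; paths ending in "/" are directories. */
    static Map<String, List<String>> sampleTree() {
        Map<String, List<String>> tree = new HashMap<>();
        tree.put("root/", Arrays.asList("root/a/", "root/b/", "root/f0"));
        tree.put("root/a/", Arrays.asList("root/a/f1"));
        tree.put("root/b/", Arrays.asList("root/b/f2"));
        return tree;
    }

    /** Lists everything under root, one level at a time, with the given number of threads. */
    static List<String> listAll(Map<String, List<String>> tree, String root,
                                int threads, long latencyMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<String> found = new ArrayList<>();
            List<String> level = Collections.singletonList(root);
            while (!level.isEmpty()) {
                // Submit one listing call per directory in the current level.
                List<Future<List<String>>> pending = new ArrayList<>();
                for (String dir : level) {
                    pending.add(pool.submit(() -> {
                        Thread.sleep(latencyMillis); // simulated remote round trip
                        return tree.getOrDefault(dir, Collections.<String>emptyList());
                    }));
                }
                // Collect results in submission order; subdirectories form the next level.
                List<String> next = new ArrayList<>();
                for (Future<List<String>> result : pending) {
                    for (String child : result.get()) {
                        found.add(child);
                        if (child.endsWith("/")) {
                            next.add(child);
                        }
                    }
                }
                level = next;
            }
            return found;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("listing failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> tree = sampleTree();
        for (int threads : new int[] {1, 4}) {
            long start = System.nanoTime();
            int count = listAll(tree, "root/", threads, 50L).size();
            long millis = (System.nanoTime() - start) / 1_000_000;
            System.out.println(threads + " thread(s): " + count + " paths in " + millis + " ms");
        }
    }
}
```

With one thread, sibling directories are listed one after another. With more threads, listings on the same level overlap, which is where the saving comes from on high-latency stores.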

Testing: ran {{ITestS3AContractDistCp}} against S3 Ireland. No new tests, as 
the distcp contract tests were already setting the thread count; I have raised 
it to 40 though, that being the maximum.

Here's the output of the relevant phase of 
{{testUpdateDeepDirectoryStructureToRemote}}:

{code}
2018-06-25 19:35:18,047 [Thread-139] INFO  mapred.CopyCommitter (CopyCommitter.java:deleteMissing(387)) - -delete option is enabled. About to remove entries from target that are missing in source
2018-06-25 19:35:18,062 [Thread-139] INFO  mapred.CopyCommitter (CopyCommitter.java:deleteMissing(396)) - Source listing completed in 0:00:00.015
2018-06-25 19:35:18,063 [Thread-139] INFO  mapred.CopyCommitter (CopyCommitter.java:listTargetFiles(554)) - Scanning destination directory s3a://hwdev-steve-ireland-new/test/ITestS3AContractDistCp/testUpdateDeepDirectoryStructureToRemote/remote/DELAY_LISTING_ME/outputDir/inputDir with thread count: 40
2018-06-25 19:35:19,872 [Thread-139] INFO  tools.SimpleCopyListing (SimpleCopyListing.java:printStats(608)) - Paths (files+dirs) cnt = 11; dirCnt = 5
2018-06-25 19:35:19,872 [Thread-139] INFO  tools.SimpleCopyListing (SimpleCopyListing.java:doBuildListing(402)) - Build file listing completed.
2018-06-25 19:35:19,886 [Thread-139] INFO  tools.DistCp (CopyListing.java:buildListing(94)) - Number of paths in the copy list: 11
2018-06-25 19:35:19,899 [Thread-139] INFO  tools.DistCp (CopyListing.java:buildListing(94)) - Number of paths in the copy list: 11
2018-06-25 19:35:19,913 [Thread-139] INFO  mapred.CopyCommitter (CopyCommitter.java:deleteMissing(415)) - Destination listing completed in 0:00:01.851
{code}

And for {{ITestAzureNativeContractDistCp}}:
{code}
2018-06-25 20:11:44,992 INFO  [Thread-147]: mapred.LocalJobRunner (LocalJobRunner.java:runTasks(486)) - map task executor complete.
2018-06-25 20:11:44,992 INFO  [Thread-147]: mapred.CopyCommitter (CopyCommitter.java:concatFileChunks(210)) - concat file chunks ...
2018-06-25 20:11:45,405 INFO  [Thread-147]: mapred.CopyCommitter (CopyCommitter.java:deleteMissing(387)) - -delete option is enabled. About to remove entries from target that are missing in source
2018-06-25 20:11:45,418 INFO  [Thread-147]: mapred.CopyCommitter (CopyCommitter.java:deleteMissing(396)) - Source listing completed in 0:00:00.013
2018-06-25 20:11:45,419 INFO  [Thread-147]: mapred.CopyCommitter (CopyCommitter.java:listTargetFiles(554)) - Scanning destination directory wasb://contr...@contender.blob.core.windows.net/test/ITestAzureNativeContractDistCp/testUpdateDeepDirectoryStructureToRemote/remote/outputDir/inputDir with thread count: 40
2018-06-25 20:11:46,338 INFO  [Thread-147]: tools.SimpleCopyListing (SimpleCopyListing.java:printStats(608)) - Paths (files+dirs) cnt = 11; dirCnt = 5
2018-06-25 20:11:46,338 INFO  [Thread-147]: tools.SimpleCopyListing (SimpleCopyListing.java:doBuildListing(402)) - Build file listing completed.
2018-06-25 20:11:46,351 INFO  [Thread-147]: tools.DistCp (CopyListing.java:buildListing(94)) - Number of paths in the copy list: 11
2018-06-25 20:11:46,365 INFO  [Thread-147]: tools.DistCp (CopyListing.java:buildListing(94)) - Number of paths in the copy list: 11
2018-06-25 20:11:46,377 INFO  [Thread-147]: mapred.CopyCommitter (CopyCommitter.java:deleteMissing(415)) - Destination listing completed in 0:00:00.959
{code}

A small bird fell out of the sky, deceased, during the S3A run. It didn't 
happen on a rerun, so I'm assuming it was unrelated. If more wild animals die 
during S3 integration tests, that would be something to consider a significant 
regression in the AWS SDK.


+ [~ehiggs], [~fabbri]

