[jira] [Commented] (HADOOP-15292) Distcp's use of pread is slowing it down.

2018-03-08 Thread Virajith Jalaparti (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-15292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16391856#comment-16391856
 ] 

Virajith Jalaparti commented on HADOOP-15292:
-

[~ste...@apache.org], thanks for reviewing and committing it. [~elgoiri] and 
[~chris.douglas], thanks for the reviews.

> Distcp's use of pread is slowing it down.
> -
>
> Key: HADOOP-15292
> URL: https://issues.apache.org/jira/browse/HADOOP-15292
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: tools/distcp
>Affects Versions: 2.5.0
>Reporter: Virajith Jalaparti
>Assignee: Virajith Jalaparti
>Priority: Minor
> Fix For: 3.1.0
>
> Attachments: HADOOP-15292.000.patch, HADOOP-15292.001.patch, 
> HADOOP-15292.002.patch
>
>
> Distcp currently uses positioned-reads (in 
> RetriableFileCopyCommand#copyBytes) when the source offset is > 0. This 
> results in unnecessary overheads (new BlockReader being created on the 
> client-side, multiple readBlock() calls to the Datanodes, each of which 
> requires the creation of a BlockSender and an inputstream to the ReplicaInfo).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org



[jira] [Commented] (HADOOP-15292) Distcp's use of pread is slowing it down.

2018-03-08 Thread Rushabh S Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-15292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16391154#comment-16391154
 ] 

Rushabh S Shah commented on HADOOP-15292:
-

[~ste...@apache.org]: This seems like a performance improvement to the distcp tool.
Should we also backport it to branch-2.8?




[jira] [Commented] (HADOOP-15292) Distcp's use of pread is slowing it down.

2018-03-08 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-15292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16391120#comment-16391120
 ] 

Hudson commented on HADOOP-15292:
-

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #13796 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/13796/])
HADOOP-15292. Distcp's use of pread is slowing it down. Contributed by (stevel: 
rev 3bd6b1fd85c44354c777ef4fda6415231505b2a4)
* (edit) 
hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/RetriableFileCopyCommand.java
* (edit) 
hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/util/ThrottledInputStream.java
* (edit) 
hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/mapred/TestCopyMapper.java





[jira] [Commented] (HADOOP-15292) Distcp's use of pread is slowing it down.

2018-03-07 Thread Chris Douglas (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-15292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390440#comment-16390440
 ] 

Chris Douglas commented on HADOOP-15292:


+1 

> Distcp's use of pread is slowing it down.
> -
>
> Key: HADOOP-15292
> URL: https://issues.apache.org/jira/browse/HADOOP-15292
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: tools/distcp
>Affects Versions: 2.5.0
>Reporter: Virajith Jalaparti
>Priority: Minor
> Attachments: HADOOP-15292.000.patch, HADOOP-15292.001.patch, 
> HADOOP-15292.002.patch
>
>
> Distcp currently uses positioned-reads (in 
> RetriableFileCopyCommand#copyBytes) when the source offset is > 0. This 
> results in unnecessary overheads (new BlockReader being created on the 
> client-side, multiple readBlock() calls to the Datanodes, each of which 
> requires the creation of a BlockSender and an inputstream to the ReplicaInfo).






[jira] [Commented] (HADOOP-15292) Distcp's use of pread is slowing it down.

2018-03-07 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-15292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390383#comment-16390383
 ] 

Steve Loughran commented on HADOOP-15292:
-

LGTM. 

Chris, what say you?




[jira] [Commented] (HADOOP-15292) Distcp's use of pread is slowing it down.

2018-03-07 Thread genericqa (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-15292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390141#comment-16390141
 ] 

genericqa commented on HADOOP-15292:


| (/) *+1 overall* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 11m 53s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| || || || trunk Compile Tests ||
| +1 | mvninstall | 15m 38s | trunk passed |
| +1 | compile | 0m 23s | trunk passed |
| +1 | checkstyle | 0m 18s | trunk passed |
| +1 | mvnsite | 0m 25s | trunk passed |
| +1 | shadedclient | 9m 58s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 0m 29s | trunk passed |
| +1 | javadoc | 0m 17s | trunk passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 0m 23s | the patch passed |
| +1 | compile | 0m 19s | the patch passed |
| +1 | javac | 0m 20s | the patch passed |
| -0 | checkstyle | 0m 13s | hadoop-tools/hadoop-distcp: The patch generated 1 new + 95 unchanged - 1 fixed = 96 total (was 96) |
| +1 | mvnsite | 0m 22s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 10m 26s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 0m 34s | the patch passed |
| +1 | javadoc | 0m 16s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 14m 40s | hadoop-distcp in the patch passed. |
| +1 | asflicense | 0m 22s | The patch does not generate ASF License warnings. |
| | | 67m 12s | |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:d4cc50f |
| JIRA Issue | HADOOP-15292 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12913424/HADOOP-15292.002.patch |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 343b69026d0a 4.4.0-43-generic #63-Ubuntu SMP Wed Oct 12 13:48:03 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / e0307e5 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_151 |
| findbugs | v3.1.0-RC1 |
| checkstyle | https://builds.apache.org/job/PreCommit-HADOOP-Build/14273/artifact/out/diff-checkstyle-hadoop-tools_hadoop-distcp.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HADOOP-Build/14273/testReport/ |
| Max. process+thread count | 434 (vs. ulimit of 1) |
| modules | C: hadoop-tools/hadoop-distcp U: hadoop-tools/hadoop-distcp |
| Console output | https://builds.apache.org/job/PreCommit-HADOOP-Build/14273/console |
| Powered by | Apache Yetus 0.8.0-SNAPSHOT   http://yetus.apache.org |


[jira] [Commented] (HADOOP-15292) Distcp's use of pread is slowing it down.

2018-03-07 Thread Virajith Jalaparti (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-15292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390047#comment-16390047
 ] 

Virajith Jalaparti commented on HADOOP-15292:
-

 [^HADOOP-15292.002.patch] addresses [~ste...@apache.org]'s and [~chris.douglas]'s 
comment about seeking only when {{sourceOffset != inStream.getPos()}}. 

[~ste...@apache.org] {{ITestAzureNativeContractDistCp}} passes with this fix.




[jira] [Commented] (HADOOP-15292) Distcp's use of pread is slowing it down.

2018-03-07 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-15292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389616#comment-16389616
 ] 

Steve Loughran commented on HADOOP-15292:
-

# I like the extra instrumentation & probes; if it works for HDFS it'll be the 
same everywhere
# I think Chris's comment about {{sourceOffset != inStream.getPos()}} is 
valid. If the file is newly opened, this is the same as offset!=0, otherwise 
it's relative to where you are.

w.r.t S3 testing, I can see why it wouldn't be your default, but our test 
suites are designed to be very low cost (no persistent data, bias to uploads 
and large D/Ls all from AWS funded buckets). It's worth getting set up for this 
to help verify consistent behaviour everywhere. 

At the very least, make sure the Azure WASB store tests are happy. (you don't 
get an ADL test until HADOOP-15209). 

> Distcp's use of pread is slowing it down.
> -
>
> Key: HADOOP-15292
> URL: https://issues.apache.org/jira/browse/HADOOP-15292
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: tools/distcp
>Affects Versions: 2.5.0
>Reporter: Virajith Jalaparti
>Priority: Minor
> Attachments: HADOOP-15292.000.patch, HADOOP-15292.001.patch
>
>
> Distcp currently uses positioned-reads (in 
> RetriableFileCopyCommand#copyBytes) when the source offset is > 0. This 
> results in unnecessary overheads (new BlockReader being created on the 
> client-side, multiple readBlock() calls to the Datanodes, each of which 
> requires the creation of a BlockSender and an inputstream to the ReplicaInfo).






[jira] [Commented] (HADOOP-15292) Distcp's use of pread is slowing it down.

2018-03-06 Thread genericqa (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-15292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16389006#comment-16389006
 ] 

genericqa commented on HADOOP-15292:


| (/) *+1 overall* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 17m 45s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| || || || trunk Compile Tests ||
| +1 | mvninstall | 17m 56s | trunk passed |
| +1 | compile | 0m 22s | trunk passed |
| +1 | checkstyle | 0m 17s | trunk passed |
| +1 | mvnsite | 0m 26s | trunk passed |
| +1 | shadedclient | 10m 32s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 0m 30s | trunk passed |
| +1 | javadoc | 0m 17s | trunk passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 0m 26s | the patch passed |
| +1 | compile | 0m 20s | the patch passed |
| +1 | javac | 0m 20s | the patch passed |
| -0 | checkstyle | 0m 13s | hadoop-tools/hadoop-distcp: The patch generated 1 new + 95 unchanged - 1 fixed = 96 total (was 96) |
| +1 | mvnsite | 0m 22s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 11m 2s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 0m 36s | the patch passed |
| +1 | javadoc | 0m 16s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 14m 57s | hadoop-distcp in the patch passed. |
| +1 | asflicense | 0m 24s | The patch does not generate ASF License warnings. |
| | | 76m 54s | |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:d4cc50f |
| JIRA Issue | HADOOP-15292 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12913313/HADOOP-15292.001.patch |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 7ba12e181b17 3.13.0-135-generic #184-Ubuntu SMP Wed Oct 18 11:55:51 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / edf9445 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_151 |
| findbugs | v3.1.0-RC1 |
| checkstyle | https://builds.apache.org/job/PreCommit-HADOOP-Build/14267/artifact/out/diff-checkstyle-hadoop-tools_hadoop-distcp.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HADOOP-Build/14267/testReport/ |
| Max. process+thread count | 342 (vs. ulimit of 1) |
| modules | C: hadoop-tools/hadoop-distcp U: hadoop-tools/hadoop-distcp |
| Console output | https://builds.apache.org/job/PreCommit-HADOOP-Build/14267/console |
| Powered by | Apache Yetus 0.8.0-SNAPSHOT   http://yetus.apache.org |

[jira] [Commented] (HADOOP-15292) Distcp's use of pread is slowing it down.

2018-03-06 Thread Chris Douglas (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-15292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388959#comment-16388959
 ] 

Chris Douglas commented on HADOOP-15292:


{noformat}
+  if (sourceOffset != 0) {
+inStream.seek(sourceOffset);
{noformat}
Should this be {{sourceOffset != inStream.getPos()}} ?
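
To illustrate the question, here is a minimal sketch of the guard. It does not use the Hadoop API: the nested {{SeekableSource}} interface and {{CountingSource}} class are hypothetical stand-ins that only mimic {{getPos()}} and {{seek()}}, so that the number of seeks can be observed.

```java
class SeekIfNeeded {
  /** Hypothetical stand-in for a seekable input stream; not the Hadoop API. */
  interface SeekableSource {
    long getPos();
    void seek(long pos);
  }

  /** Tracks a position and counts seek() calls so the guard's effect is visible. */
  static final class CountingSource implements SeekableSource {
    private long pos;
    int seeks;

    CountingSource(long initialPos) {
      this.pos = initialPos;
    }

    @Override
    public long getPos() {
      return pos;
    }

    @Override
    public void seek(long newPos) {
      pos = newPos;
      seeks++;
    }
  }

  /** Seeks only when the stream is not already at the requested offset. */
  static void positionAt(SeekableSource in, long sourceOffset) {
    if (sourceOffset != in.getPos()) {
      in.seek(sourceOffset);
    }
  }

  public static void main(String[] args) {
    CountingSource in = new CountingSource(0);
    positionAt(in, 0);      // already at 0: no seek
    positionAt(in, 4096);   // moves to 4096: one seek
    positionAt(in, 4096);   // already there: no further seek
    System.out.println("seeks=" + in.seeks + " pos=" + in.getPos());
  }
}
```

Compared with checking {{sourceOffset != 0}}, this form also skips the seek when a stream that has already been read from happens to sit at the requested offset.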




[jira] [Commented] (HADOOP-15292) Distcp's use of pread is slowing it down.

2018-03-06 Thread Íñigo Goiri (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-15292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388934#comment-16388934
 ] 

Íñigo Goiri commented on HADOOP-15292:
--

Thanks [~virajith] and [~chris.douglas].
I missed the last part:
{code}
assertCounter(readCounter, readsFromClient + numFiles, rb);
{code}
I think that's actually enough.
Just to double check, before the change, the unit test failed in this assert, 
right?

> Distcp's use of pread is slowing it down.
> -
>
> Key: HADOOP-15292
> URL: https://issues.apache.org/jira/browse/HADOOP-15292
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: tools/distcp
>Affects Versions: 3.0.0
>Reporter: Virajith Jalaparti
>Priority: Minor
> Attachments: HADOOP-15292.000.patch, HADOOP-15292.001.patch
>
>
> Distcp currently uses positioned-reads (in 
> RetriableFileCopyCommand#copyBytes) when the source offset is > 0. This 
> results in unnecessary overheads (new BlockReader being created on the 
> client-side, multiple readBlock() calls to the Datanodes, each of which 
> requires the creation of a BlockSender and an inputstream to the ReplicaInfo).






[jira] [Commented] (HADOOP-15292) Distcp's use of pread is slowing it down.

2018-03-06 Thread Virajith Jalaparti (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-15292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388928#comment-16388928
 ] 

Virajith Jalaparti commented on HADOOP-15292:
-

bq.  Probably not worth adding metrics but maybe extend the stream in the unit 
test and track how many times we open it.

[~elgoiri], I extended {{TestCopyMapper#testCopyWithAppend}} to test the number 
of calls to {{readBlock}} on the Datanode (this is captured by the metric 
{{readsFromLocalClient}}), in  [^HADOOP-15292.001.patch]. The buffer size is 
set such that if pread was used, this metric would increase much more than the 
number of files that are being appended to.

[~ste...@apache.org] I tested this over an internal filesystem. This issue came 
up when I was using distcp to copy data from the filesystem using the tiered 
HDFS feature (HDFS-9806). This particular filesystem throttled calls to open() 
beyond a certain QPS. For large enough data, distcp never succeeded as it 
caused way too many calls to open() (pread causes a separate InputStream to be 
opened for the {{ProvidedReplica}} for each 8k chunk of data, which results in 
an open() call to the remote filesystem). With this fix, Distcp runs fine, and 
I don't see the throttling any more. 
I think this goes towards the "extra rigorousness" you are looking for. I am 
not really set up to test this with s3.
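
As a rough back-of-the-envelope illustration of the numbers involved, the sketch below models one open() per 8k chunk for pread against a single open() for a sequential read. The class name, method names and the 1 GiB file size are assumptions chosen for the example; real counts depend on the buffer size and on the filesystem implementation.

```java
class OpenCallEstimate {
  /** Opens needed if every chunk-sized positioned read opens its own stream. */
  static long opensWithPread(long fileBytes, long chunkBytes) {
    return (fileBytes + chunkBytes - 1) / chunkBytes; // ceiling division
  }

  /** Opens needed if the stream is opened once and then read sequentially. */
  static long opensWithSequentialRead(long fileBytes) {
    return fileBytes > 0 ? 1 : 0;
  }

  public static void main(String[] args) {
    long chunk = 8L * 1024;           // the 8k chunk size mentioned above
    long file = 1024L * 1024 * 1024;  // assumed example size: 1 GiB
    System.out.println("pread opens: " + opensWithPread(file, chunk));
    System.out.println("sequential opens: " + opensWithSequentialRead(file));
  }
}
```

Under this simplified model a single 1 GiB file needs 131072 opens with pread, which makes it plausible that a store throttling open() by QPS would reject the copy.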





[jira] [Commented] (HADOOP-15292) Distcp's use of pread is slowing it down.

2018-03-06 Thread Chris Douglas (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-15292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16388859#comment-16388859
 ] 

Chris Douglas commented on HADOOP-15292:


Instead of passing a flag to {{readBytes}}, this can just call {{seek()}} 
outside the loop (and include the {{getPos() != position}} optimization).

[~ste...@apache.org] are you set up to test S3? {{pread}} happens to have an 
expensive implementation in HDFS (and other {{FileSystem}} impls), but creating 
a test for distcp to ensure the {{PositionedReadable}} APIs aren't used seems 
excessive.

bq. Not sure if it's worth extending that unit test to track how many times we 
open the stream.
From the description, it's inside the DN where {{pread}} creates multiple 
streams. IIRC the position of the stream isn't updated when using PR APIs. If 
the stream were shared that could be an issue, but that's not in the design. 
In HDFS, updating the set of locations for each read (without checking the 
distcp invariants) is also unused, here.

Demonstrating the fix with a demo in HDFS would be sufficient for commit, IMO. 
It might be possible to add a test around the command itself to ensure the 
{{seek()}} is correct on retry, but wiring the flaw into a test would require a 
{{MiniDFSCluster}}.
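
The difference between a positioned read and a seek followed by sequential reads can be sketched with plain java.nio as an analogy. This is not the Hadoop {{PositionedReadable}} API and says nothing about its cost in HDFS; {{FileChannel.read(ByteBuffer, long)}} is only used because it likewise leaves the current position untouched. The class name, the temp file and the offsets are assumptions made for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class PositionedVsSequential {
  /** Returns {position after a positional read, position after seek + sequential read}. */
  static long[] positionsAfterReads(Path file, long offset, int len) throws IOException {
    try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
      ByteBuffer buf = ByteBuffer.allocate(len);
      ch.read(buf, offset);              // positional read: channel position is untouched
      long afterPositional = ch.position();

      buf.clear();
      if (ch.position() != offset) {     // seek once, outside the copy loop
        ch.position(offset);
      }
      while (buf.hasRemaining() && ch.read(buf) != -1) {
        // each sequential read advances the channel position
      }
      return new long[] {afterPositional, ch.position()};
    }
  }

  /** Runs the comparison against a 1024-byte temporary file. */
  static long[] demo() {
    try {
      Path tmp = Files.createTempFile("pread-demo", ".bin");
      try {
        Files.write(tmp, new byte[1024]);
        return positionsAfterReads(tmp, 256, 128);
      } finally {
        Files.deleteIfExists(tmp);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    long[] p = demo();
    System.out.println("after positional read: " + p[0]);
    System.out.println("after seek + sequential read: " + p[1]);
  }
}
```

With the assumed offset of 256 and length of 128, the positional read leaves the position at 0, while the seek plus sequential read ends at 384.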

> Distcp's use of pread is slowing it down.
> -
>
> Key: HADOOP-15292
> URL: https://issues.apache.org/jira/browse/HADOOP-15292
> Project: Hadoop Common
>  Issue Type: Bug
>  Components: tools/distcp
>Affects Versions: 3.0.0
>Reporter: Virajith Jalaparti
>Priority: Minor
> Attachments: HADOOP-15292.000.patch
>
>
> Distcp currently uses positioned-reads (in 
> RetriableFileCopyCommand#copyBytes) when the source offset is > 0. This 
> results in unnecessary overheads (new BlockReader being created on the 
> client-side, multiple readBlock() calls to the Datanodes, each of which 
> requires the creation of a BlockSender and an inputstream to the ReplicaInfo).






[jira] [Commented] (HADOOP-15292) Distcp's use of pread is slowing it down.

2018-03-06 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-15292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387551#comment-16387551
 ] 

Steve Loughran commented on HADOOP-15292:
-

I am only now becoming aware of how some aspects of DistCp are inefficient when 
used against object stores (HADOOP-15281, HADOOP-15209); this looks like 
another. 


# The pread is potentially most underperforming in object stores which don't do 
good seeks on input streams: s3a in the distant past, swift.
# I'd like to see what happens when you test that in hadoop-aws (it's a scale 
test, which needs -Dscale) and {{ITestAzureNativeContractDistCp}}. 
# Stream opening could be measured in  {{ITestS3AContractDistCp}}, if a test 
for this was wired up & FileSystem.Statistics queried before and after the 
operation. 
# L305: could add a check for object store clients which seek even when desired 
== actual pos by wrapping seek() with {{if (getPos() != position)}}

And for extra rigorousness, build up hadoop and try doing some real distcp with 
src/dest any of the stores and ftp, which is used as a backup source.




[jira] [Commented] (HADOOP-15292) Distcp's use of pread is slowing it down.

2018-03-05 Thread genericqa (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-15292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387238#comment-16387238
 ] 

genericqa commented on HADOOP-15292:


| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 15s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 17m 14s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 23s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 16s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 25s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 10m 33s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 30s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 18s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  0m 24s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 21s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 21s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange}  0m 13s{color} | {color:orange} hadoop-tools/hadoop-distcp: The patch generated 1 new + 12 unchanged - 0 fixed = 13 total (was 12) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  0m 22s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m  0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 26s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  0m 43s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 17s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 14m 56s{color} | {color:green} hadoop-distcp in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 21s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 59m 12s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:d4cc50f |
| JIRA Issue | HADOOP-15292 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12913138/HADOOP-15292.000.patch |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  unit  shadedclient  findbugs  checkstyle  |
| uname | Linux 3dd3887ac5e8 3.13.0-135-generic #184-Ubuntu SMP Wed Oct 18 11:55:51 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 745190e |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_151 |
| findbugs | v3.1.0-RC1 |
| checkstyle | https://builds.apache.org/job/PreCommit-HADOOP-Build/14261/artifact/out/diff-checkstyle-hadoop-tools_hadoop-distcp.txt |
|  Test Results | https://builds.apache.org/job/PreCommit-HADOOP-Build/14261/testReport/ |
| Max. process+thread count | 301 (vs. ulimit of 1) |
| modules | C: hadoop-tools/hadoop-distcp U: hadoop-tools/hadoop-distcp |
| Console output | 

[jira] [Commented] (HADOOP-15292) Distcp's use of pread is slowing it down.

2018-03-05 Thread Íñigo Goiri (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-15292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387190#comment-16387190
 ] 

Íñigo Goiri commented on HADOOP-15292:
--

Thanks [~virajith] for the patch.
{{TestCopyMapper}} tests this behavior so we can check that this doesn't break.
Not sure if it's worth extending that unit test to track how many times we open 
the stream.
Probably not worth adding metrics but maybe extend the stream in the unit test 
and track how many times we open it.
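
One way the open-counting idea above could look, as a minimal sketch only: it uses plain java.io types rather than Hadoop's stream classes, and OpenCountingSource, open(), openCount() and copyFrom() are hypothetical names invented for illustration. They are not part of TestCopyMapper or of the attached patch.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical test helper: hands out streams over fixed data and
// counts how many times a stream was opened.
final class OpenCountingSource {
    private final byte[] data;
    private final AtomicInteger opens = new AtomicInteger();

    OpenCountingSource(byte[] data) {
        this.data = data.clone();
    }

    // Every call opens a fresh stream and bumps the counter.
    InputStream open() {
        opens.incrementAndGet();
        return new ByteArrayInputStream(data);
    }

    int openCount() {
        return opens.get();
    }

    // Copy loop under test (assumed shape): open once, skip to the
    // offset, then read sequentially until end of stream.
    static long copyFrom(OpenCountingSource src, long offset, byte[] buf)
            throws IOException {
        long total = 0;
        try (InputStream in = src.open()) {
            long remaining = offset;
            while (remaining > 0) {
                long skipped = in.skip(remaining);
                if (skipped <= 0) {
                    break; // reached end of stream before the offset
                }
                remaining -= skipped;
            }
            int n;
            while ((n = in.read(buf)) > 0) {
                total += n;
            }
        }
        return total;
    }
}
```

A test could then run the copy and assert that openCount() is 1, which would catch a regression back to one open per read.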

[~jingzhao] you implemented MAPREDUCE-5899, do you mind double checking that 
this approach is correct?

> Distcp's use of pread is slowing it down.
> -
>
> Key: HADOOP-15292
> URL: https://issues.apache.org/jira/browse/HADOOP-15292
> Project: Hadoop Common
>  Issue Type: Bug
>Reporter: Virajith Jalaparti
>Priority: Major
> Attachments: HADOOP-15292.000.patch
>
>
> Distcp currently uses positioned-reads (in 
> RetriableFileCopyCommand#copyBytes) when the source offset is > 0. This 
> results in unnecessary overheads (new BlockReader being created on the 
> client-side, multiple readBlock() calls to the Datanodes, each of which 
> requires the creation of a BlockSender and an inputstream to the ReplicaInfo).






[jira] [Commented] (HADOOP-15292) Distcp's use of pread is slowing it down.

2018-03-05 Thread Virajith Jalaparti (JIRA)

[ 
https://issues.apache.org/jira/browse/HADOOP-15292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387182#comment-16387182
 ] 

Virajith Jalaparti commented on HADOOP-15292:
-

Attached patch fixes this issue by replacing the positioned read with an 
initial seek, and reading the remaining data directly from the open stream.
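
The difference between the two read styles can be sketched with java.nio's FileChannel as a stand-in. This is an analogy under stated assumptions, not the code in the attached patch: the class and method names are hypothetical, and a local file channel does not incur the per-call BlockReader and BlockSender costs described in the issue. It only shows the shape of the change, from passing an offset on every read to positioning once and then reading sequentially.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration class; not part of DistCp.
final class SeekThenRead {

    // Positioned-read style: every read call carries an explicit offset.
    static byte[] copyWithPositionedReads(Path src, long offset, int bufSize)
            throws IOException {
        try (FileChannel ch = FileChannel.open(src, StandardOpenOption.READ)) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            ByteBuffer buf = ByteBuffer.allocate(bufSize);
            long pos = offset;
            int n;
            while ((n = ch.read(buf, pos)) > 0) {
                out.write(buf.array(), 0, n);
                pos += n;
                buf.clear();
            }
            return out.toByteArray();
        }
    }

    // Seek-once style: set the position one time, then read sequentially.
    static byte[] copyWithInitialSeek(Path src, long offset, int bufSize)
            throws IOException {
        try (FileChannel ch = FileChannel.open(src, StandardOpenOption.READ)) {
            ch.position(offset); // the single initial seek
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            ByteBuffer buf = ByteBuffer.allocate(bufSize);
            int n;
            while ((n = ch.read(buf)) > 0) {
                out.write(buf.array(), 0, n);
                buf.clear();
            }
            return out.toByteArray();
        }
    }
}
```

Both methods return the same bytes for the same offset; only the access pattern differs.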

> Distcp's use of pread is slowing it down.
> -
>
> Key: HADOOP-15292
> URL: https://issues.apache.org/jira/browse/HADOOP-15292
> Project: Hadoop Common
>  Issue Type: Bug
>Reporter: Virajith Jalaparti
>Priority: Major
> Attachments: HADOOP-15292.000.patch
>
>
> Distcp currently uses positioned-reads (in 
> RetriableFileCopyCommand#copyBytes) when the source offset is > 0. This 
> results in unnecessary overheads (new BlockReader being created on the 
> client-side, multiple readBlock() calls to the Datanodes, each of which 
> requires the creation of a BlockSender and an inputstream to the ReplicaInfo).


