[jira] [Commented] (HDFS-11015) Enforce timeout in balancer

2016-10-25 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15606292#comment-15606292
 ] 

Kihwal Lee commented on HDFS-11015:
---

[~zhz], thanks for the reviews and commits.

> Enforce timeout in balancer
> ---
>
> Key: HDFS-11015
> URL: https://issues.apache.org/jira/browse/HDFS-11015
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: balancer & mover
> Reporter: Kihwal Lee
> Assignee: Kihwal Lee
> Fix For: 2.8.0, 2.7.4, 3.0.0-alpha2
>
> Attachments: HDFS-11015-1.patch, HDFS-11015-2.patch, 
> HDFS-11015-3.patch, balancer.png
>
>
> 1) Hung node detection: HDFS-6247 removed the socket read timeout while
> adding the periodic response for slow block moves. However, removing the
> long timeout wasn't necessary. The timeout is still useful for avoiding
> hung nodes and does not abort slow moves.
> 2) Enforcing the iteration limit: The 20-minute iteration limit is supposed to
> be enforced, but it is not. An iteration can easily stretch to 30 to 40
> minutes with a long tail. Because of the long tails, the balancer throughput
> does not reach its full potential.
> 3) Slow move detection: For improved throughput, imposing a block move timeout
> is sometimes necessary. We have seen an iteration take over 2 hours
> because of one slow block move. This is mainly for catching exceptionally
> slow moves. Even if the balancer stops waiting, the move will continue and
> finish.
> In order not to undo what HDFS-6247 tried to achieve, it should be possible
> to turn off 3) through configuration.
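
As an illustration of item 3), slow move detection with an off switch might look like the minimal Java sketch below. It is not taken from the attached patches: the class `PendingMove`, the method `isTimedOut`, and the convention that a timeout of 0 or less disables the check are all assumptions made for this example.

```java
// Hypothetical model of slow move detection; not code from the HDFS-11015 patches.
final class PendingMove {
  private final long startMillis;

  PendingMove(long startMillis) {
    this.startMillis = startMillis;
  }

  /**
   * Returns true when the balancer should stop waiting for this move.
   * Assumption: a timeout of 0 or less means the check is disabled, so the
   * behaviour introduced by HDFS-6247 can be kept.
   */
  boolean isTimedOut(long nowMillis, long timeoutMillis) {
    if (timeoutMillis <= 0) {
      return false; // check switched off: wait as long as the move takes
    }
    return nowMillis - startMillis > timeoutMillis;
  }
}

final class SlowMoveDemo {
  public static void main(String[] args) {
    PendingMove move = new PendingMove(0L);
    long fiveMinutes = 5 * 60 * 1000L;
    // Just past the limit: the balancer would stop waiting.
    System.out.println(move.isTimedOut(fiveMinutes + 1, fiveMinutes));
    // Two hours in, but the check is disabled: keep waiting.
    System.out.println(move.isTimedOut(2 * 60 * 60 * 1000L, 0L));
  }
}
```

The sketch only decides when the balancer stops waiting. As the description says, the move itself continues and finishes on its own.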



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-11015) Enforce timeout in balancer

2016-10-25 Thread Zhe Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15606056#comment-15606056
 ] 

Zhe Zhang commented on HDFS-11015:
--

Actually, I just noticed the original target version is 2.8. I committed to
branch-2 and branch-2.8. The backport to branch-2.7 is not very clean; I'm
working on it.

> Enforce timeout in balancer
> ---
>
> Key: HDFS-11015
> URL: https://issues.apache.org/jira/browse/HDFS-11015
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: balancer & mover
> Reporter: Kihwal Lee
> Assignee: Kihwal Lee
> Fix For: 2.8.0, 3.0.0-alpha2
>
> Attachments: HDFS-11015-1.patch, HDFS-11015-2.patch, 
> HDFS-11015-3.patch, balancer.png
>
>






[jira] [Commented] (HDFS-11015) Enforce timeout in balancer

2016-10-25 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15605947#comment-15605947
 ] 

Hudson commented on HDFS-11015:
---

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #10674 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/10674/])
HDFS-11015. Enforce timeout in balancer. Contributed by Kihwal Lee. (zhz: rev f6367c5f44a88cb5eb7edffb015b10b657504a61)
* (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/balancer/Balancer.java
* (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/resources/hdfs-default.xml
* (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSConfigKeys.java
* (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/balancer/Dispatcher.java


> Enforce timeout in balancer
> ---
>
> Key: HDFS-11015
> URL: https://issues.apache.org/jira/browse/HDFS-11015
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: balancer & mover
> Reporter: Kihwal Lee
> Assignee: Kihwal Lee
> Fix For: 3.0.0-alpha2
>
> Attachments: HDFS-11015-1.patch, HDFS-11015-2.patch, 
> HDFS-11015-3.patch, balancer.png
>
>






[jira] [Commented] (HDFS-11015) Enforce timeout in balancer

2016-10-25 Thread Zhe Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15605901#comment-15605901
 ] 

Zhe Zhang commented on HDFS-11015:
--

+1 on the updated patch. Thanks Kihwal.

> Enforce timeout in balancer
> ---
>
> Key: HDFS-11015
> URL: https://issues.apache.org/jira/browse/HDFS-11015
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Kihwal Lee
> Assignee: Kihwal Lee
> Attachments: HDFS-11015-1.patch, HDFS-11015-2.patch, 
> HDFS-11015-3.patch, balancer.png
>
>






[jira] [Commented] (HDFS-11015) Enforce timeout in balancer

2016-10-25 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15605285#comment-15605285
 ] 

Kihwal Lee commented on HDFS-11015:
---

The failed tests all pass when run on my machine. 

> Enforce timeout in balancer
> ---
>
> Key: HDFS-11015
> URL: https://issues.apache.org/jira/browse/HDFS-11015
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Kihwal Lee
> Assignee: Kihwal Lee
> Attachments: HDFS-11015-1.patch, HDFS-11015-2.patch, 
> HDFS-11015-3.patch, balancer.png
>
>






[jira] [Commented] (HDFS-11015) Enforce timeout in balancer

2016-10-24 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15603548#comment-15603548
 ] 

Hadoop QA commented on HDFS-11015:
--

| (x) *-1 overall* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 12s | Docker mode activated. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| -1 | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| +1 | mvninstall | 7m 15s | trunk passed |
| +1 | compile | 0m 44s | trunk passed |
| +1 | checkstyle | 0m 32s | trunk passed |
| +1 | mvnsite | 0m 52s | trunk passed |
| +1 | mvneclipse | 0m 13s | trunk passed |
| +1 | findbugs | 1m 41s | trunk passed |
| +1 | javadoc | 0m 39s | trunk passed |
| +1 | mvninstall | 0m 45s | the patch passed |
| +1 | compile | 0m 42s | the patch passed |
| +1 | javac | 0m 42s | the patch passed |
| -0 | checkstyle | 0m 28s | hadoop-hdfs-project/hadoop-hdfs: The patch generated 4 new + 476 unchanged - 1 fixed = 480 total (was 477) |
| +1 | mvnsite | 0m 47s | the patch passed |
| +1 | mvneclipse | 0m 9s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | xml | 0m 2s | The patch has no ill-formed XML file. |
| +1 | findbugs | 1m 45s | the patch passed |
| +1 | javadoc | 0m 36s | the patch passed |
| -1 | unit | 66m 19s | hadoop-hdfs in the patch failed. |
| +1 | asflicense | 0m 27s | The patch does not generate ASF License warnings. |
| | | 85m 18s | |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.server.datanode.TestDataNodeUUID |
|   | hadoop.hdfs.server.datanode.TestDirectoryScanner |
|   | hadoop.hdfs.server.datanode.fsdataset.impl.TestSpaceReservation |
|   | hadoop.hdfs.server.datanode.TestDataNodeVolumeFailure |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Image:yetus/hadoop:9560f25 |
| JIRA Issue | HDFS-11015 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12835020/HDFS-11015-3.patch |
| Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle xml |
| uname | Linux 6501c4a90a8e 3.13.0-95-generic #142-Ubuntu SMP Fri Aug 12 17:00:09 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / a1a0281 |
| Default Java | 1.8.0_101 |
| findbugs | v3.0.0 |
| checkstyle | https://builds.apache.org/job/PreCommit-HDFS-Build/17266/artifact/patchprocess/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt |
| unit | https://builds.apache.org/job/PreCommit-HDFS-Build/17266/artifact/patchprocess/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt |
|  Test Results | 

[jira] [Commented] (HDFS-11015) Enforce timeout in balancer

2016-10-21 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15595313#comment-15595313
 ] 

Kihwal Lee commented on HDFS-11015:
---

Heh. Looks like the unit tests are doing their job.

> Enforce timeout in balancer
> ---
>
> Key: HDFS-11015
> URL: https://issues.apache.org/jira/browse/HDFS-11015
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Kihwal Lee
> Assignee: Kihwal Lee
> Attachments: HDFS-11015-1.patch, HDFS-11015-2.patch, balancer.png
>
>






[jira] [Commented] (HDFS-11015) Enforce timeout in balancer

2016-10-21 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15595232#comment-15595232
 ] 

Hadoop QA commented on HDFS-11015:
--

| (x) *-1 overall* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 15s | Docker mode activated. |
| +1 | @author | 0m 1s | The patch does not contain any @author tags. |
| -1 | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| +1 | mvninstall | 6m 39s | trunk passed |
| +1 | compile | 0m 44s | trunk passed |
| +1 | checkstyle | 0m 30s | trunk passed |
| +1 | mvnsite | 0m 51s | trunk passed |
| +1 | mvneclipse | 0m 12s | trunk passed |
| +1 | findbugs | 1m 45s | trunk passed |
| +1 | javadoc | 0m 40s | trunk passed |
| +1 | mvninstall | 0m 54s | the patch passed |
| +1 | compile | 0m 48s | the patch passed |
| +1 | javac | 0m 48s | the patch passed |
| -0 | checkstyle | 0m 29s | hadoop-hdfs-project/hadoop-hdfs: The patch generated 4 new + 477 unchanged - 1 fixed = 481 total (was 478) |
| +1 | mvnsite | 0m 55s | the patch passed |
| +1 | mvneclipse | 0m 10s | the patch passed |
| -1 | whitespace | 0m 0s | The patch has 2 line(s) that end in whitespace. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply |
| +1 | xml | 0m 1s | The patch has no ill-formed XML file. |
| +1 | findbugs | 2m 3s | the patch passed |
| +1 | javadoc | 0m 39s | the patch passed |
| -1 | unit | 65m 0s | hadoop-hdfs in the patch failed. |
| -1 | asflicense | 0m 19s | The patch generated 2 ASF License warnings. |
| | | 84m 10s | |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.server.balancer.TestBalancerWithHANameNodes |
|   | hadoop.hdfs.server.balancer.TestBalancerWithMultipleNameNodes |
|   | hadoop.hdfs.server.blockmanagement.TestBlockTokenWithDFS |
|   | hadoop.hdfs.server.datanode.TestDataNodeLifeline |
|   | hadoop.hdfs.server.balancer.TestBalancer |
|   | hadoop.hdfs.server.balancer.TestBalancerWithSaslDataTransfer |
|   | hadoop.hdfs.server.balancer.TestBalancerWithEncryptedTransfer |
|   | hadoop.hdfs.server.blockmanagement.TestBlockTokenWithDFSStriped |
|   | hadoop.hdfs.server.balancer.TestBalancerWithNodeGroup |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Image:yetus/hadoop:9560f25 |
| JIRA Issue | HDFS-11015 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12834648/HDFS-11015-2.patch |
| Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle xml |
| uname | Linux 2e64e560c25d 3.13.0-92-generic #139-Ubuntu SMP Tue Jun 28 20:42:26 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 754cb4e |
| Default Java | 

[jira] [Commented] (HDFS-11015) Enforce timeout in balancer

2016-10-18 Thread Zhe Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15586821#comment-15586821
 ] 

Zhe Zhang commented on HDFS-11015:
--

Thanks Kihwal for sharing the results; they look really interesting!

bq. However, the removal of the long timeout wasn't necessary. The timeout is 
still useful for avoiding hung nodes and does not abort slow moves.
I agree. We should use a combination of {{IN_PROGRESS}} heartbeating and socket 
timeout.
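
The combination suggested here can be pictured with the simplified Java model below. It is not the actual data transfer code: a `BlockingQueue` stands in for the socket, `poll` with a timeout stands in for the socket read timeout, and the names `MoveStatus`, `WaitResult`, and `MoveResponseReader` are hypothetical. The point is that every periodic {{IN_PROGRESS}} response restarts the wait, so a slow move is never aborted, while a peer that sends nothing at all is detected as hung.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical model: a queue stands in for the socket to the datanode.
enum MoveStatus { IN_PROGRESS, SUCCESS }

enum WaitResult { COMPLETED, PEER_HUNG, INTERRUPTED }

final class MoveResponseReader {
  private MoveResponseReader() {}

  static WaitResult awaitCompletion(BlockingQueue<MoveStatus> responses,
                                    long readTimeoutMillis) {
    try {
      while (true) {
        // Each poll gets a fresh timeout, like a read timeout on a socket.
        MoveStatus status =
            responses.poll(readTimeoutMillis, TimeUnit.MILLISECONDS);
        if (status == null) {
          return WaitResult.PEER_HUNG; // nothing arrived within the timeout
        }
        if (status == MoveStatus.SUCCESS) {
          return WaitResult.COMPLETED;
        }
        // IN_PROGRESS: the peer is alive but slow, so keep waiting.
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return WaitResult.INTERRUPTED;
    }
  }
}

final class HeartbeatDemo {
  public static void main(String[] args) {
    BlockingQueue<MoveStatus> slowButAlive = new LinkedBlockingQueue<>();
    slowButAlive.add(MoveStatus.IN_PROGRESS);
    slowButAlive.add(MoveStatus.IN_PROGRESS);
    slowButAlive.add(MoveStatus.SUCCESS);
    System.out.println(MoveResponseReader.awaitCompletion(slowButAlive, 100L));

    BlockingQueue<MoveStatus> hung = new LinkedBlockingQueue<>();
    hung.add(MoveStatus.IN_PROGRESS); // one heartbeat, then silence
    System.out.println(MoveResponseReader.awaitCompletion(hung, 100L));
  }
}
```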

Regarding {{DFS_BALANCER_BLOCK_MOVE_TIMEOUT}}:
# Since it's named and documented as a Balancer config knob, I guess we should 
initialize it inside {{Balancer}} and pass it to {{Dispatcher}}? Otherwise 
{{Mover}} is silently affected by this config.
# Nit: "3 to 5 minute" should be "3 to 5 minutes"
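
One way to picture the first point is the sketch below, which reads the setting in the balancer and hands the value to the dispatcher explicitly, so the mover never picks it up by accident. `java.util.Properties` stands in for the Hadoop configuration object, and the class names, the key string, and the default of 0 are assumptions for illustration only.

```java
import java.util.Properties;

// Hypothetical stand-in for the dispatcher: it no longer reads the config
// itself and only uses the value it is given.
final class DispatcherSketch {
  private final long blockMoveTimeoutMillis;

  DispatcherSketch(long blockMoveTimeoutMillis) {
    this.blockMoveTimeoutMillis = blockMoveTimeoutMillis;
  }

  long getBlockMoveTimeoutMillis() {
    return blockMoveTimeoutMillis;
  }
}

// Hypothetical stand-in for the balancer: the only place that knows the key.
final class BalancerSketch {
  // Assumed key string, used here for illustration.
  static final String BLOCK_MOVE_TIMEOUT_KEY = "dfs.balancer.block-move.timeout";

  private BalancerSketch() {}

  static DispatcherSketch newDispatcher(Properties conf) {
    long timeout = Long.parseLong(conf.getProperty(BLOCK_MOVE_TIMEOUT_KEY, "0"));
    return new DispatcherSketch(timeout);
  }
}

// Hypothetical stand-in for the mover: it ignores the balancer knob entirely.
final class MoverSketch {
  private MoverSketch() {}

  static DispatcherSketch newDispatcher(Properties conf) {
    return new DispatcherSketch(0L); // 0 is assumed to mean "no timeout"
  }
}

final class ConfigPassingDemo {
  public static void main(String[] args) {
    Properties conf = new Properties();
    conf.setProperty(BalancerSketch.BLOCK_MOVE_TIMEOUT_KEY, "300000");
    System.out.println(BalancerSketch.newDispatcher(conf).getBlockMoveTimeoutMillis());
    System.out.println(MoverSketch.newDispatcher(conf).getBlockMoveTimeoutMillis());
  }
}
```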

Regarding the 20-minute timeout:
# There are only two {{while}} loops in {{Dispatcher}} so I think your patch 
does a pretty good job in enforcing the timeout. Can we simplify the loops by 
merging the conditions into the {{while}} statement? Especially the one in 
{{dispatchBlocks}}, {{isTimeUp}} isn't necessary.
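
A minimal sketch of a loop with the time check folded into the {{while}} condition is shown below. It is not the code of {{dispatchBlocks}}; the class `IterationLoopSketch`, its `dispatch` method, and the injected clock are hypothetical and exist only to make the example self-contained and testable.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.LongSupplier;

// Hypothetical model of an iteration loop with a hard time limit.
final class IterationLoopSketch {
  private IterationLoopSketch() {}

  /**
   * Runs pending moves until none are left or the iteration time is used up.
   * The deadline check is part of the loop condition, so no separate
   * "time is up" flag is needed.
   */
  static int dispatch(Queue<Runnable> pendingMoves,
                      long maxIterationMillis,
                      LongSupplier clock) {
    final long deadline = clock.getAsLong() + maxIterationMillis;
    int dispatched = 0;
    while (!pendingMoves.isEmpty() && clock.getAsLong() < deadline) {
      pendingMoves.remove().run();
      dispatched++;
    }
    return dispatched;
  }
}

final class IterationLoopDemo {
  public static void main(String[] args) {
    final long[] now = {0L}; // fake clock, in minutes
    Queue<Runnable> moves = new ArrayDeque<>();
    for (int i = 0; i < 5; i++) {
      moves.add(() -> now[0] += 10L); // each move "takes" 10 minutes
    }
    int done = IterationLoopSketch.dispatch(moves, 20L, () -> now[0]);
    System.out.println(done + " dispatched, " + moves.size() + " left for the next iteration");
  }
}
```

With the fake clock above, the loop stops as soon as the 20-minute budget is spent and leaves the remaining moves for the next iteration.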

Overall the patch looks pretty good. I realized that it doesn't really cover 
the HDFS-10977 scenario. May I suggest we update the subject to "Enforce 
iteration time limit and slow block move timeout in Balancer"? It doesn't look 
easy to add a unit test, and I'm OK with that in this case because the added 
logic is pretty straightforward. I'm happy to open a separate test JIRA though.

> Enforce timeout in balancer
> ---
>
> Key: HDFS-11015
> URL: https://issues.apache.org/jira/browse/HDFS-11015
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Kihwal Lee
> Assignee: Kihwal Lee
> Attachments: HDFS-11015-1.patch, balancer.png
>
>






[jira] [Commented] (HDFS-11015) Enforce timeout in balancer

2016-10-18 Thread Kihwal Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15585860#comment-15585860
 ] 

Kihwal Lee commented on HDFS-11015:
---

Here is a graph showing the improvement on a big, expanding cluster (i.e. one 
with a lot to balance). The thin long tails are eliminated by adding the block 
move timeout and the read timeout. The multi-minute ramp-down tail is gone 
after enforcing the iteration limit.

!https://issues.apache.org/jira/secure/attachment/12833974/balancer.png!

> Enforce timeout in balancer
> ---
>
> Key: HDFS-11015
> URL: https://issues.apache.org/jira/browse/HDFS-11015
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Kihwal Lee
> Assignee: Kihwal Lee
> Attachments: HDFS-11015-1.patch, balancer.png
>
>






[jira] [Commented] (HDFS-11015) Enforce timeout in balancer

2016-10-17 Thread Zhe Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15584181#comment-15584181
 ] 

Zhe Zhang commented on HDFS-11015:
--

[~mwagner] has been working on a related issue, HDFS-10966.

> Enforce timeout in balancer
> ---
>
> Key: HDFS-11015
> URL: https://issues.apache.org/jira/browse/HDFS-11015
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Kihwal Lee
> Assignee: Kihwal Lee
> Attachments: HDFS-11015-1.patch
>
>






[jira] [Commented] (HDFS-11015) Enforce timeout in balancer

2016-10-17 Thread Zhe Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15584179#comment-15584179
 ] 

Zhe Zhang commented on HDFS-11015:
--

Thanks for the patch, Kihwal. A while ago I reported a similar issue, HDFS-10977.

> Enforce timeout in balancer
> ---
>
> Key: HDFS-11015
> URL: https://issues.apache.org/jira/browse/HDFS-11015
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Kihwal Lee
> Assignee: Kihwal Lee
> Attachments: HDFS-11015-1.patch
>
>






[jira] [Commented] (HDFS-11015) Enforce timeout in balancer

2016-10-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583217#comment-15583217
 ] 

Hadoop QA commented on HDFS-11015:
--

| (x) *-1 overall* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 17s | Docker mode activated. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| -1 | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| +1 | mvninstall | 8m 1s | trunk passed |
| +1 | compile | 0m 47s | trunk passed |
| +1 | checkstyle | 0m 32s | trunk passed |
| +1 | mvnsite | 0m 57s | trunk passed |
| +1 | mvneclipse | 0m 12s | trunk passed |
| +1 | findbugs | 1m 48s | trunk passed |
| +1 | javadoc | 0m 40s | trunk passed |
| +1 | mvninstall | 0m 49s | the patch passed |
| +1 | compile | 0m 46s | the patch passed |
| +1 | javac | 0m 46s | the patch passed |
| -0 | checkstyle | 0m 27s | hadoop-hdfs-project/hadoop-hdfs: The patch generated 3 new + 440 unchanged - 0 fixed = 443 total (was 440) |
| +1 | mvnsite | 0m 51s | the patch passed |
| +1 | mvneclipse | 0m 11s | the patch passed |
| -1 | whitespace | 0m 0s | The patch has 1 line(s) that end in whitespace. Use git apply --whitespace=fix <>. Refer https://git-scm.com/docs/git-apply |
| +1 | xml | 0m 1s | The patch has no ill-formed XML file. |
| +1 | findbugs | 1m 55s | the patch passed |
| +1 | javadoc | 0m 38s | the patch passed |
| +1 | unit | 59m 2s | hadoop-hdfs in the patch passed. |
| +1 | asflicense | 0m 20s | The patch does not generate ASF License warnings. |
| | | 79m 28s | |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Image:yetus/hadoop:9560f25 |
| JIRA Issue | HDFS-11015 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12833775/HDFS-11015-1.patch |
| Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle xml |
| uname | Linux 82c0b333ffc0 3.13.0-92-generic #139-Ubuntu SMP Tue Jun 28 20:42:26 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/hadoop/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / ed9fcbe |
| Default Java | 1.8.0_101 |
| findbugs | v3.0.0 |
| checkstyle | https://builds.apache.org/job/PreCommit-HDFS-Build/17184/artifact/patchprocess/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt |
| whitespace | https://builds.apache.org/job/PreCommit-HDFS-Build/17184/artifact/patchprocess/whitespace-eol.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/17184/testReport/ |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
| Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/17184/console |
| Powered by | Apache Yetus 0.4.0-SNAPSHOT