[jira] [Commented] (HDFS-15041) Make MAX_LOCK_HOLD_MS and full queue size configurable

2020-02-22 Thread Hadoop QA (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042804#comment-17042804
 ] 

Hadoop QA commented on HDFS-15041:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
53s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  
0s{color} | {color:red} The patch doesn't appear to include any new or modified 
tests. Please justify why no new tests are needed for this patch. Also please 
list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 
21s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m  
0s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
50s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
8s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
16m  9s{color} | {color:green} branch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
49s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
39s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  1m 
 2s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 
57s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 
57s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  0m 
46s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green}  1m  
2s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 1s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} xml {color} | {color:green}  0m  
1s{color} | {color:green} The patch has no ill-formed XML file. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 
14m  7s{color} | {color:green} patch has no errors when building and testing 
our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
53s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
36s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}108m 54s{color} 
| {color:red} hadoop-hdfs in the patch failed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 
34s{color} | {color:green} The patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black}174m 43s{color} | 
{color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.TestMultipleNNPortQOP |
|   | hadoop.hdfs.TestRollingUpgrade |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.6 Server=19.03.6 Image:yetus/hadoop:c44943d1fc3 |
| JIRA Issue | HDFS-15041 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12994228/HDFS-15041.003.patch |
| Optional Tests |  dupname  asflicense  compile  javac  javadoc  mvninstall  
mvnsite  unit  shadedclient  findbugs  checkstyle  xml  |
| uname | Linux cb59fbed169b 4.15.0-74-generic #84-Ubuntu SMP Thu Dec 19 
08:06:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / ba9025c |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_242 |
| findbugs | v3.1.0-RC1 |
| unit | 
https://builds.apache.org/job/PreCommit-HDFS-Build/28827/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt
 |
|  Test Results | 

[jira] [Commented] (HDFS-15041) Make MAX_LOCK_HOLD_MS and full queue size configurable

2020-02-22 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042802#comment-17042802
 ] 

Ayush Saxena commented on HDFS-15041:
-

Thanx [~zhuqi] for the update.
v004 LGTM +1 (Pending Jenkins)

> Make MAX_LOCK_HOLD_MS and full queue size configurable
> --
>
> Key: HDFS-15041
> URL: https://issues.apache.org/jira/browse/HDFS-15041
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: HDFS-15041.001.patch, HDFS-15041.002.patch, 
> HDFS-15041.003.patch, HDFS-15041.004.patch
>
>
> Currently, MAX_LOCK_HOLD_MS and the full queue size are hard-coded. But 
> different clusters have different needs for latency and for the queue health 
> standard. We had better make these two parameters configurable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15041) Make MAX_LOCK_HOLD_MS and full queue size configurable

2020-02-22 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042798#comment-17042798
 ] 

zhuqi commented on HDFS-15041:
--

cc [~ayushtkn]

Thanks for your review.

I have fixed it.

> Make MAX_LOCK_HOLD_MS and full queue size configurable
> --
>
> Key: HDFS-15041
> URL: https://issues.apache.org/jira/browse/HDFS-15041
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: HDFS-15041.001.patch, HDFS-15041.002.patch, 
> HDFS-15041.003.patch, HDFS-15041.004.patch
>
>
> Currently, MAX_LOCK_HOLD_MS and the full queue size are hard-coded. But 
> different clusters have different needs for latency and for the queue health 
> standard. We had better make these two parameters configurable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15041) Make MAX_LOCK_HOLD_MS and full queue size configurable

2020-02-22 Thread zhuqi (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhuqi updated HDFS-15041:
-
Attachment: HDFS-15041.004.patch
Status: Patch Available  (was: In Progress)

> Make MAX_LOCK_HOLD_MS and full queue size configurable
> --
>
> Key: HDFS-15041
> URL: https://issues.apache.org/jira/browse/HDFS-15041
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: HDFS-15041.001.patch, HDFS-15041.002.patch, 
> HDFS-15041.003.patch, HDFS-15041.004.patch
>
>
> Currently, MAX_LOCK_HOLD_MS and the full queue size are hard-coded. But 
> different clusters have different needs for latency and for the queue health 
> standard. We had better make these two parameters configurable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15041) Make MAX_LOCK_HOLD_MS and full queue size configurable

2020-02-22 Thread zhuqi (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhuqi updated HDFS-15041:
-
Status: In Progress  (was: Patch Available)

> Make MAX_LOCK_HOLD_MS and full queue size configurable
> --
>
> Key: HDFS-15041
> URL: https://issues.apache.org/jira/browse/HDFS-15041
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: HDFS-15041.001.patch, HDFS-15041.002.patch, 
> HDFS-15041.003.patch, HDFS-15041.004.patch
>
>
> Currently, MAX_LOCK_HOLD_MS and the full queue size are hard-coded. But 
> different clusters have different needs for latency and for the queue health 
> standard. We had better make these two parameters configurable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15041) Make MAX_LOCK_HOLD_MS and full queue size configurable

2020-02-22 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042795#comment-17042795
 ] 

Ayush Saxena commented on HDFS-15041:
-

Thanx [~zhuqi] for the update.

{code:java}
+TimeUnit.MICROSECONDS);
{code}
Shouldn't this be MILLISECONDS?

And here, change it to 4ms:

{code:java}
+4
{code}
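
For context, a minimal sketch (with a hypothetical key name, not one from the patch) of why the unit argument matters: {{getTimeDuration}} interprets suffix-less values in, and converts suffixed values to, the {{TimeUnit}} that is passed in, so reading a millisecond-scale setting with {{TimeUnit.MICROSECONDS}} would silently inflate the number the caller gets back.

{code:java}
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.conf.Configuration;

public class TimeUnitSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.set("my.test.lock.hold.duration", "4ms"); // hypothetical key

    // Interpreted as 4 ms and returned in milliseconds: 4
    long ms = conf.getTimeDuration(
        "my.test.lock.hold.duration", 4, TimeUnit.MILLISECONDS);

    // The same setting read with MICROSECONDS is returned as 4000
    // microseconds; code that later treats the result as milliseconds
    // would be off by a factor of 1000.
    long us = conf.getTimeDuration(
        "my.test.lock.hold.duration", 4, TimeUnit.MICROSECONDS);

    System.out.println(ms + " " + us); // prints: 4 4000
  }
}
{code}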


> Make MAX_LOCK_HOLD_MS and full queue size configurable
> --
>
> Key: HDFS-15041
> URL: https://issues.apache.org/jira/browse/HDFS-15041
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: HDFS-15041.001.patch, HDFS-15041.002.patch, 
> HDFS-15041.003.patch
>
>
> Currently, MAX_LOCK_HOLD_MS and the full queue size are hard-coded. But 
> different clusters have different needs for latency and for the queue health 
> standard. We had better make these two parameters configurable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15041) Make MAX_LOCK_HOLD_MS and full queue size configurable

2020-02-22 Thread zhuqi (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042783#comment-17042783
 ] 

zhuqi commented on HDFS-15041:
--

cc [~ayushtkn]    [~weichiu]

I have changed the configuration to support time units. Let me know if any 
other change is needed before merging it.

Thanks.

> Make MAX_LOCK_HOLD_MS and full queue size configurable
> --
>
> Key: HDFS-15041
> URL: https://issues.apache.org/jira/browse/HDFS-15041
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: HDFS-15041.001.patch, HDFS-15041.002.patch, 
> HDFS-15041.003.patch
>
>
> Currently, MAX_LOCK_HOLD_MS and the full queue size are hard-coded. But 
> different clusters have different needs for latency and for the queue health 
> standard. We had better make these two parameters configurable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15041) Make MAX_LOCK_HOLD_MS and full queue size configurable

2020-02-22 Thread zhuqi (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhuqi updated HDFS-15041:
-
Attachment: HDFS-15041.003.patch
Status: Patch Available  (was: In Progress)

> Make MAX_LOCK_HOLD_MS and full queue size configurable
> --
>
> Key: HDFS-15041
> URL: https://issues.apache.org/jira/browse/HDFS-15041
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: HDFS-15041.001.patch, HDFS-15041.002.patch, 
> HDFS-15041.003.patch
>
>
> Currently, MAX_LOCK_HOLD_MS and the full queue size are hard-coded. But 
> different clusters have different needs for latency and for the queue health 
> standard. We had better make these two parameters configurable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15041) Make MAX_LOCK_HOLD_MS and full queue size configurable

2020-02-22 Thread zhuqi (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhuqi updated HDFS-15041:
-
Status: In Progress  (was: Patch Available)

> Make MAX_LOCK_HOLD_MS and full queue size configurable
> --
>
> Key: HDFS-15041
> URL: https://issues.apache.org/jira/browse/HDFS-15041
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: HDFS-15041.001.patch, HDFS-15041.002.patch
>
>
> Currently, MAX_LOCK_HOLD_MS and the full queue size are hard-coded. But 
> different clusters have different needs for latency and for the queue health 
> standard. We had better make these two parameters configurable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15186) Erasure Coding: Decommission may generate the parity block's content with all 0 in some case

2020-02-22 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042780#comment-17042780
 ] 

Fei Hui edited comment on HDFS-15186 at 2/23/20 2:52 AM:
-

[~yaoguangdong] Thanks for reporting this. Good catch.
Sorry for the late reply, I couldn't receive emails these days.
+1 for [~ayushtkn]'s suggestions. I think index 6 is not in liveIndices and 
busyIndices, which causes this problem. Maybe we should fix it on the namenode 
side.


was (Author: ferhui):
[~yaoguangdong]Thanks for reporting this !Good Catch!
Sorry for late, I couldn't receive emails these days!
+1 for [~ayushtkn] suggestions. I thinks indice[6] is not in liveindcies and 
busyindices, this cause this problem. Maybe we should fix it in namenode side.

> Erasure Coding: Decommission may generate the parity block's content with all 
> 0 in some case
> 
>
> Key: HDFS-15186
> URL: https://issues.apache.org/jira/browse/HDFS-15186
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding
>Affects Versions: 3.0.3, 3.2.1, 3.1.3
>Reporter: Yao Guangdong
>Assignee: Yao Guangdong
>Priority: Critical
> Attachments: HDFS-15186.001.patch
>
>
> I can find some parity blocks whose content is all 0 when I decommission 
> more than one DataNode from a cluster, and the probability is quite high 
> (parts per thousand). This is a big problem: if we read data from the 
> all-zero parity block, or use it to recover another block, we end up using 
> corrupt data without even knowing it.
> Some cases are below:
> B: Busy DataNode, 
> D: Decommissioning DataNode,
> others are normal.
> 1. Group indices are [0, 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)].
> 2. Group indices are [0(B,D), 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)].
> 
> In the first case, when the block group indices are [0, 1, 2, 3, 4, 5, 
> 6(B,D), 7, 8(D)], the DN may receive a reconstruct-block command with 
> liveIndices=[0, 1, 2, 3, 4, 5, 7, 8] and a targets field (in the class 
> StripedReconstructionInfo) of length 2.
> A targets length of 2 means that the DataNode needs to recover 2 internal 
> blocks in the current code. But from the liveIndices we can only find 1 
> missing block, so the method StripedWriter#initTargetIndices uses 0 as the 
> default block to recover, without checking whether index 0 is already among 
> the source indices.
> The EC algorithm is then invoked with source indices [0, 1, 2, 3, 4, 5] to 
> recover indices [6, 0]. We can see that index 0 is in both the source 
> indices and the target indices in this case, and the target buffer the EC 
> algorithm returns for index 6 is always all 0. At first I thought this was 
> the EC algorithm's problem, because it should be more fault tolerant, and I 
> tried to fix it there. But that is too hard, because there are too many 
> cases; the second example above is another one (source indices [1, 2, 3, 4, 
> 5, 7] used to recover indices [0, 6, 0]). So I changed my mind and instead 
> invoke the EC algorithm with correct parameters, which means removing the 
> duplicate target index 0 in this case. Finally, I fixed it in this way.
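
For readers following along, a minimal sketch (not the actual HDFS-15186 patch) of the de-duplication idea described above, i.e. dropping target indices that already appear among the source indices before invoking the EC decoder:

{code:java}
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TargetIndexDedup {
  // Drop any target index that is already a source index, so the decoder
  // is not asked to "recover" a block it was also given as input.
  static short[] dedupTargets(short[] sourceIndices, short[] targetIndices) {
    Set<Short> sources = new HashSet<>();
    for (short s : sourceIndices) {
      sources.add(s);
    }
    List<Short> kept = new ArrayList<>();
    for (short t : targetIndices) {
      if (!sources.contains(t)) {
        kept.add(t);
      }
    }
    short[] out = new short[kept.size()];
    for (int i = 0; i < out.length; i++) {
      out[i] = kept.get(i);
    }
    return out;
  }

  public static void main(String[] args) {
    // The first case from the description: sources [0..5], targets [6, 0].
    short[] deduped = dedupTargets(
        new short[] {0, 1, 2, 3, 4, 5}, new short[] {6, 0});
    // deduped is [6]: only the genuinely missing index gets reconstructed.
    System.out.println(java.util.Arrays.toString(deduped));
  }
}
{code}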



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15186) Erasure Coding: Decommission may generate the parity block's content with all 0 in some case

2020-02-22 Thread Fei Hui (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042780#comment-17042780
 ] 

Fei Hui commented on HDFS-15186:


[~yaoguangdong] Thanks for reporting this! Good catch!
Sorry for the late reply, I couldn't receive emails these days!
+1 for [~ayushtkn]'s suggestions. I think index 6 is not in liveIndices and 
busyIndices, which causes this problem. Maybe we should fix it on the namenode 
side.

> Erasure Coding: Decommission may generate the parity block's content with all 
> 0 in some case
> 
>
> Key: HDFS-15186
> URL: https://issues.apache.org/jira/browse/HDFS-15186
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding
>Affects Versions: 3.0.3, 3.2.1, 3.1.3
>Reporter: Yao Guangdong
>Assignee: Yao Guangdong
>Priority: Critical
> Attachments: HDFS-15186.001.patch
>
>
> I can find some parity blocks whose content is all 0 when I decommission 
> more than one DataNode from a cluster, and the probability is quite high 
> (parts per thousand). This is a big problem: if we read data from the 
> all-zero parity block, or use it to recover another block, we end up using 
> corrupt data without even knowing it.
> Some cases are below:
> B: Busy DataNode, 
> D: Decommissioning DataNode,
> others are normal.
> 1. Group indices are [0, 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)].
> 2. Group indices are [0(B,D), 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)].
> 
> In the first case, when the block group indices are [0, 1, 2, 3, 4, 5, 
> 6(B,D), 7, 8(D)], the DN may receive a reconstruct-block command with 
> liveIndices=[0, 1, 2, 3, 4, 5, 7, 8] and a targets field (in the class 
> StripedReconstructionInfo) of length 2.
> A targets length of 2 means that the DataNode needs to recover 2 internal 
> blocks in the current code. But from the liveIndices we can only find 1 
> missing block, so the method StripedWriter#initTargetIndices uses 0 as the 
> default block to recover, without checking whether index 0 is already among 
> the source indices.
> The EC algorithm is then invoked with source indices [0, 1, 2, 3, 4, 5] to 
> recover indices [6, 0]. We can see that index 0 is in both the source 
> indices and the target indices in this case, and the target buffer the EC 
> algorithm returns for index 6 is always all 0. At first I thought this was 
> the EC algorithm's problem, because it should be more fault tolerant, and I 
> tried to fix it there. But that is too hard, because there are too many 
> cases; the second example above is another one (source indices [1, 2, 3, 4, 
> 5, 7] used to recover indices [0, 6, 0]). So I changed my mind and instead 
> invoke the EC algorithm with correct parameters, which means removing the 
> duplicate target index 0 in this case. Finally, I fixed it in this way.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15041) Make MAX_LOCK_HOLD_MS and full queue size configurable

2020-02-22 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042703#comment-17042703
 ] 

Ayush Saxena commented on HDFS-15041:
-

Thanx [~zhuqi]  for the patch.

We can make the configuration support time units, and get the time using 
{{conf.getTimeDuration(..)}}.
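
A minimal sketch of that suggestion (the key and default below are hypothetical, for illustration only; the real names are whatever the patch defines):

{code:java}
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.conf.Configuration;

public class LockHoldConfigSketch {
  // Hypothetical key/default, not the names from the patch.
  static final String MAX_LOCK_HOLD_KEY = "dfs.namenode.example.max.lock.hold";
  static final long MAX_LOCK_HOLD_DEFAULT_MS = 4;

  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Users may now write "4ms", "1s", etc.; suffix-less values are
    // interpreted in the unit passed below, and the result is returned
    // in that unit.
    long maxLockHoldMs = conf.getTimeDuration(
        MAX_LOCK_HOLD_KEY, MAX_LOCK_HOLD_DEFAULT_MS, TimeUnit.MILLISECONDS);
    System.out.println("max lock hold (ms): " + maxLockHoldMs);
  }
}
{code}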

> Make MAX_LOCK_HOLD_MS and full queue size configurable
> --
>
> Key: HDFS-15041
> URL: https://issues.apache.org/jira/browse/HDFS-15041
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: zhuqi
>Assignee: zhuqi
>Priority: Major
> Attachments: HDFS-15041.001.patch, HDFS-15041.002.patch
>
>
> Currently, MAX_LOCK_HOLD_MS and the full queue size are hard-coded. But 
> different clusters have different needs for latency and for the queue health 
> standard. We had better make these two parameters configurable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15103) JMX endpoint and "dfsadmin" report 1 corrupt block; "fsck" reports 0

2020-02-22 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042702#comment-17042702
 ] 

Ayush Saxena commented on HDFS-15103:
-

The corrupt replica can be from an open file; fsck has an option to include 
open files too.

Try using fsck along with that option, and check fsck with the block id; the 
corrupt replica should show up there.
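
For example (the block id below is hypothetical), something along these lines:

{code}
$ ./hdfs fsck / -openforwrite -files -blocks | grep -i corrupt
$ ./hdfs fsck -blockId blk_1073741825
{code}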

> JMX endpoint and "dfsadmin" report 1 corrupt block; "fsck" reports 0
> 
>
> Key: HDFS-15103
> URL: https://issues.apache.org/jira/browse/HDFS-15103
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.1
> Environment: * CentOS 7
>  * HDFS 3.2.1
>  * 2x HA NNs
>  * 5x identical DNs
>Reporter: Mark Ormesher
>Priority: Major
>
> We're seeing a long-running discrepancy between the number of corrupted 
> blocks reported by the JMX endpoint and {{dfsadmin -report}} (1) and by 
> {{fsck /}} (0). This has persisted through rolling restarts of the NNs and 
> DNs, and through complete shutdowns for the HDFS cluster for unrelated 
> maintenance.
> {panel:title=JMX endpoint snippet}
> {code}
> (...)
> "CorruptBlocks" : 1,
> "ScheduledReplicationBlocks" : 0,
> "PendingDeletionBlocks" : 0,
> "LowRedundancyReplicatedBlocks" : 0,
> "CorruptReplicatedBlocks" : 1,
> "MissingReplicatedBlocks" : 0,
> "MissingReplicationOneBlocks" : 0,
> (...)
> {code}
> {panel}
> {panel:title=dfsadmin -report}
> {code}
> $ ./hdfs dfsadmin -report | grep -i corrupt
>   Blocks with corrupt replicas: 1
>   Block groups with corrupt internal blocks: 0
> {code}
> {panel}
> {panel:title=fsck /}
> {code}
> $ ./hdfs fsck / -files -blocks | grep -i corrupt
>   Corrupt blocks:0
>   Corrupt block groups:  0
> {code}
> {panel}
> I've read through the related tickets below, all of which suggest this issue 
> was resolved in 2.7.8, but we're seeing it in 3.2.1.
> https://issues.apache.org/jira/browse/HDFS-8533
> https://issues.apache.org/jira/browse/HDFS-10213
> https://issues.apache.org/jira/browse/HDFS-13999
> How can we work out whether we really do have a corrupt block, and if we do 
> how can we work out which block it is if {{fsck}} thinks everything is fine?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15114) JournalNodes' committed-txid file includes aborted transaction, breaks NameNode startup

2020-02-22 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042696#comment-17042696
 ] 

Ayush Saxena commented on HDFS-15114:
-

Thanx [~Steven Rand]  for the report.

Are you able to reproduce the issue?

Will it be possible to extend a UT for it?

> JournalNodes' committed-txid file includes aborted transaction, breaks 
> NameNode startup
> ---
>
> Key: HDFS-15114
> URL: https://issues.apache.org/jira/browse/HDFS-15114
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: journal-node, namenode
>Affects Versions: 3.2.1
>Reporter: Steven Rand
>Priority: Major
>
> A couple of days ago, our active NameNode in an HA setup aborted a 
> {{QuorumOutputStream}} starting at tx 3389424 because tx 3389425 failed to be 
> written. This was likely related to a rolling restart of the three 
> JournalNodes that was happening at this time. The NameNode logged:
> {code:java}
> 2020-01-11 02:00:50,229 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll Edit Log from 
> 10.6.1.181
> 2020-01-11 02:00:50,229 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog: Rolling edit logs
> 2020-01-11 02:00:50,229 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog: Ending log segment 3389424, 
> 3389424
> 2020-01-11 02:00:50,229 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 2 
> Total time for transactions(ms): 1 Number of transactions batched in Syncs:
>  0 Number of syncs: 1 SyncTimes(ms): 1 7
> 2020-01-11 02:00:50,245 WARN 
> org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Remote journal 
> 10.6.2.187:8485 failed to write txns 3389425-3389425. Will try to write to 
> this JN again after the next log roll.
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.qjournal.protocol.JournalOutOfSyncException):
>  Can't write, no segment open ; journal id: 
> at org.apache.hadoop.hdfs.qjournal.server.Journal.checkSync(Journal.java:545)
> ... rest of stacktrace ...
> // the same warning for the second JournalNode
> // the same warning for the third JournalNode
> 2020-01-11 02:00:50,246 ERROR 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: flush failed for 
> required journal (JournalAndStream(mgr=QJM to [10.6.1.4:8485, 
> 10.6.1.181:8485, 10.6.2.187:8485], stream=QuorumOutputStream starting at txid 
> 3389424))
> org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many 
> exceptions to achieve quorum size 2/3. 3 exceptions thrown:
> // the same "Can't write, no segment open ; journal id: " error 
> for all 3 JournalNodes
> 2020-01-11 02:00:50,246 WARN 
> org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Aborting 
> QuorumOutputStream starting at txid 3389424
> 2020-01-11 02:00:50,255 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
> status 1: Error: flush failed for required journal (JournalAndStream(mgr=QJM 
> to [10.6.1.4:8485, 10.6.1.181:8485, 10.6.2.187:8485], 
> stream=QuorumOutputStream starting at txid 3389424))
> {code}
> Even though the stream was aborted, the {{committed-txid}} file on each of 
> the three JournalNodes was updated to be {{3389424}}.
> This caused both NameNodes to fail to start with this error:
>   
> {code:java}
> 2020-01-11 02:54:35,483 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting services 
> required for active state
> 2020-01-11 02:54:35,491 INFO 
> org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Starting 
> recovery process for unclosed journal segments...
> 2020-01-11 02:54:35,537 INFO 
> org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Successfully 
> started new epoch 80
> 2020-01-11 02:54:35,537 INFO 
> org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Beginning 
> recovery of unclosed segment starting at txid 3389422
> 2020-01-11 02:54:35,574 INFO 
> org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Recovery prepare 
> phase complete. Responses:
> 10.6.1.4:8485: segmentState { startTxId: 3389422 endTxId: 3389423 
> isInProgress: false } lastWriterEpoch: 57 lastCommittedTxId: 3389424
> 10.6.2.187:8485: segmentState { startTxId: 3389422 endTxId: 3389423 
> isInProgress: false } lastWriterEpoch: 57 lastCommittedTxId: 3389424
> 2020-01-11 02:54:35,575 INFO 
> org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Using longest 
> log: 10.6.1.4:8485=segmentState {
>   startTxId: 3389422
>   endTxId: 3389423
>   isInProgress: false
> }
> lastWriterEpoch: 57
> lastCommittedTxId: 3389424
> 2020-01-11 02:54:35,575 ERROR 
> org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: 
> recoverUnfinalizedSegments failed for required journal 
> (JournalAndStream(mgr=QJM to [10.6.1.4:8485, 10.6.1.181:8485, 
> 10.6.2.187:8485], stream=null))
> 

[jira] [Commented] (HDFS-15124) Crashing bugs in NameNode when using a valid configuration for `dfs.namenode.audit.loggers`

2020-02-22 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042695#comment-17042695
 ] 

Ayush Saxena commented on HDFS-15124:
-

Thanx Everyone for the work here.

[~ctest.team] you can check {{TestFsNamesystem.java}}; there is a 
{{testInitAuditLoggers()}} which already tests most of the things. You can add 
your case at the end of the same test and do something like this to reproduce 
your issue:
{code:java}
// Test Configuring TopAuditLogger.
conf.set(DFSConfigKeys.DFS_NAMENODE_AUDIT_LOGGERS_KEY,
"org.apache.hadoop.hdfs.server.namenode.top.TopAuditLogger");
fsn = new FSNamesystem(conf, fsImage);
auditLoggers = fsn.getAuditLoggers();
assertEquals(1, auditLoggers.size());
assertThat(auditLoggers.get(0), instanceOf(TopAuditLogger.class));
{code}
Let me know if you intend to write a test for something else and are facing 
trouble; I will try to help write one. :)

> Crashing bugs in NameNode when using a valid configuration for 
> `dfs.namenode.audit.loggers`
> ---
>
> Key: HDFS-15124
> URL: https://issues.apache.org/jira/browse/HDFS-15124
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.10.0
>Reporter: Ctest
>Assignee: Ctest
>Priority: Critical
> Attachments: HDFS-15124.000.patch, HDFS-15124.001.patch, 
> HDFS-15124.002.patch, HDFS-15124.003.patch, HDFS-15124.004.patch, 
> HDFS-15124.005.patch
>
>
> I am using Hadoop-2.10.0.
> The configuration parameter `dfs.namenode.audit.loggers` allows `default` 
> (which is the default value) and 
> `org.apache.hadoop.hdfs.server.namenode.top.TopAuditLogger`.
> When I use `org.apache.hadoop.hdfs.server.namenode.top.TopAuditLogger`, 
> namenode will not be started successfully because of an 
> `InstantiationException` thrown from 
> `org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initAuditLoggers`. 
> The root cause is that while initializing namenode, `initAuditLoggers` will 
> be called and it will try to call the default constructor of 
> `org.apache.hadoop.hdfs.server.namenode.top.TopAuditLogger` which doesn't 
> have a default constructor. Thus the `InstantiationException` exception is 
> thrown.
>  
> *Symptom*
> *$ ./start-dfs.sh*
> {code:java}
> 2019-12-18 14:05:20,670 ERROR 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem 
> initialization failed.java.lang.RuntimeException: 
> java.lang.InstantiationException: 
> org.apache.hadoop.hdfs.server.namenode.top.TopAuditLogger
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initAuditLoggers(FSNamesystem.java:1024)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:858)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:677)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:674)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:736)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:961)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:940)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1714)
> at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1782)
> Caused by: java.lang.InstantiationException: 
> org.apache.hadoop.hdfs.server.namenode.top.TopAuditLogger
> at java.lang.Class.newInstance(Class.java:427)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initAuditLoggers(FSNamesystem.java:1017)...
> 8 more
> Caused by: java.lang.NoSuchMethodException: 
> org.apache.hadoop.hdfs.server.namenode.top.TopAuditLogger.<init>()
> at java.lang.Class.getConstructor0(Class.java:3082)
> at java.lang.Class.newInstance(Class.java:412)
> ... 9 more{code}
>  
>  
> *Detailed Root Cause*
> There is no default constructor in 
> `org.apache.hadoop.hdfs.server.namenode.top.TopAuditLogger`: 
> {code:java}
> /** 
>  * An {@link AuditLogger} that sends logged data directly to the metrics 
>  * systems. It is used when the top service is used directly by the name node 
>  */ 
> @InterfaceAudience.Private 
> public class TopAuditLogger implements AuditLogger { 
> public static final Logger LOG = 
> LoggerFactory.getLogger(TopAuditLogger.class); 
>   private final TopMetrics topMetrics; 
>   public TopAuditLogger(TopMetrics topMetrics) {
> Preconditions.checkNotNull(topMetrics, "Cannot init with a null " + 
> "TopMetrics");
> this.topMetrics = topMetrics; 
>   }
>   @Override
>   public void initialize(Configuration conf) { 
>   }
> {code}
> As long as the configuration parameter `dfs.namenode.audit.loggers` is set to 
> 
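
For context, a minimal sketch (not Hadoop's exact code) of the reflective instantiation pattern that fails when a configured class has no accessible no-arg constructor:

{code:java}
public class ReflectionSketch {
  public static void main(String[] args) throws Exception {
    // FSNamesystem#initAuditLoggers instantiates each configured logger
    // class reflectively, roughly like this:
    Class<?> cls = Class.forName(
        "org.apache.hadoop.hdfs.server.namenode.top.TopAuditLogger");
    // newInstance() requires an accessible no-arg constructor;
    // TopAuditLogger only declares TopAuditLogger(TopMetrics), so this
    // throws InstantiationException, matching the symptom above.
    Object logger = cls.newInstance();
  }
}
{code}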

[jira] [Commented] (HDFS-15176) Enable GcTimePercentage Metric in NameNode's JvmMetrics.

2020-02-22 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042631#comment-17042631
 ] 

Ayush Saxena commented on HDFS-15176:
-

Thanx [~LiJinglun] for the patch.
nit:
In the doc, can you surround the configuration names with {{`name`}} to 
highlight them, as is done for the {{rpc context}} metrics? Like this:
{code:java}
| `GcTimePercentage` | The percentage (0..100) of time that the JVM spent in GC 
pauses within the observation window if `dfs.namenode.gc.time.monitor.enable` 
is set to true. Use `dfs.namenode.gc.time.monitor.sleep.interval.ms` to specify 
the sleep interval in msec. Use 
`dfs.namenode.gc.time.monitor.observation.window.ms` to specify the observation 
window in msec. |

{code}

Apart from that, LGTM +1


> Enable GcTimePercentage Metric in NameNode's JvmMetrics.
> 
>
> Key: HDFS-15176
> URL: https://issues.apache.org/jira/browse/HDFS-15176
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Jinglun
>Assignee: Jinglun
>Priority: Minor
> Attachments: HDFS-15176.001.patch, HDFS-15176.002.patch, 
> HDFS-15176.003.patch, HDFS-15176.004.patch, HDFS-15176.005.patch
>
>
> The GcTimePercentage(computed by GcTimeMonitor) could be used as a dimension 
> to analyze the NameNode GC.  We should add a switch config to enable the 
> GcTimePercentage metric in HDFS.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15182) TestBlockManager#testOneOfTwoRacksDecommissioned() fail in trunk

2020-02-22 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042474#comment-17042474
 ] 

Hudson commented on HDFS-15182:
---

SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #17978 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/17978/])
HDFS-15182. TestBlockManager#testOneOfTwoRacksDecommissioned() fail in trunk 
(ayushsaxena: rev ba9025c7cd8303dadaa792b6372a877414564cd7)
* (edit) 
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestBlockManager.java


> TestBlockManager#testOneOfTwoRacksDecommissioned() fail in trunk
> 
>
> Key: HDFS-15182
> URL: https://issues.apache.org/jira/browse/HDFS-15182
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Lisheng Sun
>Assignee: Lisheng Sun
>Priority: Minor
> Fix For: 3.3.0
>
> Attachments: HDFS-15182-001.patch, HDFS-15182-002.patch, 
> HDFS-15182-003.patch, HDFS-15182-004.patch
>
>
> When only the UT TestBlockManager#testOneOfTwoRacksDecommissioned() is run, 
> it fails and throws a NullPointerException.
> Since NameNode#metrics is a static variable, running all UTs in 
> TestBlockManager works because another UT initializes the metrics.
> But running only testOneOfTwoRacksDecommissioned, without the metrics 
> initialized, throws a NullPointerException.
> {code:java}
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.addBlock(BlockManager.java:4088)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.fulfillPipeline(TestBlockManager.java:518)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.doTestOneOfTwoRacksDecommissioned(TestBlockManager.java:388)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.testOneOfTwoRacksDecommissioned(TestBlockManager.java:353)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
>   at 
> com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
>   at 
> com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:47)
>   at 
> com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
>   at 
> com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
> {code}
> And testAllNodesHoldingReplicasDecommissioned, 
> testTwoOfThreeNodesDecommissioned, and testSufficientlyReplBlocksUsesNewRack 
> also have the same problem.
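
A minimal sketch (an assumed fix shape, not necessarily the committed patch) of how such isolated-test NPEs are typically avoided: initialize the static NameNode metrics in the test setup so each test can run on its own.

{code:java}
import java.io.IOException;
import org.apache.hadoop.hdfs.HdfsConfiguration;
import org.apache.hadoop.hdfs.server.common.HdfsServerConstants.NamenodeRole;
import org.apache.hadoop.hdfs.server.namenode.NameNode;
import org.junit.Before;

public class MetricsSetupSketch {
  @Before
  public void setupMockCluster() throws IOException {
    // Initialize the static NameNode#metrics so BlockManager.addBlock()
    // does not NPE when a single test runs without the others.
    NameNode.initMetrics(new HdfsConfiguration(), NamenodeRole.NAMENODE);
    // ... rest of the existing setup ...
  }
}
{code}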



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15186) Erasure Coding: Decommission may generate the parity block's content with all 0 in some case

2020-02-22 Thread Yao Guangdong (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042472#comment-17042472
 ] 

Yao Guangdong commented on HDFS-15186:
--

[~ayushtkn], yes, you are right. The major cause of the problem is on the 
namenode side. We also need to check the logic in HDFS-14768. But, as I said in 
the description, there are too many cases, including some we don't even know 
about. The data is very important in production; we can't accept any data 
errors. So I think we could add this patch to enhance data safety. And I also 
suggest that we increase the fault tolerance of the EC algorithm. What do you 
think?

> Erasure Coding: Decommission may generate the parity block's content with all 
> 0 in some case
> 
>
> Key: HDFS-15186
> URL: https://issues.apache.org/jira/browse/HDFS-15186
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding
>Affects Versions: 3.0.3, 3.2.1, 3.1.3
>Reporter: Yao Guangdong
>Assignee: Yao Guangdong
>Priority: Critical
> Attachments: HDFS-15186.001.patch
>
>
> I can find some parity blocks whose content is all 0 when I decommission 
> more than one DataNode from a cluster, and the probability is quite high 
> (parts per thousand). This is a big problem: if we read data from the 
> all-zero parity block, or use it to recover another block, we end up using 
> corrupt data without even knowing it.
> Some cases are below:
> B: Busy DataNode, 
> D: Decommissioning DataNode,
> others are normal.
> 1. Group indices are [0, 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)].
> 2. Group indices are [0(B,D), 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)].
> 
> In the first case, when the block group indices are [0, 1, 2, 3, 4, 5, 
> 6(B,D), 7, 8(D)], the DN may receive a reconstruct-block command with 
> liveIndices=[0, 1, 2, 3, 4, 5, 7, 8] and a targets field (in the class 
> StripedReconstructionInfo) of length 2.
> A targets length of 2 means that the DataNode needs to recover 2 internal 
> blocks in the current code. But from the liveIndices we can only find 1 
> missing block, so the method StripedWriter#initTargetIndices uses 0 as the 
> default block to recover, without checking whether index 0 is already among 
> the source indices.
> The EC algorithm is then invoked with source indices [0, 1, 2, 3, 4, 5] to 
> recover indices [6, 0]. We can see that index 0 is in both the source 
> indices and the target indices in this case, and the target buffer the EC 
> algorithm returns for index 6 is always all 0. At first I thought this was 
> the EC algorithm's problem, because it should be more fault tolerant, and I 
> tried to fix it there. But that is too hard, because there are too many 
> cases; the second example above is another one (source indices [1, 2, 3, 4, 
> 5, 7] used to recover indices [0, 6, 0]). So I changed my mind and instead 
> invoke the EC algorithm with correct parameters, which means removing the 
> duplicate target index 0 in this case. Finally, I fixed it in this way.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15182) TestBlockManager#testOneOfTwoRacksDecommissioned() fail in trunk

2020-02-22 Thread Ayush Saxena (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ayush Saxena updated HDFS-15182:

Fix Version/s: 3.3.0
 Hadoop Flags: Reviewed
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

> TestBlockManager#testOneOfTwoRacksDecommissioned() fail in trunk
> 
>
> Key: HDFS-15182
> URL: https://issues.apache.org/jira/browse/HDFS-15182
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Lisheng Sun
>Assignee: Lisheng Sun
>Priority: Minor
> Fix For: 3.3.0
>
> Attachments: HDFS-15182-001.patch, HDFS-15182-002.patch, 
> HDFS-15182-003.patch, HDFS-15182-004.patch
>
>
> When only the UT TestBlockManager#testOneOfTwoRacksDecommissioned() is run, 
> it fails and throws a NullPointerException.
> Since NameNode#metrics is a static variable, running all UTs in 
> TestBlockManager works because another UT initializes the metrics.
> But running only testOneOfTwoRacksDecommissioned, without the metrics 
> initialized, throws a NullPointerException.
> {code:java}
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.addBlock(BlockManager.java:4088)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.fulfillPipeline(TestBlockManager.java:518)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.doTestOneOfTwoRacksDecommissioned(TestBlockManager.java:388)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.testOneOfTwoRacksDecommissioned(TestBlockManager.java:353)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
>   at 
> com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
>   at 
> com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:47)
>   at 
> com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
>   at 
> com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
> {code}
> And testAllNodesHoldingReplicasDecommissioned, 
> testTwoOfThreeNodesDecommissioned, and testSufficientlyReplBlocksUsesNewRack 
> also have the same problem.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15182) TestBlockManager#testOneOfTwoRacksDecommissioned() fail in trunk

2020-02-22 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042465#comment-17042465
 ] 

Ayush Saxena commented on HDFS-15182:
-

Committed to trunk.
Thanx [~leosun08] for the contribution!!!

> TestBlockManager#testOneOfTwoRacksDecommissioned() fail in trunk
> 
>
> Key: HDFS-15182
> URL: https://issues.apache.org/jira/browse/HDFS-15182
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Lisheng Sun
>Assignee: Lisheng Sun
>Priority: Minor
> Attachments: HDFS-15182-001.patch, HDFS-15182-002.patch, 
> HDFS-15182-003.patch, HDFS-15182-004.patch
>
>
> When only the UT TestBlockManager#testOneOfTwoRacksDecommissioned() is run, 
> it fails and throws a NullPointerException.
> Since NameNode#metrics is a static variable, running all UTs in 
> TestBlockManager works because another UT initializes the metrics.
> But running only testOneOfTwoRacksDecommissioned, without the metrics 
> initialized, throws a NullPointerException.
> {code:java}
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.addBlock(BlockManager.java:4088)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.fulfillPipeline(TestBlockManager.java:518)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.doTestOneOfTwoRacksDecommissioned(TestBlockManager.java:388)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.testOneOfTwoRacksDecommissioned(TestBlockManager.java:353)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
>   at 
> com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
>   at 
> com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:47)
>   at 
> com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
>   at 
> com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
> {code}
> And testAllNodesHoldingReplicasDecommissioned, 
> testTwoOfThreeNodesDecommissioned, and testSufficientlyReplBlocksUsesNewRack 
> also have the same problem.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15186) Erasure Coding: Decommission may generate the parity block's content with all 0 in some case

2020-02-22 Thread Ayush Saxena (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042458#comment-17042458
 ] 

Ayush Saxena commented on HDFS-15186:
-

HDFS-14768 tends to make a busy live DN be considered in the live replica 
check. It seems the decommissioning busy replica should also be included in 
the live replica check.

> Erasure Coding: Decommission may generate the parity block's content with all 
> 0 in some case
> 
>
> Key: HDFS-15186
> URL: https://issues.apache.org/jira/browse/HDFS-15186
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, erasure-coding
>Affects Versions: 3.0.3, 3.2.1, 3.1.3
>Reporter: Yao Guangdong
>Assignee: Yao Guangdong
>Priority: Critical
> Attachments: HDFS-15186.001.patch
>
>
> I can find some parity blocks whose content is all 0 when I decommission 
> more than one DataNode from a cluster, and the probability is quite high 
> (parts per thousand). This is a big problem: if we read data from the 
> all-zero parity block, or use it to recover another block, we end up using 
> corrupt data without even knowing it.
> Some cases are below:
> B: Busy DataNode, 
> D: Decommissioning DataNode,
> others are normal.
> 1. Group indices are [0, 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)].
> 2. Group indices are [0(B,D), 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)].
> 
> In the first case, when the block group indices are [0, 1, 2, 3, 4, 5, 
> 6(B,D), 7, 8(D)], the DN may receive a reconstruct-block command with 
> liveIndices=[0, 1, 2, 3, 4, 5, 7, 8] and a targets field (in the class 
> StripedReconstructionInfo) of length 2.
> A targets length of 2 means that the DataNode needs to recover 2 internal 
> blocks in the current code. But from the liveIndices we can only find 1 
> missing block, so the method StripedWriter#initTargetIndices uses 0 as the 
> default block to recover, without checking whether index 0 is already among 
> the source indices.
> The EC algorithm is then invoked with source indices [0, 1, 2, 3, 4, 5] to 
> recover indices [6, 0]. We can see that index 0 is in both the source 
> indices and the target indices in this case, and the target buffer the EC 
> algorithm returns for index 6 is always all 0. At first I thought this was 
> the EC algorithm's problem, because it should be more fault tolerant, and I 
> tried to fix it there. But that is too hard, because there are too many 
> cases; the second example above is another one (source indices [1, 2, 3, 4, 
> 5, 7] used to recover indices [0, 6, 0]). So I changed my mind and instead 
> invoke the EC algorithm with correct parameters, which means removing the 
> duplicate target index 0 in this case. Finally, I fixed it in this way.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org