[jira] [Commented] (HDFS-15041) Make MAX_LOCK_HOLD_MS and full queue size configurable
[ https://issues.apache.org/jira/browse/HDFS-15041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042804#comment-17042804 ] Hadoop QA commented on HDFS-15041: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 53s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 21s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 0s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 50s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 8s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 16m 9s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 49s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 39s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 2s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 57s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 57s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 46s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 2s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 1s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} xml {color} | {color:green} 0m 1s{color} | {color:green} The patch has no ill-formed XML file. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 14m 7s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 53s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 36s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red}108m 54s{color} | {color:red} hadoop-hdfs in the patch failed. 
{color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 34s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}174m 43s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.hdfs.TestMultipleNNPortQOP | | | hadoop.hdfs.TestRollingUpgrade | \\ \\ || Subsystem || Report/Notes || | Docker | Client=19.03.6 Server=19.03.6 Image:yetus/hadoop:c44943d1fc3 | | JIRA Issue | HDFS-15041 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12994228/HDFS-15041.003.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle xml | | uname | Linux cb59fbed169b 4.15.0-74-generic #84-Ubuntu SMP Thu Dec 19 08:06:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / ba9025c | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_242 | | findbugs | v3.1.0-RC1 | | unit | https://builds.apache.org/job/PreCommit-HDFS-Build/28827/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt | | Test Results |
[jira] [Commented] (HDFS-15041) Make MAX_LOCK_HOLD_MS and full queue size configurable
[ https://issues.apache.org/jira/browse/HDFS-15041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042802#comment-17042802 ] Ayush Saxena commented on HDFS-15041: - Thanx [~zhuqi] for the update. v004 LGTM +1 (Pending Jenkins) > Make MAX_LOCK_HOLD_MS and full queue size configurable > -- > > Key: HDFS-15041 > URL: https://issues.apache.org/jira/browse/HDFS-15041 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: namenode >Affects Versions: 3.2.0 >Reporter: zhuqi >Assignee: zhuqi >Priority: Major > Attachments: HDFS-15041.001.patch, HDFS-15041.002.patch, > HDFS-15041.003.patch, HDFS-15041.004.patch > > > Now MAX_LOCK_HOLD_MS and the full queue size are fixed, but different > clusters have different needs for latency and queue-health standards. > We should make these two parameters configurable. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15041) Make MAX_LOCK_HOLD_MS and full queue size configurable
[ https://issues.apache.org/jira/browse/HDFS-15041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042798#comment-17042798 ] zhuqi commented on HDFS-15041: -- cc [~ayushtkn] Thanks for your review. I have fixed it. > Make MAX_LOCK_HOLD_MS and full queue size configurable > -- > > Key: HDFS-15041 > URL: https://issues.apache.org/jira/browse/HDFS-15041 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: namenode >Affects Versions: 3.2.0 >Reporter: zhuqi >Assignee: zhuqi >Priority: Major > Attachments: HDFS-15041.001.patch, HDFS-15041.002.patch, > HDFS-15041.003.patch, HDFS-15041.004.patch > > > Now MAX_LOCK_HOLD_MS and the full queue size are fixed, but different > clusters have different needs for latency and queue-health standards. > We should make these two parameters configurable.
[jira] [Updated] (HDFS-15041) Make MAX_LOCK_HOLD_MS and full queue size configurable
[ https://issues.apache.org/jira/browse/HDFS-15041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhuqi updated HDFS-15041: - Attachment: HDFS-15041.004.patch Status: Patch Available (was: In Progress) > Make MAX_LOCK_HOLD_MS and full queue size configurable > -- > > Key: HDFS-15041 > URL: https://issues.apache.org/jira/browse/HDFS-15041 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: namenode >Affects Versions: 3.2.0 >Reporter: zhuqi >Assignee: zhuqi >Priority: Major > Attachments: HDFS-15041.001.patch, HDFS-15041.002.patch, > HDFS-15041.003.patch, HDFS-15041.004.patch > > > Now MAX_LOCK_HOLD_MS and the full queue size are fixed, but different > clusters have different needs for latency and queue-health standards. > We should make these two parameters configurable.
[jira] [Updated] (HDFS-15041) Make MAX_LOCK_HOLD_MS and full queue size configurable
[ https://issues.apache.org/jira/browse/HDFS-15041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhuqi updated HDFS-15041: - Status: In Progress (was: Patch Available) > Make MAX_LOCK_HOLD_MS and full queue size configurable > -- > > Key: HDFS-15041 > URL: https://issues.apache.org/jira/browse/HDFS-15041 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: namenode >Affects Versions: 3.2.0 >Reporter: zhuqi >Assignee: zhuqi >Priority: Major > Attachments: HDFS-15041.001.patch, HDFS-15041.002.patch, > HDFS-15041.003.patch, HDFS-15041.004.patch > > > Now MAX_LOCK_HOLD_MS and the full queue size are fixed, but different > clusters have different needs for latency and queue-health standards. > We should make these two parameters configurable.
[jira] [Commented] (HDFS-15041) Make MAX_LOCK_HOLD_MS and full queue size configurable
[ https://issues.apache.org/jira/browse/HDFS-15041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042795#comment-17042795 ] Ayush Saxena commented on HDFS-15041: - Thanx [~zhuqi] for the update. {code:java} +TimeUnit.MICROSECONDS); {code} Shouldn't this be MILLISECONDS? And here, change it to 4 ms: {code:java} +4 {code} > Make MAX_LOCK_HOLD_MS and full queue size configurable > -- > > Key: HDFS-15041 > URL: https://issues.apache.org/jira/browse/HDFS-15041 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: namenode >Affects Versions: 3.2.0 >Reporter: zhuqi >Assignee: zhuqi >Priority: Major > Attachments: HDFS-15041.001.patch, HDFS-15041.002.patch, > HDFS-15041.003.patch > > > Now MAX_LOCK_HOLD_MS and the full queue size are fixed, but different > clusters have different needs for latency and queue-health standards. > We should make these two parameters configurable.
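The unit mix-up flagged in the review comment above is easy to miss and bites silently: the same numeric default means something 1000x smaller when read as microseconds instead of milliseconds. A minimal, self-contained illustration using plain `java.util.concurrent.TimeUnit` (not the actual patch code):

```java
import java.util.concurrent.TimeUnit;

public class LockHoldUnitMixup {
    public static void main(String[] args) {
        long defaultValue = 4; // intended as a 4 ms max lock hold time

        // Read with the wrong unit, the value collapses to 0 ms:
        long wrong = TimeUnit.MICROSECONDS.toMillis(defaultValue);
        // Read with the intended unit, it stays 4 ms:
        long right = TimeUnit.MILLISECONDS.toMillis(defaultValue);

        System.out.println(wrong + " ms vs " + right + " ms"); // 0 ms vs 4 ms
    }
}
```

A lock-hold limit of 0 ms would make the throttling logic trigger on effectively every check, which is why the reviewer asks for both the unit and the default to be corrected together.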
[jira] [Commented] (HDFS-15041) Make MAX_LOCK_HOLD_MS and full queue size configurable
[ https://issues.apache.org/jira/browse/HDFS-15041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042783#comment-17042783 ] zhuqi commented on HDFS-15041: -- cc [~ayushtkn] [~weichiu] I have changed the configuration to support time units. Please let me know if any other change is needed before merging. Thanks. > Make MAX_LOCK_HOLD_MS and full queue size configurable > -- > > Key: HDFS-15041 > URL: https://issues.apache.org/jira/browse/HDFS-15041 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: namenode >Affects Versions: 3.2.0 >Reporter: zhuqi >Assignee: zhuqi >Priority: Major > Attachments: HDFS-15041.001.patch, HDFS-15041.002.patch, > HDFS-15041.003.patch > > > Now MAX_LOCK_HOLD_MS and the full queue size are fixed, but different > clusters have different needs for latency and queue-health standards. > We should make these two parameters configurable.
[jira] [Updated] (HDFS-15041) Make MAX_LOCK_HOLD_MS and full queue size configurable
[ https://issues.apache.org/jira/browse/HDFS-15041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhuqi updated HDFS-15041: - Attachment: HDFS-15041.003.patch Status: Patch Available (was: In Progress) > Make MAX_LOCK_HOLD_MS and full queue size configurable > -- > > Key: HDFS-15041 > URL: https://issues.apache.org/jira/browse/HDFS-15041 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: namenode >Affects Versions: 3.2.0 >Reporter: zhuqi >Assignee: zhuqi >Priority: Major > Attachments: HDFS-15041.001.patch, HDFS-15041.002.patch, > HDFS-15041.003.patch > > > Now MAX_LOCK_HOLD_MS and the full queue size are fixed, but different > clusters have different needs for latency and queue-health standards. > We should make these two parameters configurable.
[jira] [Updated] (HDFS-15041) Make MAX_LOCK_HOLD_MS and full queue size configurable
[ https://issues.apache.org/jira/browse/HDFS-15041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhuqi updated HDFS-15041: - Status: In Progress (was: Patch Available) > Make MAX_LOCK_HOLD_MS and full queue size configurable > -- > > Key: HDFS-15041 > URL: https://issues.apache.org/jira/browse/HDFS-15041 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: namenode >Affects Versions: 3.2.0 >Reporter: zhuqi >Assignee: zhuqi >Priority: Major > Attachments: HDFS-15041.001.patch, HDFS-15041.002.patch > > > Now MAX_LOCK_HOLD_MS and the full queue size are fixed, but different > clusters have different needs for latency and queue-health standards. > We should make these two parameters configurable.
[jira] [Comment Edited] (HDFS-15186) Erasure Coding: Decommission may generate the parity block's content with all 0 in some case
[ https://issues.apache.org/jira/browse/HDFS-15186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042780#comment-17042780 ] Fei Hui edited comment on HDFS-15186 at 2/23/20 2:52 AM: - [~yaoguangdong] Thanks for reporting this. Good catch. Sorry for the late reply; I couldn't receive emails these days. +1 for [~ayushtkn]'s suggestions. I think index 6 is in neither liveIndices nor busyIndices, and this causes the problem. Maybe we should fix it on the NameNode side. was (Author: ferhui): [~yaoguangdong]Thanks for reporting this !Good Catch! Sorry for late, I couldn't receive emails these days! +1 for [~ayushtkn] suggestions. I thinks indice[6] is not in liveindcies and busyindices, this cause this problem. Maybe we should fix it in namenode side. > Erasure Coding: Decommission may generate the parity block's content with all > 0 in some case > > > Key: HDFS-15186 > URL: https://issues.apache.org/jira/browse/HDFS-15186 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode, erasure-coding >Affects Versions: 3.0.3, 3.2.1, 3.1.3 >Reporter: Yao Guangdong >Assignee: Yao Guangdong >Priority: Critical > Attachments: HDFS-15186.001.patch > > > I found some parity blocks whose content is all 0 after decommissioning more > than one DataNode from a cluster, and the probability is quite high (parts > per thousand). This is a serious problem: if we read data from the all-zero > parity block, or use it to recover another block, we end up using corrupt > data without even knowing it. > Example cases below, where B = busy DataNode, D = decommissioning DataNode, > and the others are normal: > 1. Group indices are [0, 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)]. > 2. Group indices are [0(B,D), 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)]. 
> > In the first case, when the block group indices are [0, 1, 2, 3, 4, 5, 6(B,D), > 7, 8(D)], the DataNode may receive a reconstruct-block command with > liveIndices=[0, 1, 2, 3, 4, 5, 7, 8] and a targets field (in the class > StripedReconstructionInfo) of length 2. > A targets length of 2 means the DataNode must recover 2 internal blocks under > the current code, but from the liveIndices only 1 missing block can be found, > so the method StripedWriter#initTargetIndices uses 0 as the default recovery > block index without checking whether index 0 is already among the source > indices. > The EC algorithm is then invoked with source indices [0, 1, 2, 3, 4, 5] to > recover indices [6, 0]; index 0 appears in both the source and the target > indices, and the returned target buffer for index 6 is always all 0. At first > I thought this was the EC algorithm's problem, since it should be more fault > tolerant, and I tried to fix it there, but that is too hard because there are > too many cases (the second scenario above is another example: it uses source > indices [1, 2, 3, 4, 5, 7] to recover indices [0, 6, 0]). So I changed my > mind and instead invoke the EC algorithm with correct parameters, i.e. remove > the duplicate target index 0 in this case. That is how I fixed it. >
[jira] [Commented] (HDFS-15186) Erasure Coding: Decommission may generate the parity block's content with all 0 in some case
[ https://issues.apache.org/jira/browse/HDFS-15186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042780#comment-17042780 ] Fei Hui commented on HDFS-15186: [~yaoguangdong] Thanks for reporting this. Good catch. Sorry for the late reply; I couldn't receive emails these days. +1 for [~ayushtkn]'s suggestions. I think index 6 is in neither liveIndices nor busyIndices, and this causes the problem. Maybe we should fix it on the NameNode side. > Erasure Coding: Decommission may generate the parity block's content with all > 0 in some case > > > Key: HDFS-15186 > URL: https://issues.apache.org/jira/browse/HDFS-15186 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode, erasure-coding >Affects Versions: 3.0.3, 3.2.1, 3.1.3 >Reporter: Yao Guangdong >Assignee: Yao Guangdong >Priority: Critical > Attachments: HDFS-15186.001.patch > > > I found some parity blocks whose content is all 0 after decommissioning more > than one DataNode from a cluster, and the probability is quite high (parts > per thousand). This is a serious problem: if we read data from the all-zero > parity block, or use it to recover another block, we end up using corrupt > data without even knowing it. > Example cases below, where B = busy DataNode, D = decommissioning DataNode, > and the others are normal: > 1. Group indices are [0, 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)]. > 2. Group indices are [0(B,D), 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)]. > > In the first case, when the block group indices are [0, 1, 2, 3, 4, 5, 6(B,D), > 7, 8(D)], the DataNode may receive a reconstruct-block command with > liveIndices=[0, 1, 2, 3, 4, 5, 7, 8] and a targets field (in the class > StripedReconstructionInfo) of length 2. 
> A targets length of 2 means the DataNode must recover 2 internal blocks under > the current code, but from the liveIndices only 1 missing block can be found, > so the method StripedWriter#initTargetIndices uses 0 as the default recovery > block index without checking whether index 0 is already among the source > indices. > The EC algorithm is then invoked with source indices [0, 1, 2, 3, 4, 5] to > recover indices [6, 0]; index 0 appears in both the source and the target > indices, and the returned target buffer for index 6 is always all 0. At first > I thought this was the EC algorithm's problem, since it should be more fault > tolerant, and I tried to fix it there, but that is too hard because there are > too many cases (the second scenario above is another example: it uses source > indices [1, 2, 3, 4, 5, 7] to recover indices [0, 6, 0]). So I changed my > mind and instead invoke the EC algorithm with correct parameters, i.e. remove > the duplicate target index 0 in this case. That is how I fixed it. >
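The fix the reporter describes, calling the EC decoder with corrected parameters by dropping target indices that are duplicated or that also appear among the sources, can be sketched roughly as below. `sanitizeTargets` is a hypothetical helper for illustration, not the code in HDFS-15186.001.patch:

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

public class TargetIndexDedup {
    // Hypothetical helper: drop target indices that are duplicates or that
    // already appear among the source indices, so the decoder is never asked
    // to reconstruct the same internal block twice (which, per the report,
    // yields an all-zero buffer for the other target).
    static int[] sanitizeTargets(int[] sources, int[] targets) {
        Set<Integer> sourceSet = new LinkedHashSet<>();
        for (int s : sources) {
            sourceSet.add(s);
        }
        Set<Integer> cleaned = new LinkedHashSet<>();
        for (int t : targets) {
            if (!sourceSet.contains(t)) {
                cleaned.add(t); // the set also discards duplicate targets
            }
        }
        return cleaned.stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        // Second case from the description: recover [0, 6, 0] from sources
        // [1, 2, 3, 4, 5, 7]; the duplicate 0 is removed before decoding.
        int[] cleaned = sanitizeTargets(new int[]{1, 2, 3, 4, 5, 7},
                                        new int[]{0, 6, 0});
        System.out.println(Arrays.toString(cleaned)); // [0, 6]
    }
}
```

In the first case from the description (sources [0, 1, 2, 3, 4, 5], targets [6, 0]), the same filter drops the spurious 0 because it is already a source, leaving only index 6 to reconstruct.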
[jira] [Commented] (HDFS-15041) Make MAX_LOCK_HOLD_MS and full queue size configurable
[ https://issues.apache.org/jira/browse/HDFS-15041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042703#comment-17042703 ] Ayush Saxena commented on HDFS-15041: - Thanx [~zhuqi] for the patch. We can make the configuration support time units and get the time using {{conf.getTimeDuration(..)}}. > Make MAX_LOCK_HOLD_MS and full queue size configurable > -- > > Key: HDFS-15041 > URL: https://issues.apache.org/jira/browse/HDFS-15041 > Project: Hadoop HDFS > Issue Type: Sub-task > Components: namenode >Affects Versions: 3.2.0 >Reporter: zhuqi >Assignee: zhuqi >Priority: Major > Attachments: HDFS-15041.001.patch, HDFS-15041.002.patch > > > Now MAX_LOCK_HOLD_MS and the full queue size are fixed, but different > clusters have different needs for latency and queue-health standards. > We should make these two parameters configurable.
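For context on the suggestion above: Hadoop's {{Configuration.getTimeDuration(name, default, unit)}} accepts values with a time-unit suffix (e.g. "4ms", "2s") and falls back to the caller's unit for bare numbers. The toy parser below only mimics that suffix behavior for a few units, to show why it makes the config friendlier; it is not Hadoop code, and the real method supports more suffixes (ns, us, h, d, ...):

```java
import java.util.concurrent.TimeUnit;

public class TimeDurationSketch {
    // Toy sketch of getTimeDuration-style suffix handling: a bare number is
    // interpreted in the caller-supplied default unit, while "ms", "s", or
    // "m" suffixes select the unit explicitly. Returns milliseconds.
    static long toMillis(String value, TimeUnit defaultUnit) {
        if (value.endsWith("ms")) {
            return Long.parseLong(value.substring(0, value.length() - 2));
        } else if (value.endsWith("s")) {
            return TimeUnit.SECONDS.toMillis(
                Long.parseLong(value.substring(0, value.length() - 1)));
        } else if (value.endsWith("m")) {
            return TimeUnit.MINUTES.toMillis(
                Long.parseLong(value.substring(0, value.length() - 1)));
        }
        return defaultUnit.toMillis(Long.parseLong(value));
    }

    public static void main(String[] args) {
        System.out.println(toMillis("4ms", TimeUnit.MILLISECONDS)); // 4
        System.out.println(toMillis("2s", TimeUnit.MILLISECONDS));  // 2000
        System.out.println(toMillis("4", TimeUnit.MILLISECONDS));   // 4
    }
}
```

The advantage over a plain {{conf.getLong(..)}} is that operators can write the lock-hold limit in whatever unit reads naturally, and the code always receives it in the unit it asked for.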
[jira] [Commented] (HDFS-15103) JMX endpoint and "dfsadmin" report 1 corrupt block; "fsck" reports 0
[ https://issues.apache.org/jira/browse/HDFS-15103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042702#comment-17042702 ] Ayush Saxena commented on HDFS-15103: - The corrupt replica can be from an open file; fsck has an option to include open files too. Try running it with that option, and check fsck with the block id; the corrupt replica should be listed there. > JMX endpoint and "dfsadmin" report 1 corrupt block; "fsck" reports 0 > > > Key: HDFS-15103 > URL: https://issues.apache.org/jira/browse/HDFS-15103 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.2.1 > Environment: * CentOS 7 > * HDFS 3.2.1 > * 2x HA NNs > * 5x identical DNs >Reporter: Mark Ormesher >Priority: Major > > We're seeing a long-running discrepancy between the number of corrupted > blocks reported by the JMX endpoint and {{dfsadmin -report}} (1) and by > {{fsck /}} (0). This has persisted through rolling restarts of the NNs and > DNs, and through complete shutdowns of the HDFS cluster for unrelated > maintenance. > {panel:title=JMX endpoint snippet} > {code} > (...) > "CorruptBlocks" : 1, > "ScheduledReplicationBlocks" : 0, > "PendingDeletionBlocks" : 0, > "LowRedundancyReplicatedBlocks" : 0, > "CorruptReplicatedBlocks" : 1, > "MissingReplicatedBlocks" : 0, > "MissingReplicationOneBlocks" : 0, > (...) > {code} > {panel} > {panel:title=dfsadmin -report} > {code} > $ ./hdfs dfsadmin -report | grep -i corrupt > Blocks with corrupt replicas: 1 > Block groups with corrupt internal blocks: 0 > {code} > {panel} > {panel:title=fsck /} > {code} > $ ./hdfs fsck / -files -blocks | grep -i corrupt > Corrupt blocks:0 > Corrupt block groups: 0 > {code} > {panel} > I've read through the related tickets below, all of which suggest this issue > was resolved in 2.7.8, but we're seeing it in 3.2.1. 
> https://issues.apache.org/jira/browse/HDFS-8533 > https://issues.apache.org/jira/browse/HDFS-10213 > https://issues.apache.org/jira/browse/HDFS-13999 > How can we work out whether we really do have a corrupt block, and if we do > how can we work out which block it is if {{fsck}} thinks everything is fine?
[jira] [Commented] (HDFS-15114) JournalNodes' committed-txid file includes aborted transaction, breaks NameNode startup
[ https://issues.apache.org/jira/browse/HDFS-15114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042696#comment-17042696 ] Ayush Saxena commented on HDFS-15114: - Thanx [~Steven Rand] for the report. Are you able to reproduce the issue? Will it be possible to extend a UT for it? > JournalNodes' committed-txid file includes aborted transaction, breaks > NameNode startup > --- > > Key: HDFS-15114 > URL: https://issues.apache.org/jira/browse/HDFS-15114 > Project: Hadoop HDFS > Issue Type: Bug > Components: journal-node, namenode >Affects Versions: 3.2.1 >Reporter: Steven Rand >Priority: Major > > A couple of days ago, our active NameNode in an HA setup aborted a > {{QuorumOutputStream}} starting at tx 3389424 because tx 3389425 failed to be > written. This was likely related to a rolling restart of the three > JournalNodes that was happening at this time. The NameNode logged: > {code:java} > 2020-01-11 02:00:50,229 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll Edit Log from > 10.6.1.181 > 2020-01-11 02:00:50,229 INFO > org.apache.hadoop.hdfs.server.namenode.FSEditLog: Rolling edit logs > 2020-01-11 02:00:50,229 INFO > org.apache.hadoop.hdfs.server.namenode.FSEditLog: Ending log segment 3389424, > 3389424 > 2020-01-11 02:00:50,229 INFO > org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 2 > Total time for transactions(ms): 1 Number of transactions batched in Syncs: > 0 Number of syncs: 1 SyncTimes(ms): 1 7 > 2020-01-11 02:00:50,245 WARN > org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Remote journal > 10.6.2.187:8485 failed to write txns 3389425-3389425. Will try to write to > this JN again after the next log roll. > org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.qjournal.protocol.JournalOutOfSyncException): > Can't write, no segment open ; journal id: > at org.apache.hadoop.hdfs.qjournal.server.Journal.checkSync(Journal.java:545) > ... rest of stacktrace ... 
> // the same warning for the second JournalNode > // the same warning for the third JournalNode > 2020-01-11 02:00:50,246 ERROR > org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: flush failed for > required journal (JournalAndStream(mgr=QJM to [10.6.1.4:8485, > 10.6.1.181:8485, 10.6.2.187:8485], stream=QuorumOutputStream starting at txid > 3389424)) > org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many > exceptions to achieve quorum size 2/3. 3 exceptions thrown: > // the same "Can't write, no segment open ; journal id: " error > for all 3 JournalNodes > 2020-01-11 02:00:50,246 WARN > org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Aborting > QuorumOutputStream starting at txid 3389424 > 2020-01-11 02:00:50,255 INFO org.apache.hadoop.util.ExitUtil: Exiting with > status 1: Error: flush failed for required journal (JournalAndStream(mgr=QJM > to [10.6.1.4:8485, 10.6.1.181:8485, 10.6.2.187:8485], > stream=QuorumOutputStream starting at txid 3389424)) > {code} > Even though the stream was aborted, the {{committed-txid}} file on each of > the three JournalNodes was updated to be {{3389424}}. > This caused both NameNodes to fail to start with this error: > > {code:java} > 2020-01-11 02:54:35,483 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting services > required for active state > 2020-01-11 02:54:35,491 INFO > org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Starting > recovery process for unclosed journal segments... > 2020-01-11 02:54:35,537 INFO > org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Successfully > started new epoch 80 > 2020-01-11 02:54:35,537 INFO > org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Beginning > recovery of unclosed segment starting at txid 3389422 > 2020-01-11 02:54:35,574 INFO > org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Recovery prepare > phase complete. 
Responses: > 10.6.1.4:8485: segmentState { startTxId: 3389422 endTxId: 3389423 > isInProgress: false } lastWriterEpoch: 57 lastCommittedTxId: 3389424 > 10.6.2.187:8485: segmentState { startTxId: 3389422 endTxId: 3389423 > isInProgress: false } lastWriterEpoch: 57 lastCommittedTxId: 3389424 > 2020-01-11 02:54:35,575 INFO > org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Using longest > log: 10.6.1.4:8485=segmentState { > startTxId: 3389422 > endTxId: 3389423 > isInProgress: false > } > lastWriterEpoch: 57 > lastCommittedTxId: 3389424 > 2020-01-11 02:54:35,575 ERROR > org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: > recoverUnfinalizedSegments failed for required journal > (JournalAndStream(mgr=QJM to [10.6.1.4:8485, 10.6.1.181:8485, > 10.6.2.187:8485], stream=null)) >
[jira] [Commented] (HDFS-15124) Crashing bugs in NameNode when using a valid configuration for `dfs.namenode.audit.loggers`
[ https://issues.apache.org/jira/browse/HDFS-15124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042695#comment-17042695 ] Ayush Saxena commented on HDFS-15124: - Thanx everyone for the work here. [~ctest.team], you can check {{TestFsNamesystem.java}}; there is a {{testInitAuditLoggers()}} which already tests most of these things. You can add your case to the same test: append something like this at the end to reproduce your issue: {code:java} // Test Configuring TopAuditLogger. conf.set(DFSConfigKeys.DFS_NAMENODE_AUDIT_LOGGERS_KEY, "org.apache.hadoop.hdfs.server.namenode.top.TopAuditLogger"); fsn = new FSNamesystem(conf, fsImage); auditLoggers = fsn.getAuditLoggers(); assertEquals(1, auditLoggers.size()); assertThat(auditLoggers.get(0), instanceOf(TopAuditLogger.class)); {code} Let me know if you intend to write a test for something else and are facing trouble; I will try to help write one. :) > Crashing bugs in NameNode when using a valid configuration for > `dfs.namenode.audit.loggers` > --- > > Key: HDFS-15124 > URL: https://issues.apache.org/jira/browse/HDFS-15124 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.10.0 >Reporter: Ctest >Assignee: Ctest >Priority: Critical > Attachments: HDFS-15124.000.patch, HDFS-15124.001.patch, > HDFS-15124.002.patch, HDFS-15124.003.patch, HDFS-15124.004.patch, > HDFS-15124.005.patch > > > I am using Hadoop-2.10.0. > The configuration parameter `dfs.namenode.audit.loggers` allows `default` > (which is the default value) and > `org.apache.hadoop.hdfs.server.namenode.top.TopAuditLogger`. > When I use `org.apache.hadoop.hdfs.server.namenode.top.TopAuditLogger`, > the namenode will not start successfully because of an > `InstantiationException` thrown from > `org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initAuditLoggers`. 
> The root cause is that while initializing namenode, `initAuditLoggers` will > be called and it will try to call the default constructor of > `org.apache.hadoop.hdfs.server.namenode.top.TopAuditLogger` which doesn't > have a default constructor. Thus the `InstantiationException` exception is > thrown. > > *Symptom* > *$ ./start-dfs.sh* > {code:java} > 2019-12-18 14:05:20,670 ERROR > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem > initialization failed.java.lang.RuntimeException: > java.lang.InstantiationException: > org.apache.hadoop.hdfs.server.namenode.top.TopAuditLogger > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initAuditLoggers(FSNamesystem.java:1024) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.(FSNamesystem.java:858) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:677) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:674) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:736) > at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:961) > at org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:940) > at > org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1714) > at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1782) > Caused by: java.lang.InstantiationException: > org.apache.hadoop.hdfs.server.namenode.top.TopAuditLogger > at java.lang.Class.newInstance(Class.java:427) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initAuditLoggers(FSNamesystem.java:1017)... > 8 more > Caused by: java.lang.NoSuchMethodException: > org.apache.hadoop.hdfs.server.namenode.top.TopAuditLogger.() > at java.lang.Class.getConstructor0(Class.java:3082) > at java.lang.Class.newInstance(Class.java:412) > ... 
9 more{code} > > > *Detailed Root Cause* > There is no default constructor in > `org.apache.hadoop.hdfs.server.namenode.top.TopAuditLogger`: > {code:java} > /** > * An {@link AuditLogger} that sends logged data directly to the metrics > * systems. It is used when the top service is used directly by the name node > */ > @InterfaceAudience.Private > public class TopAuditLogger implements AuditLogger { > public static final Logger LOG = > LoggerFactory.getLogger(TopAuditLogger.class); > private final TopMetrics topMetrics; > public TopAuditLogger(TopMetrics topMetrics) { > Preconditions.checkNotNull(topMetrics, "Cannot init with a null " + > "TopMetrics"); > this.topMetrics = topMetrics; > } > @Override > public void initialize(Configuration conf) { > } > {code} > As long as the configuration parameter `dfs.namenode.audit.loggers` is set to >
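The failure mode described above can be reproduced outside HDFS with a minimal sketch. The class names below are illustrative stand-ins, not the actual Hadoop classes: reflectively instantiating a class that only defines a parameterized constructor fails just as in the stack trace, because there is no implicit no-arg constructor.

```java
// Minimal sketch of the root cause: reflective instantiation of a class
// that has only a parameterized constructor. Names are illustrative,
// not the real Hadoop classes.
public class NoArgCtorDemo {

    // Mimics TopAuditLogger: a single constructor taking a dependency,
    // so the compiler generates no implicit no-arg constructor.
    static class ParamOnlyLogger {
        private final Object metrics;
        ParamOnlyLogger(Object metrics) { this.metrics = metrics; }
    }

    // Returns the simple name of the exception thrown, or "ok" on success.
    static String tryInstantiate() {
        try {
            // Roughly what FSNamesystem.initAuditLoggers does: assume a
            // no-arg constructor exists and instantiate via reflection.
            ParamOnlyLogger.class.getDeclaredConstructor().newInstance();
            return "ok";
        } catch (ReflectiveOperationException e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        // Fails because ParamOnlyLogger has no <init>() constructor.
        System.out.println(tryInstantiate()); // prints NoSuchMethodException
    }
}
```

This is why the fix has to either add a no-arg constructor to the logger or teach the caller to use the parameterized one.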
[jira] [Commented] (HDFS-15176) Enable GcTimePercentage Metric in NameNode's JvmMetrics.
[ https://issues.apache.org/jira/browse/HDFS-15176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042631#comment-17042631 ] Ayush Saxena commented on HDFS-15176: - Thanx [~LiJinglun] for the patch. nit: In the doc, can you surround the configuration names with {{`name`}} to highlight them, as is done for the {{rpc context}} metrics? Like this: {code:java} | `GcTimePercentage` | The percentage (0..100) of time that the JVM spent in GC pauses within the observation window if `dfs.namenode.gc.time.monitor.enable` is set to true. Use `dfs.namenode.gc.time.monitor.sleep.interval.ms` to specify the sleep interval in msec. Use `dfs.namenode.gc.time.monitor.observation.window.ms` to specify the observation window in msec. | {code} Apart from that, LGTM +1 > Enable GcTimePercentage Metric in NameNode's JvmMetrics. > > > Key: HDFS-15176 > URL: https://issues.apache.org/jira/browse/HDFS-15176 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Jinglun >Assignee: Jinglun >Priority: Minor > Attachments: HDFS-15176.001.patch, HDFS-15176.002.patch, > HDFS-15176.003.patch, HDFS-15176.004.patch, HDFS-15176.005.patch > > > The GcTimePercentage (computed by GcTimeMonitor) could be used as a dimension > to analyze NameNode GC. We should add a switch config to enable the > GcTimePercentage metric in HDFS. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
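What the documented metric measures can be illustrated outside Hadoop with the standard JMX beans. The sketch below is not the actual GcTimeMonitor implementation; it is a rough approximation of "percentage of time spent in GC pauses within an observation window", sampled once over a single window.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Rough illustration of what GcTimePercentage means, using the standard
// JMX garbage-collector beans. Not the actual GcTimeMonitor code.
public class GcTimePercentageSketch {

    // Cumulative GC pause time (ms) across all collectors since JVM start.
    static long totalGcTimeMs() {
        long total = 0;
        for (GarbageCollectorMXBean gc :
                ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    // Percentage (0..100) of the given window spent in GC, sampled once.
    static int gcTimePercentage(long windowMs) {
        long gcStart = totalGcTimeMs();
        long wallStart = System.nanoTime();
        try {
            Thread.sleep(windowMs);          // the observation window
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        long gcDelta = totalGcTimeMs() - gcStart;
        long wallDeltaMs = (System.nanoTime() - wallStart) / 1_000_000;
        return (int) Math.min(100, 100 * gcDelta / Math.max(1, wallDeltaMs));
    }

    public static void main(String[] args) {
        System.out.println("GcTimePercentage=" + gcTimePercentage(200));
    }
}
```

The real monitor keeps a sliding window driven by the sleep-interval and observation-window configs quoted above, rather than a one-shot sample.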
[jira] [Commented] (HDFS-15182) TestBlockManager#testOneOfTwoRacksDecommissioned() fail in trunk
[ https://issues.apache.org/jira/browse/HDFS-15182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042474#comment-17042474 ] Hudson commented on HDFS-15182: --- SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #17978 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/17978/]) HDFS-15182. TestBlockManager#testOneOfTwoRacksDecommissioned() fail in (ayushsaxena: rev ba9025c7cd8303dadaa792b6372a877414564cd7) * (edit) hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/server/blockmanagement/TestBlockManager.java > TestBlockManager#testOneOfTwoRacksDecommissioned() fail in trunk > > > Key: HDFS-15182 > URL: https://issues.apache.org/jira/browse/HDFS-15182 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Lisheng Sun >Assignee: Lisheng Sun >Priority: Minor > Fix For: 3.3.0 > > Attachments: HDFS-15182-001.patch, HDFS-15182-002.patch, > HDFS-15182-003.patch, HDFS-15182-004.patch > > > When only the single unit test TestBlockManager#testOneOfTwoRacksDecommissioned() is > run, it fails with a NullPointerException. > Since NameNode#metrics is a static variable, running all the unit tests in TestBlockManager > lets other tests initialize the metrics first. > But running testOneOfTwoRacksDecommissioned alone, without the metrics > initialized, throws a NullPointerException. 
> {code:java} > java.lang.NullPointerException > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.addBlock(BlockManager.java:4088) > at > org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.fulfillPipeline(TestBlockManager.java:518) > at > org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.doTestOneOfTwoRacksDecommissioned(TestBlockManager.java:388) > at > org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.testOneOfTwoRacksDecommissioned(TestBlockManager.java:353) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) > at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) > at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) > at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) > at org.junit.runners.ParentRunner.run(ParentRunner.java:363) > at org.junit.runner.JUnitCore.run(JUnitCore.java:137) > at > 
com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68) > at > com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:47) > at > com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242) > at > com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70) > {code} > And testAllNodesHoldingReplicasDecommissioned, > testTwoOfThreeNodesDecommissioned, and testSufficientlyReplBlocksUsesNewRack > have the same problem.
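The failure mode can be modeled in a few lines. The classes below are a hypothetical miniature, not the actual Hadoop code: a static collaborator that other tests in the suite happen to initialize, so the test passes in a full run but throws an NPE when run alone; the fix pattern is to initialize the static dependency in the test's own setup.

```java
// Hypothetical miniature of the failure: a static field that other tests
// in the suite happen to set. These are not the actual Hadoop classes.
public class StaticMetricsDemo {

    static class Metrics {
        int blocksAdded;
        void incrBlocksAdded() { blocksAdded++; }
    }

    // Plays the role of BlockManager relying on the static NameNode#metrics.
    static class MiniBlockManager {
        static Metrics metrics;               // static, initialized elsewhere
        void addBlock() {
            metrics.incrBlocksAdded();        // NPE if metrics was never set
        }
    }

    // Fix pattern: the test's own setup initializes the static dependency
    // instead of relying on other tests having done it first.
    static void setUp() {
        MiniBlockManager.metrics = new Metrics();
    }

    public static void main(String[] args) {
        setUp();                              // omit this and addBlock() throws NPE
        new MiniBlockManager().addBlock();
        System.out.println("blocksAdded=" + MiniBlockManager.metrics.blocksAdded);
    }
}
```

Running `addBlock()` without `setUp()` reproduces the standalone-test NPE; with it, the count increments normally.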
[jira] [Commented] (HDFS-15186) Erasure Coding: Decommission may generate the parity block's content with all 0 in some case
[ https://issues.apache.org/jira/browse/HDFS-15186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042472#comment-17042472 ] Yao Guangdong commented on HDFS-15186: -- [~ayushtkn], yes, you are right. The major cause of the problem is in the NameNode. We also need to check the logic in HDFS-14768. But, as I said in the description, there are too many cases, and some we may not even know about. The data in production is very important; we cannot accept any data errors. So I think we could add this patch to enhance data safety, and I also suggest we increase the fault tolerance of the EC algorithm. What do you think? > Erasure Coding: Decommission may generate the parity block's content with all > 0 in some case > > > Key: HDFS-15186 > URL: https://issues.apache.org/jira/browse/HDFS-15186 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode, erasure-coding >Affects Versions: 3.0.3, 3.2.1, 3.1.3 >Reporter: Yao Guangdong >Assignee: Yao Guangdong >Priority: Critical > Attachments: HDFS-15186.001.patch > > > I found that some parity blocks' content is all 0 after decommissioning more than one > DataNode from a cluster, and the probability is quite high (parts > per thousand). This is a serious problem: if we read data from > the all-zero parity block, or use it to recover another block, we may consume > corrupted data without even knowing it. > Here are some cases: > B: busy DataNode, > D: decommissioning DataNode, > others are normal. > 1. Group indices are [0, 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)]. > 2. Group indices are [0(B,D), 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)]. > > In the first case, when the block group indices are [0, 1, 2, 3, 4, 5, 6(B,D), > 7, 8(D)], the DN may receive a reconstruct-block command with > liveIndices=[0, 1, 2, 3, 4, 5, 7, 8] and a targets field (in > the class StripedReconstructionInfo) of length 2. 
> A targets length of 2 means the DataNode needs to recover 2 > internal blocks in the current code. But from the liveIndices we can only find 1 > missing block, so the method StripedWriter#initTargetIndices uses 0 as > the default recovered block index without checking whether index 0 is already among the source > indices. > The decoder is then called with source indices [0, 1, 2, 3, 4, 5] to recover indices [6, 0]. > Index [0] is thus in both the source indices and the target indices, and in this case > the target buffer returned for index [6] by the EC algorithm is always 0. One could argue this > is a fault-tolerance problem in the EC algorithm itself; I tried > to fix it there, but it is too hard because there are too many cases. The second > example above is another one: source indices [1, 2, 3, 4, 5, 7] are used to > recover indices [0, 6, 0]. So I changed my mind and instead invoke the EC algorithm > with correct parameters, i.e. remove the duplicate target > index 0 in this case. That is how I fixed it. >
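The fix idea described in the report (call the decoder with correct parameters by dropping target indices that collide with source indices, plus duplicates) can be sketched as follows. The names here are illustrative, not the actual HDFS-15186 patch:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch of the fix idea: before invoking the EC decoder, drop any target
// index that also appears among the source indices (the spurious default
// index 0), and drop duplicate targets (the [0, 6, 0] case).
public class DedupTargetIndices {

    static int[] sanitizeTargets(int[] sourceIndices, int[] targetIndices) {
        Set<Integer> sources = new HashSet<>();
        for (int s : sourceIndices) {
            sources.add(s);
        }
        // A genuine reconstruction target cannot be one of the healthy sources,
        // and the same target must not be requested twice.
        return Arrays.stream(targetIndices)
                .filter(t -> !sources.contains(t))
                .distinct()
                .toArray();
    }

    public static void main(String[] args) {
        // First case from the description: sources [0..5], bogus targets [6, 0].
        System.out.println(Arrays.toString(
                sanitizeTargets(new int[]{0, 1, 2, 3, 4, 5},
                                new int[]{6, 0})));       // [6]
        // Second case: sources [1, 2, 3, 4, 5, 7], targets [0, 6, 0].
        System.out.println(Arrays.toString(
                sanitizeTargets(new int[]{1, 2, 3, 4, 5, 7},
                                new int[]{0, 6, 0})));    // [0, 6]
    }
}
```

With the collision and the duplicate removed, the decoder is only asked for indices that are genuinely missing, so it can no longer hand back an untouched all-zero buffer for a real target.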
[jira] [Updated] (HDFS-15182) TestBlockManager#testOneOfTwoRacksDecommissioned() fail in trunk
[ https://issues.apache.org/jira/browse/HDFS-15182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ayush Saxena updated HDFS-15182: Fix Version/s: 3.3.0 Hadoop Flags: Reviewed Resolution: Fixed Status: Resolved (was: Patch Available) > TestBlockManager#testOneOfTwoRacksDecommissioned() fail in trunk
[jira] [Commented] (HDFS-15182) TestBlockManager#testOneOfTwoRacksDecommissioned() fail in trunk
[ https://issues.apache.org/jira/browse/HDFS-15182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042465#comment-17042465 ] Ayush Saxena commented on HDFS-15182: - Committed to trunk. Thanx [~leosun08] for the contribution!!! > TestBlockManager#testOneOfTwoRacksDecommissioned() fail in trunk
[jira] [Commented] (HDFS-15186) Erasure Coding: Decommission may generate the parity block's content with all 0 in some case
[ https://issues.apache.org/jira/browse/HDFS-15186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17042458#comment-17042458 ] Ayush Saxena commented on HDFS-15186: - HDFS-14768 tends to make a busy live DN be considered for the live-replica check. It seems the decommissioning busy replica should also be included in the live-replica check. > Erasure Coding: Decommission may generate the parity block's content with all > 0 in some case