[jira] [Commented] (HDFS-15182) TestBlockManager#testOneOfTwoRacksDecommissioned() fail in trunk
[ https://issues.apache.org/jira/browse/HDFS-15182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041633#comment-17041633 ]

Ayush Saxena commented on HDFS-15182:
-------------------------------------

Jenkins seems to have complaints. I have triggered it again.

> TestBlockManager#testOneOfTwoRacksDecommissioned() fail in trunk
> ----------------------------------------------------------------
>
>                 Key: HDFS-15182
>                 URL: https://issues.apache.org/jira/browse/HDFS-15182
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Lisheng Sun
>            Assignee: Lisheng Sun
>            Priority: Minor
>         Attachments: HDFS-15182-001.patch, HDFS-15182-002.patch,
> HDFS-15182-003.patch
>
>
> When TestBlockManager#testOneOfTwoRacksDecommissioned() is run on its own, it
> fails with a NullPointerException.
> NameNode#metrics is a static variable. When all unit tests in TestBlockManager
> are run, another test has already initialized the metrics.
> When only testOneOfTwoRacksDecommissioned() is run, the metrics are never
> initialized and the test throws a NullPointerException.
> {code:java}
> java.lang.NullPointerException
>   at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.addBlock(BlockManager.java:4088)
>   at org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.fulfillPipeline(TestBlockManager.java:518)
>   at org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.doTestOneOfTwoRacksDecommissioned(TestBlockManager.java:388)
>   at org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.testOneOfTwoRacksDecommissioned(TestBlockManager.java:353)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
>   at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>   at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
>   at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>   at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>   at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
>   at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
>   at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
>   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
>   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
>   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
>   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
>   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
>   at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
>   at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
>   at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
>   at com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:47)
>   at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
>   at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
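The failure mode reported in HDFS-15182 can be shown in miniature. The sketch below is not Hadoop code and is not the attached patch: `FakeMetrics`, `FakeNameNode`, and `StaticMetricsDemo` are hypothetical stand-ins for a class that keeps its metrics in a static field which only some callers initialize. It shows why a test passes when an earlier test has set the field and fails with a NullPointerException when run alone, and how an explicit per-test initialization step removes the ordering dependency.

```java
// Hypothetical stand-in for a metrics object; not a Hadoop class.
class FakeMetrics {
  private int blocksAdded = 0;

  void incrBlocksAdded() {
    blocksAdded++;
  }

  int getBlocksAdded() {
    return blocksAdded;
  }
}

// Hypothetical stand-in for a class that holds its metrics in a static field.
class FakeNameNode {
  static FakeMetrics metrics; // stays null until some caller initializes it

  static void initMetrics() {
    if (metrics == null) {
      metrics = new FakeMetrics();
    }
  }

  // Dereferences the static field, so it throws NullPointerException
  // when no earlier code has called initMetrics().
  static void addBlock() {
    metrics.incrBlocksAdded();
  }
}

class StaticMetricsDemo {
  // Simulates running the single test in a fresh JVM: nothing set the field.
  static boolean failsWithoutInit() {
    FakeNameNode.metrics = null;
    try {
      FakeNameNode.addBlock();
      return false;
    } catch (NullPointerException e) {
      return true;
    }
  }

  // Simulates a setup step that initializes the metrics before the test body.
  static int succeedsWithInit() {
    FakeNameNode.metrics = null;
    FakeNameNode.initMetrics();
    FakeNameNode.addBlock();
    return FakeNameNode.metrics.getBlocksAdded();
  }

  public static void main(String[] args) {
    System.out.println("NPE without init: " + failsWithoutInit());
    System.out.println("blocks added with init: " + succeedsWithInit());
  }
}
```

In a JUnit 4 test class the same idea would typically live in a `@Before` method, so that each test is independent of execution order. Whether the actual patches do this is not stated in the message above.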
[jira] [Commented] (HDFS-15174) Optimize ReplicaCachingGetSpaceUsed by reducing unnecessary io operations
[ https://issues.apache.org/jira/browse/HDFS-15174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041619#comment-17041619 ]

Lisheng Sun commented on HDFS-15174:
------------------------------------

XiaoMi HDFS uses the in-memory replica size from replicaInfo.getBytesOnDisk(), which reduces I/O operations and disk pressure.

> Optimize ReplicaCachingGetSpaceUsed by reducing unnecessary io operations
> -------------------------------------------------------------------------
>
>                 Key: HDFS-15174
>                 URL: https://issues.apache.org/jira/browse/HDFS-15174
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Lisheng Sun
>            Assignee: Lisheng Sun
>            Priority: Major
>         Attachments: HDFS-15174-001.patch
>
>
> In ReplicaCachingGetSpaceUsed#refresh(), calculating the size of each block
> and of its meta file requires I/O operations. This puts pressure on disk
> performance when there are many blocks. HDFS-14313 is intended to reduce I/O
> operations, so get the block size from ReplicaInfo and the meta size from
> DataChecksum#getChecksumSize().
> {code:java}
>   @Override
>   protected void refresh() {
>     if (CollectionUtils.isNotEmpty(replicaInfos)) {
>       for (ReplicaInfo replicaInfo : replicaInfos) {
>         if (Objects.equals(replicaInfo.getVolume().getStorageID(),
>             volume.getStorageID())) {
>           dfsUsed += replicaInfo.getBlockDataLength();
>           dfsUsed += replicaInfo.getMetadataLength();
>           count++;
>         }
>       }
>     }
>   }
> {code}
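The idea of deriving the meta size from checksum parameters instead of asking the file system comes down to simple arithmetic. The sketch below is an illustration under stated assumptions, not the code in HDFS-15174-001.patch: `estimateMetaLength` is a hypothetical helper, and the header size (7 bytes), chunk size (512 bytes per checksum), and checksum size (4 bytes) used in `main` are assumed example values passed in as parameters.

```java
// Hypothetical sketch; not Hadoop code.
class MetaLengthSketch {
  /**
   * Estimates the meta file length for a block without touching the disk.
   * The result is header + (number of checksum chunks * bytes per checksum).
   * All three size parameters are assumptions supplied by the caller.
   */
  static long estimateMetaLength(long blockLength, int headerSize,
      int bytesPerChecksum, int checksumSize) {
    if (blockLength < 0 || bytesPerChecksum <= 0) {
      throw new IllegalArgumentException("invalid block or chunk length");
    }
    // Round up: a partially filled last chunk still carries one checksum.
    long chunks = (blockLength + bytesPerChecksum - 1) / bytesPerChecksum;
    return headerSize + chunks * checksumSize;
  }

  public static void main(String[] args) {
    long blockLength = 128L * 1024 * 1024; // a full 128 MiB block
    // 262144 chunks * 4 bytes + 7 bytes of header = 1048583
    System.out.println(estimateMetaLength(blockLength, 7, 512, 4));
  }
}
```

The estimate only needs the block length, which the datanode already holds in memory, so no `File.length()` call is required per replica.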
[jira] [Updated] (HDFS-15174) Optimize ReplicaCachingGetSpaceUsed by reducing unnecessary io operations
[ https://issues.apache.org/jira/browse/HDFS-15174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lisheng Sun updated HDFS-15174:
-------------------------------
    Attachment: HDFS-15174-001.patch
        Status: Patch Available  (was: Open)
[jira] [Comment Edited] (HDFS-15039) Cache meta file length of FinalizedReplica to reduce call File.length()
[ https://issues.apache.org/jira/browse/HDFS-15039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041583#comment-17041583 ]

Lisheng Sun edited comment on HDFS-15039 at 2/21/20 6:53 AM:
-------------------------------------------------------------

The problem of excessive I/O operations does exist.

LGTM.

HDFS-15174 optimizes ReplicaCachingGetSpaceUsed by reducing unnecessary I/O operations, e.g. by getting the block size from ReplicaInfo and the meta size from DataChecksum#getChecksumSize().

was (Author: leosun08):
The problem of excessive I/O operations does exist.

But there are also I/O operations in LocalReplica#getMetaFile in this patch.

HDFS-15174 optimizes ReplicaCachingGetSpaceUsed by reducing unnecessary I/O operations, e.g. by getting the block size from ReplicaInfo and the meta size from DataChecksum#getChecksumSize().

> Cache meta file length of FinalizedReplica to reduce call File.length()
> -----------------------------------------------------------------------
>
>                 Key: HDFS-15039
>                 URL: https://issues.apache.org/jira/browse/HDFS-15039
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode
>            Reporter: Yang Yun
>            Assignee: Yang Yun
>            Priority: Minor
>         Attachments: HDFS-15039.patch
>
>
> When ReplicaCachingGetSpaceUsed is used to get the volume space used, it
> calls File.length() for the meta file of every replica. That adds disk I/O;
> we found the slow log below. For a finalized replica the size of the meta
> file does not change, so I think we can cache the value.
> {code:java}
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.ReplicaCachingGetSpaceUsed:
> Refresh dfs used, bpid: BP-898717543-10.75.1.240-1519386995727 replicas
> size: 1166 dfsUsed: 72227113183 on volume:
> DS-3add8d62-d69a-4f5a-a29f-b7bbb400af2e duration: 17206ms{code}
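The caching idea in the HDFS-15039 description can be sketched as follows. This is a minimal illustration, not the contents of HDFS-15039.patch: `CachedMetaLengthSketch` is a hypothetical class standing in for a finalized replica, and the `diskReads` counter exists only to make the saved `File.length()` calls visible. It assumes the meta file no longer changes once the length has been read.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;

// Hypothetical stand-in for a finalized replica; not a Hadoop class.
class CachedMetaLengthSketch {
  private final File metaFile;
  private long cachedLength = -1; // -1 means "not read from disk yet"
  private int diskReads = 0;      // demo-only counter of File.length() calls

  CachedMetaLengthSketch(File metaFile) {
    this.metaFile = metaFile;
  }

  // Reads the length from disk once, then serves it from memory.
  synchronized long getMetadataLength() {
    if (cachedLength < 0) {
      cachedLength = metaFile.length();
      diskReads++;
    }
    return cachedLength;
  }

  synchronized int getDiskReads() {
    return diskReads;
  }

  // Writes an 11-byte temp file and queries its length twice.
  // Returns {first length, second length, number of disk reads}.
  static long[] demo() {
    try {
      File f = File.createTempFile("blk_sketch", ".meta");
      f.deleteOnExit();
      Files.write(f.toPath(), new byte[11]);
      CachedMetaLengthSketch replica = new CachedMetaLengthSketch(f);
      long first = replica.getMetadataLength();
      long second = replica.getMetadataLength();
      return new long[] {first, second, replica.getDiskReads()};
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    long[] r = demo();
    System.out.println("length=" + r[0] + " diskReads=" + r[2]);
  }
}
```

With roughly a thousand replicas per refresh, as in the slow log quoted above, this turns one disk call per replica per refresh into one disk call per replica in total.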
[jira] [Comment Edited] (HDFS-15039) Cache meta file length of FinalizedReplica to reduce call File.length()
[ https://issues.apache.org/jira/browse/HDFS-15039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041583#comment-17041583 ]

Lisheng Sun edited comment on HDFS-15039 at 2/21/20 6:49 AM:
-------------------------------------------------------------

The problem of excessive I/O operations does exist.

But there are also I/O operations in LocalReplica#getMetaFile in this patch.

HDFS-15174 optimizes ReplicaCachingGetSpaceUsed by reducing unnecessary I/O operations, e.g. by getting the block size from ReplicaInfo and the meta size from DataChecksum#getChecksumSize().

was (Author: leosun08):
LGTM.

The problem of excessive I/O operations does exist.

HDFS-15174 optimizes ReplicaCachingGetSpaceUsed by reducing unnecessary I/O operations, e.g. by getting the block size from ReplicaInfo and the meta size from DataChecksum#getChecksumSize().
[jira] [Commented] (HDFS-15187) CORRUPT replica mismatch between namenodes after failover
[ https://issues.apache.org/jira/browse/HDFS-15187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041603#comment-17041603 ]

Hadoop QA commented on HDFS-15187:
----------------------------------

| (x) -1 overall |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 1m 7s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| || || || trunk Compile Tests ||
| +1 | mvninstall | 21m 39s | trunk passed |
| +1 | compile | 1m 3s | trunk passed |
| +1 | checkstyle | 0m 46s | trunk passed |
| +1 | mvnsite | 1m 24s | trunk passed |
| +1 | shadedclient | 17m 37s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 2m 53s | trunk passed |
| +1 | javadoc | 0m 41s | trunk passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 1m 2s | the patch passed |
| +1 | compile | 0m 59s | the patch passed |
| +1 | javac | 0m 58s | the patch passed |
| +1 | checkstyle | 0m 38s | the patch passed |
| +1 | mvnsite | 1m 2s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 14m 1s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 2m 52s | the patch passed |
| +1 | javadoc | 0m 38s | the patch passed |
|| || || || Other Tests ||
| -1 | unit | 106m 38s | hadoop-hdfs in the patch passed. |
| +1 | asflicense | 0m 32s | The patch does not generate ASF License warnings. |
| | | 175m 17s | |

|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.server.namenode.TestDeadDatanode |
| | hadoop.hdfs.server.balancer.TestBalancer |
| | hadoop.hdfs.TestRollingUpgrade |

|| Subsystem || Report/Notes ||
| Docker | Client=19.03.6 Server=19.03.6 Image:yetus/hadoop:c44943d1fc3 |
| JIRA Issue | HDFS-15187 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12994046/HDFS-15187-02.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux b2d096ad0770 4.15.0-74-generic #84-Ubuntu SMP Thu Dec 19 08:06:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 7f35676 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_232 |
| findbugs | v3.1.0-RC1 |
| unit | https://builds.apache.org/job/PreCommit-HDFS-Build/28819/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/28819/testReport/ |
| Max. process+thread count | 3053 (vs. ulimit of 5500) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
| Console output |
[jira] [Comment Edited] (HDFS-15039) Cache meta file length of FinalizedReplica to reduce call File.length()
[ https://issues.apache.org/jira/browse/HDFS-15039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041583#comment-17041583 ]

Lisheng Sun edited comment on HDFS-15039 at 2/21/20 6:29 AM:
-------------------------------------------------------------

LGTM.

The problem of excessive I/O operations does exist.

HDFS-15174 optimizes ReplicaCachingGetSpaceUsed by reducing unnecessary I/O operations, e.g. by getting the block size from ReplicaInfo and the meta size from DataChecksum#getChecksumSize().

was (Author: leosun08):
LGTM.

HDFS-15174 optimizes ReplicaCachingGetSpaceUsed by reducing unnecessary I/O operations, e.g. by getting the block size from ReplicaInfo and the meta size from DataChecksum#getChecksumSize().
[jira] [Commented] (HDFS-15039) Cache meta file length of FinalizedReplica to reduce call File.length()
[ https://issues.apache.org/jira/browse/HDFS-15039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041583#comment-17041583 ]

Lisheng Sun commented on HDFS-15039:
------------------------------------

LGTM.

HDFS-15174 optimizes ReplicaCachingGetSpaceUsed by reducing unnecessary I/O operations, e.g. by getting the block size from ReplicaInfo and the meta size from DataChecksum#getChecksumSize().
[jira] [Commented] (HDFS-15176) Enable GcTimePercentage Metric in NameNode's JvmMetrics.
[ https://issues.apache.org/jira/browse/HDFS-15176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041582#comment-17041582 ]

Hadoop QA commented on HDFS-15176:
----------------------------------

| (x) -1 overall |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 34s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| || || || trunk Compile Tests ||
| 0 | mvndep | 1m 8s | Maven dependency ordering for branch |
| +1 | mvninstall | 17m 32s | trunk passed |
| +1 | compile | 14m 58s | trunk passed |
| +1 | checkstyle | 2m 39s | trunk passed |
| +1 | mvnsite | 2m 38s | trunk passed |
| +1 | shadedclient | 19m 12s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 4m 49s | trunk passed |
| +1 | javadoc | 1m 54s | trunk passed |
|| || || || Patch Compile Tests ||
| 0 | mvndep | 0m 24s | Maven dependency ordering for patch |
| +1 | mvninstall | 1m 46s | the patch passed |
| +1 | compile | 14m 18s | the patch passed |
| +1 | javac | 14m 18s | the patch passed |
| +1 | checkstyle | 2m 36s | the patch passed |
| +1 | mvnsite | 2m 34s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | xml | 0m 2s | The patch has no ill-formed XML file. |
| +1 | shadedclient | 13m 12s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 5m 2s | the patch passed |
| +1 | javadoc | 1m 55s | the patch passed |
|| || || || Other Tests ||
| +1 | unit | 8m 55s | hadoop-common in the patch passed. |
| -1 | unit | 95m 29s | hadoop-hdfs in the patch passed. |
| +1 | asflicense | 0m 58s | The patch does not generate ASF License warnings. |
| | | 211m 3s | |

|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.server.balancer.TestBalancerWithHANameNodes |

|| Subsystem || Report/Notes ||
| Docker | Client=19.03.6 Server=19.03.6 Image:yetus/hadoop:c44943d1fc3 |
| JIRA Issue | HDFS-15176 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12994059/HDFS-15176.005.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle xml |
| uname | Linux 0c7d3cd735b8 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 181e6d0 |
| maven | version:
[jira] [Commented] (HDFS-15171) Add a thread to call saveDfsUsed periodically, to prevent datanode too long restart time.
[ https://issues.apache.org/jira/browse/HDFS-15171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041574#comment-17041574 ]

zhuqi commented on HDFS-15171:
------------------------------

Hi [~weichiu],

There is no cache file if the datanode shuts down ungracefully, so changing dfs.datanode.cached-dfsused.check.interval.ms will not help in my case.

HDFS-14313 should reduce the refresh time; I will try it. Thanks.

> Add a thread to call saveDfsUsed periodically, to prevent datanode too long restart time.
> -----------------------------------------------------------------------------------------
>
>                 Key: HDFS-15171
>                 URL: https://issues.apache.org/jira/browse/HDFS-15171
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode
>    Affects Versions: 3.2.0
>            Reporter: zhuqi
>            Assignee: zhuqi
>            Priority: Major
>
> There are 30 storage dirs per datanode in our production cluster, and a
> restart takes too long because the datanode sometimes does not shut down
> gracefully. Currently only the datanode graceful shutdown hook and the
> BlockPoolSlice shutdown call saveDfsUsed, so a restarted datanode sometimes
> cannot reuse the dfsUsed cache. I think we can add a thread that calls
> saveDfsUsed periodically.
[jira] [Comment Edited] (HDFS-15171) Add a thread to call saveDfsUsed periodically, to prevent datanode too long restart time.
[ https://issues.apache.org/jira/browse/HDFS-15171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041574#comment-17041574 ]

zhuqi edited comment on HDFS-15171 at 2/21/20 6:08 AM:
-------------------------------------------------------

Hi [~weichiu],

There is no cache file if the datanode shuts down ungracefully, so changing dfs.datanode.cached-dfsused.check.interval.ms will not help in my case.

HDFS-14313 should reduce the refresh time; I will try it. Thanks.

was (Author: zhuqi):
Hi [~weichiu]

There are no cache file if the datanode shutdow ungracefully , change the dfs.datanode.cached-dfsused.check.interval.ms will not help my case.

The HDFS-14313 should can reduce the refresh time, i will try it. Thanks.
[jira] [Commented] (HDFS-15171) Add a thread to call saveDfsUsed periodically, to prevent datanode too long restart time.
[ https://issues.apache.org/jira/browse/HDFS-15171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041564#comment-17041564 ]

zhuqi commented on HDFS-15171:
------------------------------

Hi [~sodonnell],

Thanks for your patient reply.

First, the refresh thread in CachingGetSpaceUsed runs every 10 minutes, now with a random jitter that spreads out the refresh operations. If we persist the value to the cache file whenever it is refreshed, the cache is as close to real time as it can be.

Second, when the value is refreshed we can compare it with the previous one and skip the persist operation if they are the same, to reduce disk operations. To reduce disk operations further, we can add a configurable fixed time interval and persist the value to disk only when the elapsed time exceeds that interval.

Then we can remove the persist operation from the shutdown hook, and we no longer need to work out a suitable value for dfs.datanode.cached-dfsused.check.interval.ms. This would also resolve my problem, which is caused by the datanode shutting down ungracefully.

What do you think of my suggestion?
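The persist policy proposed in the comment above (persist on refresh, skip unchanged values, and rate-limit writes with a configurable interval) can be sketched in a few lines. This is a hypothetical illustration, not Hadoop code and not a patch for HDFS-15171: `DfsUsedPersister`, `onRefresh`, and the `saver` callback are invented names, the callback stands in for writing the cache file, and the caller supplies the clock value so the behaviour is deterministic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongConsumer;

// Hypothetical sketch of a "persist on refresh" policy; not Hadoop code.
class DfsUsedPersister {
  private final long minIntervalMs;  // assumed configurable minimum gap between writes
  private final LongConsumer saver;  // stands in for writing the dfsUsed cache file
  private boolean persistedOnce = false;
  private long lastPersistedValue;
  private long lastPersistTimeMs;

  DfsUsedPersister(long minIntervalMs, LongConsumer saver) {
    this.minIntervalMs = minIntervalMs;
    this.saver = saver;
  }

  /** Called after each refresh; returns true if the value was persisted. */
  synchronized boolean onRefresh(long dfsUsed, long nowMs) {
    if (persistedOnce) {
      if (dfsUsed == lastPersistedValue) {
        return false; // unchanged value: skip the disk write
      }
      if (nowMs - lastPersistTimeMs < minIntervalMs) {
        return false; // too soon after the previous write
      }
    }
    saver.accept(dfsUsed);
    persistedOnce = true;
    lastPersistedValue = dfsUsed;
    lastPersistTimeMs = nowMs;
    return true;
  }

  public static void main(String[] args) {
    List<Long> saved = new ArrayList<>();
    DfsUsedPersister p = new DfsUsedPersister(1000, v -> saved.add(v));
    p.onRefresh(100, 0);     // first value is always persisted
    p.onRefresh(100, 5000);  // same value, skipped
    p.onRefresh(200, 5500);  // changed and interval elapsed, persisted
    System.out.println(saved);
  }
}
```

Because the value is written during normal operation, a datanode that is killed without running its shutdown hook would still find a reasonably fresh cache file on restart, which is the case the reporter is concerned with.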
[jira] [Commented] (HDFS-15187) CORRUPT replica mismatch between namenodes after failover
[ https://issues.apache.org/jira/browse/HDFS-15187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041561#comment-17041561 ]

Vinayakumar B commented on HDFS-15187:
--------------------------------------

Thanks for the nice catch, [~ayushtkn].

The changes look fine to me.

+1, pending the typo change. Please confirm the test failures as well.

> CORRUPT replica mismatch between namenodes after failover
> ---------------------------------------------------------
>
>                 Key: HDFS-15187
>                 URL: https://issues.apache.org/jira/browse/HDFS-15187
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Ayush Saxena
>            Assignee: Ayush Saxena
>            Priority: Critical
>         Attachments: HDFS-15187-01.patch, HDFS-15187-02.patch
>
>
> A corrupt replica identified by the Active Namenode isn't identified by the
> other Namenode when it is failed over to Active, in case the replica was
> marked corrupt due to updatePipeline.
> Scenario to repro:
> 1. Create a file; while writing, turn one datanode down to trigger update
> pipeline.
> 2. Write some more data.
> 3. Close the file.
> 4. Turn on the shutdown datanode.
> 5. The replica in the datanode will be identified as CORRUPT and the corrupt
> count will be 1.
> 6. Failover to the other Namenode.
> 7. Wait for all pending IBR processing.
> 8. The corrupt count will not be the same, and FSCK won't show the corrupt
> replica.
> 9. Failover back to the first Namenode.
> 10. The corrupt count and corrupt replica will be there.
> The two Namenodes show different results.
[jira] [Updated] (HDFS-15052) WebHDFS getTrashRoot leads to OOM due to FileSystem object creation
[ https://issues.apache.org/jira/browse/HDFS-15052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Masatake Iwasaki updated HDFS-15052:
------------------------------------
    Fix Version/s: 2.10.1
                   3.2.2
                   3.1.4
                   3.3.0
     Hadoop Flags: Reviewed
       Resolution: Fixed
           Status: Resolved  (was: Patch Available)

Committed to trunk, branch-3.2, branch-3.1 and branch-2.10. Thanks, [~weichiu], [~sodonnell] and [~kihwal].

> WebHDFS getTrashRoot leads to OOM due to FileSystem object creation
> -------------------------------------------------------------------
>
>                 Key: HDFS-15052
>                 URL: https://issues.apache.org/jira/browse/HDFS-15052
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: webhdfs
>    Affects Versions: 2.9.0, 3.0.0-alpha2
>            Reporter: Wei-Chiu Chuang
>            Assignee: Masatake Iwasaki
>            Priority: Major
>             Fix For: 3.3.0, 3.1.4, 3.2.2, 2.10.1
>
>
> Quoting [~daryn] in HDFS-10756:
> {quote}Surprised nobody has discovered this will lead to an inevitable OOM in
> the NN. The NN should not be creating filesystems to itself, and must never
> create filesystems in a remote user's context or the cache will explode.
> {quote}
> I guess the problem lies inside NamenodeWebHdfsMethods#getTrashRoot
> {code:java}
>   private static String getTrashRoot(String fullPath,
>       Configuration conf) throws IOException {
>     FileSystem fs = FileSystem.get(conf != null ? conf : new Configuration());
>     return fs.getTrashRoot(
>         new org.apache.hadoop.fs.Path(fullPath)).toUri().getPath();
>   }
> {code}
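The "cache will explode" remark quoted in HDFS-15052 can be illustrated with a toy model. The sketch below is not Hadoop code: `FsCacheSketch` and its `Key` are hypothetical, and it assumes, as the quoted comment implies, that cached file system objects are keyed by the caller's user context as well as by the target URI. Under that assumption, a long-running server that looks up the cache once per remote user keeps one entry per distinct user and never frees them.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical toy model of a file system object cache; not Hadoop code.
class FsCacheSketch {
  // Assumed cache key: target URI plus the calling user's context.
  static final class Key {
    final String uri;
    final String user;

    Key(String uri, String user) {
      this.uri = uri;
      this.user = user;
    }

    @Override
    public boolean equals(Object o) {
      if (!(o instanceof Key)) {
        return false;
      }
      Key other = (Key) o;
      return uri.equals(other.uri) && user.equals(other.user);
    }

    @Override
    public int hashCode() {
      return Objects.hash(uri, user);
    }
  }

  private final Map<Key, Object> cache = new HashMap<>();

  // Returns the cached object for (uri, user), creating it on first use.
  Object get(String uri, String user) {
    return cache.computeIfAbsent(new Key(uri, user), k -> new Object());
  }

  int size() {
    return cache.size();
  }

  // Number of cached entries after serving the given number of distinct
  // remote users, either in each user's own context or in one fixed context.
  static int entriesAfter(int users, boolean perUserContext) {
    FsCacheSketch c = new FsCacheSketch();
    for (int i = 0; i < users; i++) {
      c.get("hdfs://nn", perUserContext ? "user" + i : "namenode");
    }
    return c.size();
  }

  public static void main(String[] args) {
    System.out.println("per-user context: " + entriesAfter(1000, true));
    System.out.println("fixed context:    " + entriesAfter(1000, false));
  }
}
```

The message above does not describe how the committed fix avoids the lookup, so the fixed-context branch here is only a contrast case, not a description of the patch.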
[jira] [Commented] (HDFS-15187) CORRUPT replica mismatch between namenodes after failover
[ https://issues.apache.org/jira/browse/HDFS-15187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041533#comment-17041533 ]

Íñigo Goiri commented on HDFS-15187:
------------------------------------
Small typo: isPreviousMessgageProcessed
[jira] [Commented] (HDFS-15177) Split datanode invalide block deletion, to avoid the FsDatasetImpl lock too much time.
[ https://issues.apache.org/jira/browse/HDFS-15177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041510#comment-17041510 ]

zhuqi commented on HDFS-15177:
------------------------------
cc [~sodonnell] Thanks for your patient reply. I will change it to fair.

> Split datanode invalide block deletion, to avoid the FsDatasetImpl lock too much time.
> --------------------------------------------------------------------------------------
>
> Key: HDFS-15177
> URL: https://issues.apache.org/jira/browse/HDFS-15177
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: datanode
> Reporter: zhuqi
> Assignee: zhuqi
> Priority: Major
> Attachments: image-2020-02-18-22-39-00-642.png,
> image-2020-02-18-22-51-28-624.png, image-2020-02-18-22-52-59-202.png,
> image-2020-02-18-22-55-38-661.png
>
> In our cluster, a datanode receives delete commands covering too many blocks
> when many block pools share the same datanode and the datanode has about 30
> storage dirs. This causes the FsDatasetImpl lock to be held for too long.
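The idea in the description above, splitting one large deletion so the lock is held only for short stretches, could look roughly like the sketch below. It is not the attached patch: `BatchedDeletionSketch`, `deleteInBatches` and the batch size are hypothetical, and reading "fair" in the reply as a fair lock is a guess.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Sketch only: delete blocks in small batches, releasing the lock between
// batches so other threads waiting on it can make progress.
class BatchedDeletionSketch {
    // Fair mode: the longest-waiting thread is favored when the lock frees up.
    private final ReentrantLock datasetLock = new ReentrantLock(true);
    private final List<Long> deleted = new ArrayList<>();
    private int lockAcquisitions = 0;

    void deleteInBatches(List<Long> blockIds, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        for (int start = 0; start < blockIds.size(); start += batchSize) {
            int end = Math.min(start + batchSize, blockIds.size());
            datasetLock.lock();
            try {
                lockAcquisitions++;
                // Stand-in for the real per-block bookkeeping.
                deleted.addAll(blockIds.subList(start, end));
            } finally {
                datasetLock.unlock();
            }
        }
    }

    int lockAcquisitions() {
        return lockAcquisitions;
    }

    int deletedCount() {
        return deleted.size();
    }

    public static void main(String[] args) {
        BatchedDeletionSketch sketch = new BatchedDeletionSketch();
        sketch.deleteInBatches(
            Arrays.asList(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L), 4);
        System.out.println("deleted: " + sketch.deletedCount());
        System.out.println("lock acquisitions: " + sketch.lockAcquisitions());
    }
}
```

The trade-off is more lock acquisitions in exchange for a shorter maximum hold time per acquisition.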
[jira] [Commented] (HDFS-15182) TestBlockManager#testOneOfTwoRacksDecommissioned() fail in trunk
[ https://issues.apache.org/jira/browse/HDFS-15182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041506#comment-17041506 ]

Hadoop QA commented on HDFS-15182:
----------------------------------
| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 44s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| || || || trunk Compile Tests ||
| +1 | mvninstall | 28m 48s | trunk passed |
| +1 | compile | 1m 30s | trunk passed |
| +1 | checkstyle | 1m 17s | trunk passed |
| +1 | mvnsite | 1m 55s | trunk passed |
| +1 | shadedclient | 23m 31s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 4m 36s | trunk passed |
| +1 | javadoc | 1m 0s | trunk passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 1m 32s | the patch passed |
| +1 | compile | 1m 38s | the patch passed |
| +1 | javac | 1m 38s | the patch passed |
| +1 | checkstyle | 1m 6s | the patch passed |
| +1 | mvnsite | 1m 40s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 20m 13s | patch has no errors when building and testing our client artifacts. |
| -1 | findbugs | 1m 42s | hadoop-hdfs in the patch failed. |
| +1 | javadoc | 0m 59s | the patch passed |
|| || || || Other Tests ||
| -1 | unit | 17m 0s | hadoop-hdfs in the patch passed. |
| -1 | asflicense | 0m 43s | The patch generated 22 ASF License warnings. |
| | | 109m 47s | |

|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.TestDFSOutputStream |
| | hadoop.hdfs.TestDatanodeReport |
| | hadoop.hdfs.TestFileStatusWithDefaultECPolicy |
| | hadoop.hdfs.TestAppendSnapshotTruncate |

|| Subsystem || Report/Notes ||
| Docker | Client=19.03.6 Server=19.03.6 Image:yetus/hadoop:c44943d1fc3 |
| JIRA Issue | HDFS-15182 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12994055/HDFS-15182-003.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux dc9b81484b1c 4.15.0-74-generic #84-Ubuntu SMP Thu Dec 19 08:06:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 181e6d0 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_242 |
| findbugs | v3.1.0-RC1 |
| findbugs | https://builds.apache.org/job/PreCommit-HDFS-Build/28817/artifact/out/patch-findbugs-hadoop-hdfs-project_hadoop-hdfs.txt |
| unit | https://builds.apache.org/job/PreCommit-HDFS-Build/28817/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/28817/testReport/ |
| asflicense |
[jira] [Commented] (HDFS-15052) WebHDFS getTrashRoot leads to OOM due to FileSystem object creation
[ https://issues.apache.org/jira/browse/HDFS-15052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041489#comment-17041489 ]

Hudson commented on HDFS-15052:
-------------------------------
SUCCESS: Integrated in Jenkins build Hadoop-trunk-Commit #17971 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/17971/])
HDFS-15052. WebHDFS getTrashRoot leads to OOM due to FileSystem object (github: rev 2338d25dc7150d75fbda84cc95422380b564)
* (edit) hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/web/TestWebHDFS.java
* (edit) hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DistributedFileSystem.java
* (edit) hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSUtilClient.java
* (edit) hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/web/resources/NamenodeWebHdfsMethods.java
[jira] [Commented] (HDFS-15185) StartupProgress reports edits segments until the entire startup completes
[ https://issues.apache.org/jira/browse/HDFS-15185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041488#comment-17041488 ]

Hadoop QA commented on HDFS-15185:
----------------------------------
| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 21s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| || || || trunk Compile Tests ||
| +1 | mvninstall | 20m 10s | trunk passed |
| +1 | compile | 1m 0s | trunk passed |
| +1 | checkstyle | 0m 43s | trunk passed |
| +1 | mvnsite | 1m 7s | trunk passed |
| +1 | shadedclient | 16m 9s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 2m 48s | trunk passed |
| +1 | javadoc | 0m 39s | trunk passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 1m 0s | the patch passed |
| +1 | compile | 0m 56s | the patch passed |
| +1 | javac | 0m 56s | the patch passed |
| +1 | checkstyle | 0m 37s | the patch passed |
| +1 | mvnsite | 1m 0s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 14m 19s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 2m 53s | the patch passed |
| +1 | javadoc | 0m 35s | the patch passed |
|| || || || Other Tests ||
| -1 | unit | 98m 25s | hadoop-hdfs in the patch passed. |
| +1 | asflicense | 0m 55s | The patch does not generate ASF License warnings. |
| | | 163m 45s | |

|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.web.TestWebHDFS |
| | hadoop.hdfs.TestSafeModeWithStripedFile |

|| Subsystem || Report/Notes ||
| Docker | Client=19.03.4 Server=19.03.4 Image:yetus/hadoop:c44943d1fc3 |
| JIRA Issue | HDFS-15185 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12993964/HDFS-15185.001.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux 45add0e916c7 4.15.0-74-generic #84-Ubuntu SMP Thu Dec 19 08:06:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 181e6d0 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_242 |
| findbugs | v3.1.0-RC1 |
| unit | https://builds.apache.org/job/PreCommit-HDFS-Build/28816/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/28816/testReport/ |
| Max. process+thread count | 3185 (vs. ulimit of 5500) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
| Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/28816/console |
| Powered by | Apache Yetus 0.8.0
[jira] [Commented] (HDFS-15187) CORRUPT replica mismatch between namenodes after failover
[ https://issues.apache.org/jira/browse/HDFS-15187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041486#comment-17041486 ]

Hadoop QA commented on HDFS-15187:
----------------------------------
| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 40s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| || || || trunk Compile Tests ||
| +1 | mvninstall | 20m 19s | trunk passed |
| +1 | compile | 1m 4s | trunk passed |
| +1 | checkstyle | 0m 45s | trunk passed |
| +1 | mvnsite | 1m 15s | trunk passed |
| +1 | shadedclient | 16m 54s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 3m 42s | trunk passed |
| +1 | javadoc | 0m 48s | trunk passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 1m 27s | the patch passed |
| +1 | compile | 1m 8s | the patch passed |
| +1 | javac | 1m 8s | the patch passed |
| +1 | checkstyle | 0m 42s | the patch passed |
| +1 | mvnsite | 1m 18s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 16m 44s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 3m 55s | the patch passed |
| +1 | javadoc | 0m 47s | the patch passed |
|| || || || Other Tests ||
| -1 | unit | 147m 7s | hadoop-hdfs in the patch passed. |
| +1 | asflicense | 0m 42s | The patch does not generate ASF License warnings. |
| | | 220m 0s | |

|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.server.namenode.TestDecommissioningStatus |
| | hadoop.hdfs.server.blockmanagement.TestBlockTokenWithDFSStriped |
| | hadoop.hdfs.server.namenode.TestNamenodeCapacityReport |
| | hadoop.hdfs.server.namenode.TestCheckPointForSecurityTokens |
| | hadoop.hdfs.server.namenode.TestProtectedDirectories |
| | hadoop.hdfs.server.namenode.TestFSNamesystemMBean |
| | hadoop.hdfs.server.balancer.TestBalancer |
| | hadoop.hdfs.server.namenode.TestStartup |
| | hadoop.hdfs.server.namenode.TestFSImageWithSnapshot |
| | hadoop.hdfs.server.namenode.TestFileContextAcl |
| | hadoop.hdfs.server.namenode.TestFSImageWithAcl |

|| Subsystem || Report/Notes ||
| Docker | Client=19.03.6 Server=19.03.6 Image:yetus/hadoop:c44943d1fc3 |
| JIRA Issue | HDFS-15187 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12994046/HDFS-15187-02.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux 973f6db62f7d 4.15.0-74-generic #84-Ubuntu SMP Thu Dec 19 08:06:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 181e6d0 |
| maven | version:
[jira] [Commented] (HDFS-15176) Enable GcTimePercentage Metric in NameNode's JvmMetrics.
[ https://issues.apache.org/jira/browse/HDFS-15176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041475#comment-17041475 ]

Jinglun commented on HDFS-15176:
--------------------------------
I agree with [~weichiu]. Patch v05 enables GcTimeMonitor by default.

> Enable GcTimePercentage Metric in NameNode's JvmMetrics.
> --------------------------------------------------------
>
> Key: HDFS-15176
> URL: https://issues.apache.org/jira/browse/HDFS-15176
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Jinglun
> Assignee: Jinglun
> Priority: Minor
> Attachments: HDFS-15176.001.patch, HDFS-15176.002.patch,
> HDFS-15176.003.patch, HDFS-15176.004.patch, HDFS-15176.005.patch
>
> The GcTimePercentage (computed by GcTimeMonitor) could be used as a dimension
> to analyze the NameNode GC. We should add a switch config to enable the
> GcTimePercentage metric in HDFS.
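For readers unfamiliar with the metric discussed above, a GC time percentage can be approximated from the standard JVM management beans. The sketch below is not Hadoop's GcTimeMonitor: `GcTimePercentageSketch` and its methods are hypothetical, and it assumes the percentage means GC milliseconds divided by wall-clock milliseconds over an observation window.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Sketch only: approximate "percent of wall-clock time spent in GC"
// over a window, using the standard GarbageCollectorMXBean API.
class GcTimePercentageSketch {
    // Sums accumulated collection time (ms) across all collectors.
    // getCollectionTime() returns -1 when undefined, so that is skipped.
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean bean
                : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = bean.getCollectionTime();
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    // Integer percentage of the window spent in GC, capped at 100.
    static int percentage(long gcMillisInWindow, long windowMillis) {
        if (windowMillis <= 0) {
            throw new IllegalArgumentException("window must be positive");
        }
        return (int) Math.min(100, gcMillisInWindow * 100 / windowMillis);
    }

    public static void main(String[] args) throws InterruptedException {
        long gcStart = totalGcMillis();
        long wallStart = System.nanoTime();
        Thread.sleep(200); // observation window
        long windowMs =
            Math.max(1, (System.nanoTime() - wallStart) / 1_000_000);
        int pct = percentage(totalGcMillis() - gcStart, windowMs);
        System.out.println("GC time percentage: " + pct);
    }
}
```

Sampling two counters per window is cheap, which is in line with the remark in the thread that the monitor does not look like a heavyweight operation.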
[jira] [Updated] (HDFS-15176) Enable GcTimePercentage Metric in NameNode's JvmMetrics.
[ https://issues.apache.org/jira/browse/HDFS-15176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jinglun updated HDFS-15176:
--------------------------
Attachment: HDFS-15176.005.patch
[jira] [Commented] (HDFS-15172) Remove unnecessary deadNodeDetectInterval in DeadNodeDetector#checkDeadNodes()
[ https://issues.apache.org/jira/browse/HDFS-15172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041436#comment-17041436 ]

Lisheng Sun commented on HDFS-15172:
------------------------------------
Hi [~elgoiri], should we commit it to trunk? HDFS-15149 depends on it. Thank you.

> Remove unnecessary deadNodeDetectInterval in DeadNodeDetector#checkDeadNodes()
> -------------------------------------------------------------------------------
>
> Key: HDFS-15172
> URL: https://issues.apache.org/jira/browse/HDFS-15172
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Reporter: Lisheng Sun
> Assignee: Lisheng Sun
> Priority: Major
> Attachments: HDFS-15172-001.patch, HDFS-15172-002.patch
>
> Every call to checkDeadNodes() will change the state to IDLE, forcing the
> DeadNodeDetector to sleep for IDLE_SLEEP_MS. So we don't need
> deadNodeDetectInterval between every checkDeadNodes().
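The argument in the description above can be traced with a small state-machine model. This is a sketch, not the DeadNodeDetector source: `DetectorLoopSketch`, its three states and the 10 ms sleep value are assumptions made for illustration.

```java
// Sketch only: every check ends in IDLE, and IDLE always sleeps, so the
// loop is already rate-limited without a separate interval between checks.
class DetectorLoopSketch {
    enum State { INIT, CHECK_DEAD, IDLE }

    static final long IDLE_SLEEP_MS = 10; // hypothetical value

    State state = State.INIT;
    int checks = 0;
    long sleptMs = 0;

    // Stand-in for the real probing work; always hands over to IDLE.
    void checkDeadNodes() {
        checks++;
        state = State.IDLE;
    }

    // IDLE sleeps once, then schedules the next check.
    void idle() {
        try {
            Thread.sleep(IDLE_SLEEP_MS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        sleptMs += IDLE_SLEEP_MS;
        state = State.CHECK_DEAD;
    }

    // One iteration of the detector loop.
    void step() {
        switch (state) {
            case INIT:
                state = State.CHECK_DEAD;
                break;
            case CHECK_DEAD:
                checkDeadNodes();
                break;
            case IDLE:
                idle();
                break;
        }
    }

    public static void main(String[] args) {
        DetectorLoopSketch loop = new DetectorLoopSketch();
        for (int i = 0; i < 7; i++) {
            loop.step();
        }
        System.out.println("checks: " + loop.checks);
        System.out.println("slept ms: " + loop.sleptMs);
    }
}
```

In this model each check is followed by exactly one idle sleep, so an extra interval inside the check would only add to the delay.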
[jira] [Commented] (HDFS-15182) TestBlockManager#testOneOfTwoRacksDecommissioned() fail in trunk
[ https://issues.apache.org/jira/browse/HDFS-15182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041432#comment-17041432 ]

Lisheng Sun commented on HDFS-15182:
------------------------------------
Doing the init in @Before affects the testAsyncIBR UT, so I took it out of @Before.

> TestBlockManager#testOneOfTwoRacksDecommissioned() fail in trunk
> ----------------------------------------------------------------
>
> Key: HDFS-15182
> URL: https://issues.apache.org/jira/browse/HDFS-15182
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Lisheng Sun
> Assignee: Lisheng Sun
> Priority: Minor
> Attachments: HDFS-15182-001.patch, HDFS-15182-002.patch,
> HDFS-15182-003.patch
>
> When only the UT TestBlockManager#testOneOfTwoRacksDecommissioned() is run, it
> fails with a NullPointerException.
> NameNode#metrics is a static variable, so when all UTs in TestBlockManager are
> run, another UT has already initialized the metrics.
> Running only testOneOfTwoRacksDecommissioned, without initializing the
> metrics, throws a NullPointerException.
> {code:java}
> java.lang.NullPointerException
> at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.addBlock(BlockManager.java:4088)
> at org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.fulfillPipeline(TestBlockManager.java:518)
> at org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.doTestOneOfTwoRacksDecommissioned(TestBlockManager.java:388)
> at org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.testOneOfTwoRacksDecommissioned(TestBlockManager.java:353)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
> at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
> at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
> at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
> at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
> at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
> at com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:47)
> at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
> at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
> {code}
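The failure mode in the description above, a static field that is only initialized as a side effect of some other test, can be reproduced in miniature. The names below (`StaticMetricsSketch`, `Metrics`, `initMetrics`, `addBlock`) are hypothetical stand-ins and not the Hadoop classes.

```java
// Sketch only: a static field left null unless some other code path has
// initialized it first, which is why a test can pass in a full run and
// fail when executed in isolation.
class StaticMetricsSketch {
    static final class Metrics {
        int blocksAdded;

        void incrBlocksAdded() {
            blocksAdded++;
        }
    }

    // Shared static state; null until initMetrics() runs.
    static Metrics metrics;

    static void initMetrics() {
        if (metrics == null) {
            metrics = new Metrics();
        }
    }

    // Dereferences the static field without checking it.
    static void addBlock() {
        metrics.incrBlocksAdded();
    }

    public static void main(String[] args) {
        try {
            addBlock(); // isolated run: nothing has initialized metrics
            System.out.println("no exception");
        } catch (NullPointerException e) {
            System.out.println("NullPointerException: metrics not initialized");
        }
        initMetrics(); // what another test happens to do in a full run
        addBlock();
        System.out.println("blocks added after init: " + metrics.blocksAdded);
    }
}
```

Initializing the field explicitly in the test that needs it removes the dependence on test execution order.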
[jira] [Updated] (HDFS-15182) TestBlockManager#testOneOfTwoRacksDecommissioned() fail in trunk
[ https://issues.apache.org/jira/browse/HDFS-15182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lisheng Sun updated HDFS-15182:
------------------------------
Attachment: HDFS-15182-003.patch
[jira] [Commented] (HDFS-15039) Cache meta file length of FinalizedReplica to reduce call File.length()
[ https://issues.apache.org/jira/browse/HDFS-15039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041383#comment-17041383 ]

Wei-Chiu Chuang commented on HDFS-15039:
----------------------------------------
Looks good to me. [~leosun08] fyi

> Cache meta file length of FinalizedReplica to reduce call File.length()
> -----------------------------------------------------------------------
>
> Key: HDFS-15039
> URL: https://issues.apache.org/jira/browse/HDFS-15039
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: datanode
> Reporter: Yang Yun
> Assignee: Yang Yun
> Priority: Minor
> Attachments: HDFS-15039.patch
>
> When ReplicaCachingGetSpaceUsed is used to get the volume space used, it
> calls File.length() for every replica's meta file. That adds disk IO; we
> found the slow log below. For a finalized replica the size of the meta file
> does not change, so I think we can cache the value.
> {code:java}
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.ReplicaCachingGetSpaceUsed:
> Refresh dfs used, bpid: BP-898717543-10.75.1.240-1519386995727 replicas
> size: 1166 dfsUsed: 72227113183 on volume:
> DS-3add8d62-d69a-4f5a-a29f-b7bbb400af2e duration: 17206ms{code}
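The caching idea proposed above is simple enough to sketch. This is not the attached patch: `CachedMetaLengthSketch` and its members are hypothetical, and the sketch assumes the file never changes after the first read, as the description states for finalized replicas.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;

// Sketch only: read File.length() once and serve later calls from memory,
// on the assumption that the file is immutable after the first read.
class CachedMetaLengthSketch {
    private final File metaFile;
    private long cachedLength = -1; // -1 means "not read yet"
    private int diskReads = 0;

    CachedMetaLengthSketch(File metaFile) {
        this.metaFile = metaFile;
    }

    synchronized long getMetaLength() {
        if (cachedLength < 0) {
            cachedLength = metaFile.length(); // the only disk access
            diskReads++;
        }
        return cachedLength;
    }

    synchronized int diskReads() {
        return diskReads;
    }

    // Helper for the demo: builds a sketch over a temp file of given size.
    static CachedMetaLengthSketch forTempFile(int bytes) {
        try {
            File f = File.createTempFile("blk_sketch", ".meta");
            f.deleteOnExit();
            Files.write(f.toPath(), new byte[bytes]);
            return new CachedMetaLengthSketch(f);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        CachedMetaLengthSketch replica = forTempFile(11);
        for (int i = 0; i < 3; i++) {
            replica.getMetaLength();
        }
        System.out.println("meta length: " + replica.getMetaLength());
        System.out.println("disk reads: " + replica.diskReads());
    }
}
```

A cache like this is only safe while the immutability assumption holds; a replica whose meta file can still change would need the value refreshed.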
[jira] [Commented] (HDFS-15171) Add a thread to call saveDfsUsed periodically, to prevent datanode too long restart time.
[ https://issues.apache.org/jira/browse/HDFS-15171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041367#comment-17041367 ]

Wei-Chiu Chuang commented on HDFS-15171:
----------------------------------------
The first thing that comes to my mind is increasing dfs.datanode.cached-dfsused.check.interval.ms, like Stephen said. Otherwise, HDFS-14313 may be useful too.

> Add a thread to call saveDfsUsed periodically, to prevent datanode too long restart time.
> ------------------------------------------------------------------------------------------
>
> Key: HDFS-15171
> URL: https://issues.apache.org/jira/browse/HDFS-15171
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: datanode
> Affects Versions: 3.2.0
> Reporter: zhuqi
> Assignee: zhuqi
> Priority: Major
>
> There are 30 storage dirs per datanode in our production cluster, and a
> restart takes too long because the datanode sometimes does not shut down
> gracefully. Currently only the datanode's graceful shutdown hook and the
> blockpoolslice shutdown call the saveDfsUsed function, so a restarted
> datanode sometimes cannot reuse the dfsUsed cache. I think we can add a
> thread that periodically calls the saveDfsUsed function.
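The proposal in the description above, a background thread that saves the value periodically, might be sketched with a scheduled executor as below. This only illustrates the proposal and is not a committed design: `PeriodicSaveSketch` and its methods are hypothetical, and `saveDfsUsed` here is a counter standing in for the real persistence step.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch only: persist a cached value on a timer so that an ungraceful
// shutdown loses at most one period of updates.
class PeriodicSaveSketch {
    private final AtomicInteger saves = new AtomicInteger();
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "dfsused-saver-sketch");
            t.setDaemon(true); // do not keep the JVM alive
            return t;
        });

    // Hypothetical stand-in for writing the cached dfsUsed value to disk.
    void saveDfsUsed() {
        saves.incrementAndGet();
    }

    void start(long periodMs) {
        scheduler.scheduleWithFixedDelay(
            this::saveDfsUsed, periodMs, periodMs, TimeUnit.MILLISECONDS);
    }

    // Graceful path: stop the timer, then do one final save.
    void stop() {
        scheduler.shutdownNow();
        saveDfsUsed();
    }

    int saveCount() {
        return saves.get();
    }

    // Waits until at least n saves have happened or the timeout expires.
    boolean awaitSaves(int n, long timeoutMs) {
        long deadline =
            System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (saves.get() < n) {
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(5);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        PeriodicSaveSketch saver = new PeriodicSaveSketch();
        saver.start(10);
        boolean reached = saver.awaitSaves(2, 5000);
        saver.stop();
        System.out.println("periodic saves observed: " + reached);
    }
}
```

The reply in the thread points at an existing interval setting and at HDFS-14313 as alternatives, so a dedicated thread is one option among several.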
[jira] [Commented] (HDFS-15176) Enable GcTimePercentage Metric in NameNode's JvmMetrics.
[ https://issues.apache.org/jira/browse/HDFS-15176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041364#comment-17041364 ]

Wei-Chiu Chuang commented on HDFS-15176:

Thanks, I didn't even realize there's a GcTimeMonitor in our code. Shouldn't we enable it by default? It doesn't look like a heavyweight operation...

> Enable GcTimePercentage Metric in NameNode's JvmMetrics.
>
> Key: HDFS-15176
> URL: https://issues.apache.org/jira/browse/HDFS-15176
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Jinglun
> Assignee: Jinglun
> Priority: Minor
> Attachments: HDFS-15176.001.patch, HDFS-15176.002.patch, HDFS-15176.003.patch, HDFS-15176.004.patch
>
> The GcTimePercentage metric (computed by GcTimeMonitor) could be used as a dimension
> for analyzing NameNode GC. We should add a config switch to enable the
> GcTimePercentage metric in HDFS.
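For readers unfamiliar with the metric, a GC time percentage is essentially the share of an observation window that the JVM spent in garbage collection. The sketch below shows one way to compute such a figure from the standard `GarbageCollectorMXBean` API. It is not Hadoop's GcTimeMonitor; the class and method names are hypothetical and the windowing is deliberately simplified.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Illustrative only; not the GcTimeMonitor implementation. */
class GcTimePercentageSketch {

  /** Sums the accumulated collection time (ms) over all collectors. */
  static long totalGcTimeMs() {
    long total = 0;
    for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
      long t = gc.getCollectionTime(); // -1 when the collector does not report it
      if (t > 0) {
        total += t;
      }
    }
    return total;
  }

  /** Percentage of the window spent in GC, clamped to [0, 100]. */
  static int percentage(long gcDeltaMs, long windowMs) {
    if (windowMs <= 0) {
      return 0;
    }
    long pct = gcDeltaMs * 100 / windowMs;
    return (int) Math.max(0, Math.min(100, pct));
  }

  public static void main(String[] args) throws InterruptedException {
    long gcBefore = totalGcTimeMs();
    long start = System.currentTimeMillis();
    Thread.sleep(100); // observation window
    long window = System.currentTimeMillis() - start;
    System.out.println("GC time %: " + percentage(totalGcTimeMs() - gcBefore, window));
  }
}
```

Sampling two counters and dividing is cheap, which is consistent with the remark above that the monitor does not look like a heavyweight operation.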
[jira] [Updated] (HDFS-15187) CORRUPT replica mismatch between namenodes after failover
[ https://issues.apache.org/jira/browse/HDFS-15187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ayush Saxena updated HDFS-15187:
    Attachment: HDFS-15187-02.patch

> CORRUPT replica mismatch between namenodes after failover
> -
>
> Key: HDFS-15187
> URL: https://issues.apache.org/jira/browse/HDFS-15187
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Ayush Saxena
> Assignee: Ayush Saxena
> Priority: Critical
> Attachments: HDFS-15187-01.patch, HDFS-15187-02.patch
>
> A corrupt replica identified by the Active Namenode isn't identified by the
> other Namenode after it fails over to Active, in the case where the replica
> was marked corrupt due to updatePipeline.
> Scenario to reproduce:
> 1. Create a file and, while writing, turn one datanode down to trigger an update pipeline.
> 2. Write some more data.
> 3. Close the file.
> 4. Turn the shut-down datanode back on.
> 5. The replica on that datanode will be identified as CORRUPT and the corrupt count will be 1.
> 6. Fail over to the other Namenode.
> 7. Wait for all pending IBR processing.
> 8. The corrupt count will not be the same, and FSCK won't show the corrupt replica.
> 9. Fail over back to the first Namenode.
> 10. The corrupt count and the corrupt replica will be there again.
> The two Namenodes report different results.
[jira] [Updated] (HDFS-15187) CORRUPT replica mismatch between namenodes after failover
[ https://issues.apache.org/jira/browse/HDFS-15187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ayush Saxena updated HDFS-15187:
    Attachment: HDFS-15187-02.patch

> CORRUPT replica mismatch between namenodes after failover
> -
>
> Key: HDFS-15187
> URL: https://issues.apache.org/jira/browse/HDFS-15187
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Ayush Saxena
> Assignee: Ayush Saxena
> Priority: Critical
> Attachments: HDFS-15187-01.patch
>
> A corrupt replica identified by the Active Namenode isn't identified by the
> other Namenode after it fails over to Active, in the case where the replica
> was marked corrupt due to updatePipeline.
> Scenario to reproduce:
> 1. Create a file and, while writing, turn one datanode down to trigger an update pipeline.
> 2. Write some more data.
> 3. Close the file.
> 4. Turn the shut-down datanode back on.
> 5. The replica on that datanode will be identified as CORRUPT and the corrupt count will be 1.
> 6. Fail over to the other Namenode.
> 7. Wait for all pending IBR processing.
> 8. The corrupt count will not be the same, and FSCK won't show the corrupt replica.
> 9. Fail over back to the first Namenode.
> 10. The corrupt count and the corrupt replica will be there again.
> The two Namenodes report different results.
[jira] [Updated] (HDFS-15187) CORRUPT replica mismatch between namenodes after failover
[ https://issues.apache.org/jira/browse/HDFS-15187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ayush Saxena updated HDFS-15187:
    Attachment: (was: HDFS-15187-02.patch)

> CORRUPT replica mismatch between namenodes after failover
> -
>
> Key: HDFS-15187
> URL: https://issues.apache.org/jira/browse/HDFS-15187
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Ayush Saxena
> Assignee: Ayush Saxena
> Priority: Critical
> Attachments: HDFS-15187-01.patch
>
> A corrupt replica identified by the Active Namenode isn't identified by the
> other Namenode after it fails over to Active, in the case where the replica
> was marked corrupt due to updatePipeline.
> Scenario to reproduce:
> 1. Create a file and, while writing, turn one datanode down to trigger an update pipeline.
> 2. Write some more data.
> 3. Close the file.
> 4. Turn the shut-down datanode back on.
> 5. The replica on that datanode will be identified as CORRUPT and the corrupt count will be 1.
> 6. Fail over to the other Namenode.
> 7. Wait for all pending IBR processing.
> 8. The corrupt count will not be the same, and FSCK won't show the corrupt replica.
> 9. Fail over back to the first Namenode.
> 10. The corrupt count and the corrupt replica will be there again.
> The two Namenodes report different results.
[jira] [Commented] (HDFS-15186) Erasure Coding: Decommission may generate the parity block's content with all 0 in some case
[ https://issues.apache.org/jira/browse/HDFS-15186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041319#comment-17041319 ]

Wei-Chiu Chuang commented on HDFS-15186:

[~ferhui] would you be interested in reviewing this patch? You've fixed a number of critical EC decomm bugs in the past. Thanks.

> Erasure Coding: Decommission may generate the parity block's content with all 0 in some case
>
> Key: HDFS-15186
> URL: https://issues.apache.org/jira/browse/HDFS-15186
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, erasure-coding
> Affects Versions: 3.0.3, 3.2.1, 3.1.3
> Reporter: Yao Guangdong
> Assignee: Yao Guangdong
> Priority: Critical
> Attachments: HDFS-15186.001.patch
>
> I found parity blocks whose content is all zeros after decommissioning several
> DataNodes (more than 1) from a cluster. The probability is high (parts per
> thousand). This is a serious problem: if we read data from a zeroed parity block,
> or use a zeroed parity block to recover another block, we end up using wrong data
> without knowing it.
> Some example cases are listed below, where
> B: Busy DataNode,
> D: Decommissioning DataNode,
> and all other DataNodes are normal.
> 1. Group indices are [0, 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)].
> 2. Group indices are [0(B,D), 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)].
>
> In the first case, where the block group indices are [0, 1, 2, 3, 4, 5, 6(B,D),
> 7, 8(D)], the DN may receive a block reconstruction command with
> liveIndices=[0, 1, 2, 3, 4, 5, 7, 8] and a targets field (in the class
> StripedReconstructionInfo) of length 2.
> A targets length of 2 means that, in the current code, the DataNode needs to
> recover 2 internal blocks. But only 1 missing block can be found from liveIndices,
> so the method StripedWriter#initTargetIndices uses 0 as the default index of the
> other block to recover, regardless of whether index 0 is among the source indices.
> The EC algorithm is then invoked with source indices [0, 1, 2, 3, 4, 5] to recover
> indices [6, 0]. In this case index 0 is in both the source indices and the target
> indices, and the target buffer that the EC algorithm returns for index 6 is always
> all zeros. So at first I thought this was a problem in the EC algorithm, which
> should be more fault tolerant. I tried to fix it there, but that is too hard
> because there are too many cases. The second case in the example above is another
> one (source indices [1, 2, 3, 4, 5, 7] are used to recover indices [0, 6, 0]).
> So I changed my approach: invoke the EC algorithm with correct parameters, which
> means removing the duplicate target index 0 in this case. That is how I finally
> fixed it.
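The index arithmetic in the HDFS-15186 description can be reproduced with a small model. The code below is not the real StripedWriter#initTargetIndices; `TargetIndicesSketch`, `naiveTargets`, and `missingOnly` are hypothetical names, and the model only assumes what the description states: the target array is sized from the targets field, and slots that no missing index fills keep the default value 0.

```java
import java.util.Arrays;
import java.util.BitSet;

/** Simplified model of the behaviour described in the issue; not Hadoop code. */
class TargetIndicesSketch {

  /** Fills a fixed-size target array; unfilled slots keep Java's default 0. */
  static int[] naiveTargets(int[] liveIndices, int totalBlocks, int numTargets) {
    BitSet live = new BitSet(totalBlocks);
    for (int i : liveIndices) {
      live.set(i);
    }
    int[] targets = new int[numTargets]; // zero-initialized
    int filled = 0;
    for (int i = 0; i < totalBlocks && filled < numTargets; i++) {
      if (!live.get(i)) {
        targets[filled++] = i;
      }
    }
    return targets;
  }

  /** Returns only the indices that are really missing, with no default entries. */
  static int[] missingOnly(int[] liveIndices, int totalBlocks, int numTargets) {
    int[] naive = naiveTargets(liveIndices, totalBlocks, numTargets);
    BitSet live = new BitSet(totalBlocks);
    for (int i : liveIndices) {
      live.set(i);
    }
    // Keep an entry only if it is not live and has not been seen before.
    BitSet seen = new BitSet(totalBlocks);
    return Arrays.stream(naive)
        .filter(i -> !live.get(i) && !seen.get(i))
        .peek(seen::set)
        .toArray();
  }

  public static void main(String[] args) {
    int[] case1 = {0, 1, 2, 3, 4, 5, 7, 8};
    int[] case2 = {1, 2, 3, 4, 5, 7, 8};
    System.out.println(Arrays.toString(naiveTargets(case1, 9, 2)));
    System.out.println(Arrays.toString(missingOnly(case1, 9, 2)));
    System.out.println(Arrays.toString(naiveTargets(case2, 9, 3)));
    System.out.println(Arrays.toString(missingOnly(case2, 9, 3)));
  }
}
```

In this model the first case yields targets [6, 0], where index 0 is also a source, and the second yields [0, 6, 0], where index 0 is duplicated. Dropping the defaulted entries leaves [6] and [0, 6], which corresponds to the "correct parameters" approach the reporter describes.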
[jira] [Commented] (HDFS-15186) Erasure Coding: Decommission may generate the parity block's content with all 0 in some case
[ https://issues.apache.org/jira/browse/HDFS-15186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041306#comment-17041306 ]

Wei-Chiu Chuang commented on HDFS-15186:

[~avijayan] could you take a look at the patch? Or [~ayushtkn] please

> Erasure Coding: Decommission may generate the parity block's content with all 0 in some case
>
> Key: HDFS-15186
> URL: https://issues.apache.org/jira/browse/HDFS-15186
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, erasure-coding
> Affects Versions: 3.0.3, 3.2.1, 3.1.3
> Reporter: Yao Guangdong
> Assignee: Yao Guangdong
> Priority: Critical
> Attachments: HDFS-15186.001.patch
>
> I found parity blocks whose content is all zeros after decommissioning several
> DataNodes (more than 1) from a cluster. The probability is high (parts per
> thousand). This is a serious problem: if we read data from a zeroed parity block,
> or use a zeroed parity block to recover another block, we end up using wrong data
> without knowing it.
> Some example cases are listed below, where
> B: Busy DataNode,
> D: Decommissioning DataNode,
> and all other DataNodes are normal.
> 1. Group indices are [0, 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)].
> 2. Group indices are [0(B,D), 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)].
>
> In the first case, where the block group indices are [0, 1, 2, 3, 4, 5, 6(B,D),
> 7, 8(D)], the DN may receive a block reconstruction command with
> liveIndices=[0, 1, 2, 3, 4, 5, 7, 8] and a targets field (in the class
> StripedReconstructionInfo) of length 2.
> A targets length of 2 means that, in the current code, the DataNode needs to
> recover 2 internal blocks. But only 1 missing block can be found from liveIndices,
> so the method StripedWriter#initTargetIndices uses 0 as the default index of the
> other block to recover, regardless of whether index 0 is among the source indices.
> The EC algorithm is then invoked with source indices [0, 1, 2, 3, 4, 5] to recover
> indices [6, 0]. In this case index 0 is in both the source indices and the target
> indices, and the target buffer that the EC algorithm returns for index 6 is always
> all zeros. So at first I thought this was a problem in the EC algorithm, which
> should be more fault tolerant. I tried to fix it there, but that is too hard
> because there are too many cases. The second case in the example above is another
> one (source indices [1, 2, 3, 4, 5, 7] are used to recover indices [0, 6, 0]).
> So I changed my approach: invoke the EC algorithm with correct parameters, which
> means removing the duplicate target index 0 in this case. That is how I finally
> fixed it.
[jira] [Commented] (HDFS-15185) StartupProgress reports edits segments until the entire startup completes
[ https://issues.apache.org/jira/browse/HDFS-15185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041290#comment-17041290 ]

Chen Liang commented on HDFS-15185:
---

I have tested this fix on a real cluster; the patch did get rid of the excessive ByteString displays. +1 with the Jenkins warnings addressed.

> StartupProgress reports edits segments until the entire startup completes
> -
>
> Key: HDFS-15185
> URL: https://issues.apache.org/jira/browse/HDFS-15185
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 2.10.0
> Reporter: Konstantin Shvachko
> Assignee: Konstantin Shvachko
> Priority: Major
> Attachments: HDFS-15185.001.patch
>
> The Startup Progress page keeps reporting edits segments after the {{LOAD_EDITS}}
> stage is complete. New steps are added to StartupProgress during journal
> tailing until all startup phases are completed. This adds a lot of edits
> steps, since the {{SAFEMODE}} phase can take a long time on a large cluster.
> With fast tailing the segments are small, but the number of them is large:
> 160K. This makes the page take forever to load.
[jira] [Commented] (HDFS-15187) CORRUPT replica mismatch between namenodes after failover
[ https://issues.apache.org/jira/browse/HDFS-15187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041247#comment-17041247 ]

Hadoop QA commented on HDFS-15187:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 42s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 57s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 1s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 43s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 8s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 16m 4s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 50s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 39s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 1s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 57s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 57s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 40s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 4s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 58s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 53s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 38s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}114m 3s{color} | {color:red} hadoop-hdfs in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 31s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}178m 52s{color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.TestReconstructStripedFile |
| | hadoop.hdfs.TestRollingUpgrade |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.6 Server=19.03.6 Image:yetus/hadoop:c44943d1fc3 |
| JIRA Issue | HDFS-15187 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12994013/HDFS-15187-01.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux 2866567ae8e6 4.15.0-74-generic #84-Ubuntu SMP Thu Dec 19 08:06:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 181e6d0 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_232 |
| findbugs | v3.1.0-RC1 |
| unit | https://builds.apache.org/job/PreCommit-HDFS-Build/28814/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/28814/testReport/ |
| Max. process+thread count | 2995 (vs. ulimit of 5500) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
| Console output | https://builds.apache.org/job/PreCommit-HDFS-Build/28814/console |
| Powered by | Apache Yetus 0.8.0
[jira] [Commented] (HDFS-15187) CORRUPT replica mismatch between namenodes after failover
[ https://issues.apache.org/jira/browse/HDFS-15187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041212#comment-17041212 ]

Íñigo Goiri commented on HDFS-15187:

Thanks [~ayushtkn] for the patch.
* Can we add a javadoc to processAndHandleReportedBlock explaining what true and false mean?
* I would like a more descriptive name for "response."
* The code where we do the "continue" is a little hard to follow, what about:
{code}
private void processQueuedMessages(Iterable<ReportedBlockInfo> rbis)
    throws IOException {
  boolean response = true;
  for (ReportedBlockInfo rbi : rbis) {
    LOG.debug("Processing previously queued message {}", rbi);
    if (rbi.getReportedState() == null) {
      // This is a DELETE_BLOCK request
      DatanodeStorageInfo storageInfo = rbi.getStorageInfo();
      removeStoredBlock(getStoredBlock(rbi.getBlock()),
          storageInfo.getDatanodeDescriptor());
    } else if (!response) {
      // if the previous IBR processing was skipped, skip processing all
      // further IBR's so as to ensure same sequence of processing.
      queueReportedBlock(rbi.getStorageInfo(), rbi.getBlock(),
          rbi.getReportedState(), QUEUE_REASON_FUTURE_GENSTAMP);
    } else {
      response = processAndHandleReportedBlock(
          rbi.getStorageInfo(), rbi.getBlock(),
          rbi.getReportedState(), null);
    }
  }
}
{code}

> CORRUPT replica mismatch between namenodes after failover
> -
>
> Key: HDFS-15187
> URL: https://issues.apache.org/jira/browse/HDFS-15187
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Ayush Saxena
> Assignee: Ayush Saxena
> Priority: Critical
> Attachments: HDFS-15187-01.patch
>
> A corrupt replica identified by the Active Namenode isn't identified by the
> other Namenode after it fails over to Active, in the case where the replica
> was marked corrupt due to updatePipeline.
> Scenario to reproduce:
> 1. Create a file and, while writing, turn one datanode down to trigger an update pipeline.
> 2. Write some more data.
> 3. Close the file.
> 4. Turn the shut-down datanode back on.
> 5. The replica on that datanode will be identified as CORRUPT and the corrupt count will be 1.
> 6. Fail over to the other Namenode.
> 7. Wait for all pending IBR processing.
> 8. The corrupt count will not be the same, and FSCK won't show the corrupt replica.
> 9. Fail over back to the first Namenode.
> 10. The corrupt count and the corrupt replica will be there again.
> The two Namenodes report different results.
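The ordering rule behind the suggested processQueuedMessages (once one queued report is postponed, postpone every later one so the sequence is preserved) can be shown in isolation. The sketch below is a generic model with hypothetical names (`OrderedQueueSketch`, `drain`); it leaves out the DELETE_BLOCK branch, and it assumes that a `false` result means the message was postponed and must be kept for a later pass.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Generic model of order-preserving deferral; not BlockManager code. */
class OrderedQueueSketch {

  /**
   * Processes messages in order. Once tryProcess returns false for a message
   * (assumed here to mean "postponed"), that message and every later one are
   * re-queued without being processed, so their relative order is kept.
   *
   * @return the messages that still have to be processed later, in order
   */
  static <T> List<T> drain(Iterable<T> queued, Predicate<T> tryProcess) {
    List<T> requeued = new ArrayList<>();
    boolean allProcessedSoFar = true;
    for (T msg : queued) {
      if (allProcessedSoFar) {
        allProcessedSoFar = tryProcess.test(msg);
      }
      if (!allProcessedSoFar) {
        requeued.add(msg);
      }
    }
    return requeued;
  }

  public static void main(String[] args) {
    List<Integer> attempted = new ArrayList<>();
    // Pretend that messages with an id of 3 or more cannot be processed yet.
    List<Integer> later = drain(Arrays.asList(1, 2, 3, 4), id -> {
      attempted.add(id);
      return id < 3;
    });
    System.out.println("attempted=" + attempted + " requeued=" + later);
  }
}
```

In the example run, message 4 is never attempted even though nothing is known about it individually. That matches the intent stated in the code comment above of keeping the same processing sequence.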
[jira] [Commented] (HDFS-15177) Split datanode invalide block deletion, to avoid the FsDatasetImpl lock too much time.
[ https://issues.apache.org/jira/browse/HDFS-15177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17041204#comment-17041204 ]

Stephen O'Donnell commented on HDFS-15177:
--

Initially the lock used the default setting, but HDFS-12137 changed it to fair (hardcoded). In HDFS-15150 we have made it configurable.

> Split datanode invalide block deletion, to avoid the FsDatasetImpl lock too much time.
> --
>
> Key: HDFS-15177
> URL: https://issues.apache.org/jira/browse/HDFS-15177
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: datanode
> Reporter: zhuqi
> Assignee: zhuqi
> Priority: Major
> Attachments: image-2020-02-18-22-39-00-642.png, image-2020-02-18-22-51-28-624.png, image-2020-02-18-22-52-59-202.png, image-2020-02-18-22-55-38-661.png
>
> In our cluster, when many block pools share the same datanode and the datanode
> has about 30 storage dirs, the datanode receives delete commands containing too
> many blocks to delete, which causes the FsDatasetImpl lock to be held for too long.
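For context on what "fair" means here: Java's `ReentrantLock` takes an optional fairness flag, and the no-argument constructor creates a non-fair lock. The sketch below shows a lock whose fairness comes from a setting. The property name `sketch.dataset.lock.fair` and the class are hypothetical; the sketch does not reproduce the actual HDFS-12137 or HDFS-15150 changes or their configuration key.

```java
import java.util.concurrent.locks.ReentrantLock;

/** Illustrative only; the property name below is made up for this sketch. */
class ConfigurableFairLockSketch {

  /** Builds the lock with the requested fairness policy. */
  static ReentrantLock newDatasetLock(boolean fair) {
    // A fair lock hands the lock to the longest-waiting thread, which avoids
    // starvation but usually lowers throughput under contention.
    return new ReentrantLock(fair);
  }

  public static void main(String[] args) {
    // Hypothetical switch; defaults to fair when the property is not set.
    boolean fair = Boolean.parseBoolean(
        System.getProperty("sketch.dataset.lock.fair", "true"));
    ReentrantLock lock = newDatasetLock(fair);
    lock.lock();
    try {
      System.out.println("fair=" + lock.isFair());
    } finally {
      lock.unlock();
    }
  }
}
```

Whether fairness helps depends on the workload: with long critical sections, such as the bulk block deletion described in this issue, a fair lock keeps waiters from starving, but each waiter still has to sit out the full hold time.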
[jira] [Updated] (HDFS-15187) CORRUPT replica mismatch between namenodes after failover
[ https://issues.apache.org/jira/browse/HDFS-15187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ayush Saxena updated HDFS-15187:
    Status: Patch Available (was: Open)

> CORRUPT replica mismatch between namenodes after failover
> -
>
> Key: HDFS-15187
> URL: https://issues.apache.org/jira/browse/HDFS-15187
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Ayush Saxena
> Assignee: Ayush Saxena
> Priority: Critical
> Attachments: HDFS-15187-01.patch
>
> A corrupt replica identified by the Active Namenode isn't identified by the
> other Namenode after it fails over to Active, in the case where the replica
> was marked corrupt due to updatePipeline.
> Scenario to reproduce:
> 1. Create a file and, while writing, turn one datanode down to trigger an update pipeline.
> 2. Write some more data.
> 3. Close the file.
> 4. Turn the shut-down datanode back on.
> 5. The replica on that datanode will be identified as CORRUPT and the corrupt count will be 1.
> 6. Fail over to the other Namenode.
> 7. Wait for all pending IBR processing.
> 8. The corrupt count will not be the same, and FSCK won't show the corrupt replica.
> 9. Fail over back to the first Namenode.
> 10. The corrupt count and the corrupt replica will be there again.
> The two Namenodes report different results.
[jira] [Updated] (HDFS-15187) CORRUPT replica mismatch between namenodes after failover
[ https://issues.apache.org/jira/browse/HDFS-15187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ayush Saxena updated HDFS-15187:
    Attachment: HDFS-15187-01.patch

> CORRUPT replica mismatch between namenodes after failover
> -
>
> Key: HDFS-15187
> URL: https://issues.apache.org/jira/browse/HDFS-15187
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Ayush Saxena
> Assignee: Ayush Saxena
> Priority: Critical
> Attachments: HDFS-15187-01.patch
>
> A corrupt replica identified by the Active Namenode isn't identified by the
> other Namenode after it fails over to Active, in the case where the replica
> was marked corrupt due to updatePipeline.
> Scenario to reproduce:
> 1. Create a file and, while writing, turn one datanode down to trigger an update pipeline.
> 2. Write some more data.
> 3. Close the file.
> 4. Turn the shut-down datanode back on.
> 5. The replica on that datanode will be identified as CORRUPT and the corrupt count will be 1.
> 6. Fail over to the other Namenode.
> 7. Wait for all pending IBR processing.
> 8. The corrupt count will not be the same, and FSCK won't show the corrupt replica.
> 9. Fail over back to the first Namenode.
> 10. The corrupt count and the corrupt replica will be there again.
> The two Namenodes report different results.
[jira] [Created] (HDFS-15187) CORRUPT replica mismatch between namenodes after failover
Ayush Saxena created HDFS-15187:
---

Summary: CORRUPT replica mismatch between namenodes after failover
Key: HDFS-15187
URL: https://issues.apache.org/jira/browse/HDFS-15187
Project: Hadoop HDFS
Issue Type: Bug
Reporter: Ayush Saxena
Assignee: Ayush Saxena

A corrupt replica identified by the Active Namenode isn't identified by the other Namenode after it fails over to Active, in the case where the replica was marked corrupt due to updatePipeline.
Scenario to reproduce:
1. Create a file and, while writing, turn one datanode down to trigger an update pipeline.
2. Write some more data.
3. Close the file.
4. Turn the shut-down datanode back on.
5. The replica on that datanode will be identified as CORRUPT and the corrupt count will be 1.
6. Fail over to the other Namenode.
7. Wait for all pending IBR processing.
8. The corrupt count will not be the same, and FSCK won't show the corrupt replica.
9. Fail over back to the first Namenode.
10. The corrupt count and the corrupt replica will be there again.
The two Namenodes report different results.
[jira] [Commented] (HDFS-15186) Erasure Coding: Decommission may generate the parity block's content with all 0 in some case
[ https://issues.apache.org/jira/browse/HDFS-15186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17040973#comment-17040973 ]

Hadoop QA commented on HDFS-15186:
--

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 48s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 1s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 2s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 43s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 7s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 16m 10s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 49s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 40s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 3s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 55s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 55s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 38s{color} | {color:orange} hadoop-hdfs-project/hadoop-hdfs: The patch generated 6 new + 5 unchanged - 0 fixed = 11 total (was 5) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 3s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 14m 21s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 54s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 37s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:red}-1{color} | {color:red} unit {color} | {color:red}107m 18s{color} | {color:red} hadoop-hdfs in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 33s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}172m 53s{color} | {color:black} {color} |
\\
\\
|| Reason || Tests ||
| Failed junit tests | hadoop.hdfs.TestDeadNodeDetection |
| | hadoop.hdfs.server.blockmanagement.TestBlockStatsMXBean |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=19.03.6 Server=19.03.6 Image:yetus/hadoop:c44943d1fc3 |
| JIRA Issue | HDFS-15186 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12993977/HDFS-15186.001.patch |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux 22a273674b3b 4.15.0-74-generic #84-Ubuntu SMP Thu Dec 19 08:06:28 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / ec75071 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_232 |
| findbugs | v3.1.0-RC1 |
| checkstyle | https://builds.apache.org/job/PreCommit-HDFS-Build/28813/artifact/out/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt |
| unit | https://builds.apache.org/job/PreCommit-HDFS-Build/28813/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt |
| Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/28813/testReport/ |
| Max. process+thread
[jira] [Updated] (HDFS-15186) Erasure Coding: Decommission may generate the parity block's content with all 0 in some case
[ https://issues.apache.org/jira/browse/HDFS-15186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei-Chiu Chuang updated HDFS-15186:
-----------------------------------
Target Version/s: 3.3.0, 3.1.4, 3.2.2 (was: 3.3.0)

> Erasure Coding: Decommission may generate the parity block's content with all 0 in some case
> --------------------------------------------------------------------------------------------
>
> Key: HDFS-15186
> URL: https://issues.apache.org/jira/browse/HDFS-15186
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, erasure-coding
> Affects Versions: 3.0.3, 3.2.1, 3.1.3
> Reporter: Yao Guangdong
> Assignee: Yao Guangdong
> Priority: Critical
> Attachments: HDFS-15186.001.patch
>
>
> Some parity blocks end up with all-zero content when I decommission more than one DataNode from a cluster. The probability is high (on the order of parts per thousand). This is a serious problem: if we read data from a zeroed parity block, or use it to recover another block, we consume bad data without knowing it.
> Some example cases are below:
> B: Busy DataNode,
> D: Decommissioning DataNode,
> Others are normal.
> 1. Group indices are [0, 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)].
> 2. Group indices are [0(B,D), 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)].
>
> In the first case, where the block group indices are [0, 1, 2, 3, 4, 5, 6(B,D), 7, 8(D)], the DN may receive a reconstruct-block command with liveIndices=[0, 1, 2, 3, 4, 5, 7, 8] and a targets field (in the class StripedReconstructionInfo) of length 2.
> A targets length of 2 means that, in the current code, the DataNode needs to recover 2 internal blocks. But from the liveIndices we can find only 1 missing block, so the method StripedWriter#initTargetIndices uses 0 as the default block to recover, without checking whether index 0 is already among the source indices.
> The EC algorithm is then invoked with source indices [0, 1, 2, 3, 4, 5] to recover indices [6, 0]. Index 0 appears in both the source indices and the target indices in this case. The target buffer returned by the EC algorithm for index 6 is always 0. So I first thought this was a problem in the EC algorithm, which should be more fault tolerant. I tried to fix it there, but that is too hard because there are too many cases. The second case in the example above is another one (source indices [1, 2, 3, 4, 5, 7] are used to recover indices [0, 6, 0]). So I changed my approach: invoke the EC algorithm with correct parameters, which means removing the duplicate target index 0 in this case. In the end, I fixed it this way.
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
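The failure mode described in the HDFS-15186 report can be illustrated with a small, self-contained sketch. This is not the Hadoop source and not the attached patch: the class TargetIndicesSketch, its method names, and the fixed 9-block group (6 data + 3 parity, inferred from the indices 0..8 in the report) are assumptions made for illustration only. The sketch shows how a target-index array sized by the targets length keeps a default 0 in any slot that is never filled, and how trimming the array to the number of missing blocks actually found avoids passing that spurious index 0 to the decoder.

```java
import java.util.Arrays;
import java.util.BitSet;

// Illustrative model only; names and layout are hypothetical, not Hadoop's code.
class TargetIndicesSketch {
    // Assumed group size: indices 0..8, as in the report's examples.
    static final int TOTAL_BLOCKS = 9;

    // Builds a BitSet with the given live internal-block indices set.
    static BitSet liveSet(int... indices) {
        BitSet live = new BitSet(TOTAL_BLOCKS);
        for (int i : indices) {
            live.set(i);
        }
        return live;
    }

    // Fills a fixed-size array with the indices missing from the live set.
    // Slots that are never filled keep Java's default value 0, so index 0
    // can show up as a target even though it is live (or already listed).
    static short[] naiveTargetIndices(BitSet live, int targetsLength) {
        short[] out = new short[targetsLength];
        int m = 0;
        for (int i = 0; i < TOTAL_BLOCKS && m < targetsLength; i++) {
            if (!live.get(i)) {
                out[m++] = (short) i;
            }
        }
        return out;
    }

    // Same scan, but the result is trimmed to the number of missing blocks
    // actually found, so no default 0 entry is handed to the decoder.
    static short[] trimmedTargetIndices(BitSet live, int targetsLength) {
        short[] out = new short[targetsLength];
        int m = 0;
        for (int i = 0; i < TOTAL_BLOCKS && m < targetsLength; i++) {
            if (!live.get(i)) {
                out[m++] = (short) i;
            }
        }
        return Arrays.copyOf(out, m);
    }

    public static void main(String[] args) {
        // Case 1 from the report: liveIndices=[0,1,2,3,4,5,7,8], targets length 2.
        BitSet case1 = liveSet(0, 1, 2, 3, 4, 5, 7, 8);
        System.out.println(Arrays.toString(naiveTargetIndices(case1, 2)));   // [6, 0]
        System.out.println(Arrays.toString(trimmedTargetIndices(case1, 2))); // [6]

        // Case 2 (assumed live set [1,2,3,4,5,7,8], targets length 3).
        BitSet case2 = liveSet(1, 2, 3, 4, 5, 7, 8);
        System.out.println(Arrays.toString(naiveTargetIndices(case2, 3)));   // [0, 6, 0]
        System.out.println(Arrays.toString(trimmedTargetIndices(case2, 3))); // [0, 6]
    }
}
```

The naive results match the target indices quoted in the report ([6, 0] and [0, 6, 0]). Whether the actual fix trims the array, tracks a separate count, or deduplicates in another way is not stated in this thread, so the trimmed variant should be read as one possible shape of "remove the duplicate target index 0".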
[jira] [Updated] (HDFS-15186) Erasure Coding: Decommission may generate the parity block's content with all 0 in some case
[ https://issues.apache.org/jira/browse/HDFS-15186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei-Chiu Chuang updated HDFS-15186:
-----------------------------------
Fix Version/s: (was: 3.3.0)

> Erasure Coding: Decommission may generate the parity block's content with all 0 in some case
> --------------------------------------------------------------------------------------------
>
> Key: HDFS-15186
> URL: https://issues.apache.org/jira/browse/HDFS-15186
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, erasure-coding
> Affects Versions: 3.0.3, 3.2.1, 3.1.3
> Reporter: Yao Guangdong
> Assignee: Yao Guangdong
> Priority: Critical
> Attachments: HDFS-15186.001.patch
[jira] [Commented] (HDFS-15120) Refresh BlockPlacementPolicy at runtime.
[ https://issues.apache.org/jira/browse/HDFS-15120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17040876#comment-17040876 ]

Jinglun commented on HDFS-15120:
--------------------------------

The failed test is unrelated. Hi [~ayushtkn], could you help review v05? Thanks!

> Refresh BlockPlacementPolicy at runtime.
> ----------------------------------------
>
> Key: HDFS-15120
> URL: https://issues.apache.org/jira/browse/HDFS-15120
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Jinglun
> Assignee: Jinglun
> Priority: Major
> Attachments: HDFS-15120.001.patch, HDFS-15120.002.patch, HDFS-15120.003.patch, HDFS-15120.004.patch, HDFS-15120.005.patch
>
>
> Currently, switching BlockPlacementPolicies requires restarting the NameNode. It would be convenient to be able to switch at runtime. For example, we could switch between AvailableSpaceBlockPlacementPolicy and BlockPlacementPolicyDefault as needed.
[jira] [Assigned] (HDFS-15186) Erasure Coding: Decommission may generate the parity block's content with all 0 in some case
[ https://issues.apache.org/jira/browse/HDFS-15186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ayush Saxena reassigned HDFS-15186:
-----------------------------------
Assignee: Yao Guangdong

> Erasure Coding: Decommission may generate the parity block's content with all 0 in some case
> --------------------------------------------------------------------------------------------
>
> Key: HDFS-15186
> URL: https://issues.apache.org/jira/browse/HDFS-15186
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, erasure-coding
> Affects Versions: 3.0.3, 3.2.1, 3.1.3
> Reporter: Yao Guangdong
> Assignee: Yao Guangdong
> Priority: Critical
> Fix For: 3.3.0
> Attachments: HDFS-15186.001.patch
[jira] [Updated] (HDFS-15186) Erasure Coding: Decommission may generate the parity block's content with all 0 in some case
[ https://issues.apache.org/jira/browse/HDFS-15186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ayush Saxena updated HDFS-15186:
--------------------------------
Status: Patch Available (was: Open)

> Erasure Coding: Decommission may generate the parity block's content with all 0 in some case
> --------------------------------------------------------------------------------------------
>
> Key: HDFS-15186
> URL: https://issues.apache.org/jira/browse/HDFS-15186
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, erasure-coding
> Affects Versions: 3.1.3, 3.2.1, 3.0.3
> Reporter: Yao Guangdong
> Priority: Critical
> Fix For: 3.3.0
> Attachments: HDFS-15186.001.patch
[jira] [Commented] (HDFS-15177) Split datanode invalide block deletion, to avoid the FsDatasetImpl lock too much time.
[ https://issues.apache.org/jira/browse/HDFS-15177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17040816#comment-17040816 ]

zhuqi commented on HDFS-15177:
------------------------------

Hi [~sodonnell],

Thanks for your reply. I will monitor the FoldedTreeSet problem, such as HDFS-15131.

You said that on the 3.x branch the locking in the DN has been changed to a fair lock for some time now, but I find that AutoCloseableLock uses ReentrantLock, which defaults to NonfairSync. When does the DN use the fair lock?

> Split datanode invalide block deletion, to avoid the FsDatasetImpl lock too much time.
> --------------------------------------------------------------------------------------
>
> Key: HDFS-15177
> URL: https://issues.apache.org/jira/browse/HDFS-15177
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: datanode
> Reporter: zhuqi
> Assignee: zhuqi
> Priority: Major
> Attachments: image-2020-02-18-22-39-00-642.png, image-2020-02-18-22-51-28-624.png, image-2020-02-18-22-52-59-202.png, image-2020-02-18-22-55-38-661.png
>
>
> In our cluster, the DataNode receives delete commands covering too many blocks when many block pools share the same DataNode and the DataNode has about 30 storage dirs. This causes the FsDatasetImpl lock to be held for too long.
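The point about ReentrantLock defaulting to a non-fair policy can be checked against the JDK alone. The sketch below uses only java.util.concurrent.locks.ReentrantLock; the class name FairLockSketch is made up for illustration, and nothing here shows how the DataNode or AutoCloseableLock actually constructs its lock, which is the open question in the comment above.

```java
import java.util.concurrent.locks.ReentrantLock;

// JDK-only illustration; it does not reflect how Hadoop wires its locks.
class FairLockSketch {
    // The no-argument constructor creates a non-fair lock.
    static boolean defaultLockIsFair() {
        return new ReentrantLock().isFair();
    }

    // Passing true requests the fair ordering policy (longest-waiting thread first).
    static boolean explicitFairLockIsFair() {
        return new ReentrantLock(true).isFair();
    }

    public static void main(String[] args) {
        System.out.println("default fair?  " + defaultLockIsFair());      // false
        System.out.println("explicit fair? " + explicitFairLockIsFair()); // true
    }
}
```

A lock is therefore fair only if whoever constructs it passes the fairness flag, so the behavior depends on the construction site rather than on the wrapper that holds the lock.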
[jira] [Commented] (HDFS-15186) Erasure Coding: Decommission may generate the parity block's content with all 0 in some case
[ https://issues.apache.org/jira/browse/HDFS-15186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17040794#comment-17040794 ]

Yao Guangdong commented on HDFS-15186:
--------------------------------------

I have added the initial patch. Please take a look. [~weichiu] PTAL

> Erasure Coding: Decommission may generate the parity block's content with all 0 in some case
> --------------------------------------------------------------------------------------------
>
> Key: HDFS-15186
> URL: https://issues.apache.org/jira/browse/HDFS-15186
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, erasure-coding
> Affects Versions: 3.0.3, 3.2.1, 3.1.3
> Reporter: Yao Guangdong
> Priority: Critical
> Fix For: 3.3.0
> Attachments: HDFS-15186.001.patch
[jira] [Updated] (HDFS-15186) Erasure Coding: Decommission may generate the parity block's content with all 0 in some case
[ https://issues.apache.org/jira/browse/HDFS-15186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yao Guangdong updated HDFS-15186:
---------------------------------
Attachment: HDFS-15186.001.patch

> Erasure Coding: Decommission may generate the parity block's content with all 0 in some case
> --------------------------------------------------------------------------------------------
>
> Key: HDFS-15186
> URL: https://issues.apache.org/jira/browse/HDFS-15186
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, erasure-coding
> Affects Versions: 3.0.3, 3.2.1, 3.1.3
> Reporter: Yao Guangdong
> Priority: Critical
> Fix For: 3.3.0
> Attachments: HDFS-15186.001.patch
[jira] [Commented] (HDFS-15186) Erasure Coding: Decommission may generate the parity block's content with all 0 in some case
[ https://issues.apache.org/jira/browse/HDFS-15186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17040767#comment-17040767 ]

HuangTao commented on HDFS-15186:
---------------------------------

We have hit the same issue in production, and I have discussed it with [~yaoguangdong] offline. [~weichiu] PTAL

> Erasure Coding: Decommission may generate the parity block's content with all 0 in some case
> --------------------------------------------------------------------------------------------
>
> Key: HDFS-15186
> URL: https://issues.apache.org/jira/browse/HDFS-15186
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, erasure-coding
> Affects Versions: 3.0.3, 3.2.1, 3.1.3
> Reporter: Yao Guangdong
> Priority: Critical
> Fix For: 3.3.0
[jira] [Updated] (HDFS-15186) Erasure Coding: Decommission may generate the parity block's content with all 0 in some case
[ https://issues.apache.org/jira/browse/HDFS-15186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yao Guangdong updated HDFS-15186:
---------------------------------
Affects Version/s: (was: 3.0.2) 3.0.3 3.2.1 3.1.3

> Erasure Coding: Decommission may generate the parity block's content with all 0 in some case
> --------------------------------------------------------------------------------------------
>
> Key: HDFS-15186
> URL: https://issues.apache.org/jira/browse/HDFS-15186
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, erasure-coding
> Affects Versions: 3.0.3, 3.2.1, 3.1.3
> Reporter: Yao Guangdong
> Priority: Critical
> Fix For: 3.3.0
[jira] [Updated] (HDFS-15186) Erasure Coding: Decommission may generate the parity block's content with all 0 in some case
[ https://issues.apache.org/jira/browse/HDFS-15186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yao Guangdong updated HDFS-15186:
---------------------------------
Fix Version/s: (was: 3.2.2) (was: 3.1.4) (was: 3.0.4)

> Erasure Coding: Decommission may generate the parity block's content with all 0 in some case
> --------------------------------------------------------------------------------------------
>
> Key: HDFS-15186
> URL: https://issues.apache.org/jira/browse/HDFS-15186
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, erasure-coding
> Affects Versions: 3.0.2
> Reporter: Yao Guangdong
> Priority: Critical
> Fix For: 3.3.0
[jira] [Updated] (HDFS-15186) Erasure Coding: Decommission may generate the parity block's content with all 0 in some case
[ https://issues.apache.org/jira/browse/HDFS-15186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yao Guangdong updated HDFS-15186:
---------------------------------
Affects Version/s: (was: 3.1.3) (was: 3.2.1) (was: 3.0.3) 3.0.2

> Erasure Coding: Decommission may generate the parity block's content with all 0 in some case
> --------------------------------------------------------------------------------------------
>
> Key: HDFS-15186
> URL: https://issues.apache.org/jira/browse/HDFS-15186
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, erasure-coding
> Affects Versions: 3.0.2
> Reporter: Yao Guangdong
> Priority: Critical
> Fix For: 3.0.4, 3.3.0, 3.1.4, 3.2.2
[jira] [Created] (HDFS-15186) Erasure Coding: Decommission may generate the parity block's content with all 0 in some case
Yao Guangdong created HDFS-15186:
---------------------------------

Summary: Erasure Coding: Decommission may generate the parity block's content with all 0 in some case
Key: HDFS-15186
URL: https://issues.apache.org/jira/browse/HDFS-15186
Project: Hadoop HDFS
Issue Type: Bug
Components: datanode, erasure-coding
Affects Versions: 3.1.3, 3.2.1, 3.0.3
Reporter: Yao Guangdong
Fix For: 3.0.4, 3.3.0, 3.1.4, 3.2.2