[jira] [Updated] (HDDS-2255) Improve Acl Handler Messages
[ https://issues.apache.org/jira/browse/HDDS-2255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HDDS-2255: - Labels: newbie pull-request-available (was: newbie) > Improve Acl Handler Messages > > > Key: HDDS-2255 > URL: https://issues.apache.org/jira/browse/HDDS-2255 > Project: Hadoop Distributed Data Store > Issue Type: Improvement > Components: om >Reporter: Hanisha Koneru >Assignee: YiSheng Lien >Priority: Minor > Labels: newbie, pull-request-available > > In Add/Remove/Set Acl Key/Bucket/Volume Handlers, we print a message about > whether the operation was successful or not. If we are trying to add an ACL > that already exists, we convey the message that the operation failed. > It would be better if the message conveyed more clearly why the operation > failed, i.e. the ACL already exists. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work logged] (HDDS-2255) Improve Acl Handler Messages
[ https://issues.apache.org/jira/browse/HDDS-2255?focusedWorklogId=334892&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-334892 ] ASF GitHub Bot logged work on HDDS-2255: Author: ASF GitHub Bot Created on: 28/Oct/19 11:16 Start Date: 28/Oct/19 11:16 Worklog Time Spent: 10m Work Description: cxorm commented on pull request #94: HDDS-2255. Improve Acl Handler Messages URL: https://github.com/apache/hadoop-ozone/pull/94 ## What changes were proposed in this pull request? Add a ```checkAclExist()``` method in ```ObjectStore.java``` and have it called by ```addAclHandler``` to show the proper messages. ## What is the link to the Apache JIRA https://issues.apache.org/jira/browse/HDDS-2255 ## How was this patch tested? Build and check the message. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 334892) Remaining Estimate: 0h Time Spent: 10m > Improve Acl Handler Messages > > > Key: HDDS-2255 > URL: https://issues.apache.org/jira/browse/HDDS-2255 > Project: Hadoop Distributed Data Store > Issue Type: Improvement > Components: om >Reporter: Hanisha Koneru >Assignee: YiSheng Lien >Priority: Minor > Labels: newbie, pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > In Add/Remove/Set Acl Key/Bucket/Volume Handlers, we print a message about > whether the operation was successful or not. If we are trying to add an ACL > that already exists, we convey the message that the operation failed. > It would be better if the message conveyed more clearly why the operation > failed, i.e. the ACL already exists. 
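The direction proposed in the pull request above (check for an existing ACL first, then report the specific reason for failure) can be sketched as follows. This is an illustrative sketch only: `AclStore`, `addAcl()`, and the string ACL format are hypothetical stand-ins, not Ozone's actual `ObjectStore`/`OzoneAcl` API.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch: check whether the ACL is already present before
// adding it, and say so explicitly in the user-facing message on failure.
class AclStore {
    private final Set<String> acls = new LinkedHashSet<>();

    /** Returns a user-facing result message instead of a bare pass/fail. */
    String addAcl(String acl) {
        if (acls.contains(acl)) {
            // Old behavior: generic "operation failed". New: explain why.
            return "ACL add failed: ACL already exists: " + acl;
        }
        acls.add(acl);
        return "ACL added successfully: " + acl;
    }
}
```

The key point is that the duplicate check happens before the add, so the handler can distinguish "already exists" from other failure modes.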
[jira] [Assigned] (HDDS-2371) Print out the ozone version during the startup instead of hadoop version
[ https://issues.apache.org/jira/browse/HDDS-2371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] YiSheng Lien reassigned HDDS-2371: -- Assignee: YiSheng Lien > Print out the ozone version during the startup instead of hadoop version > > > Key: HDDS-2371 > URL: https://issues.apache.org/jira/browse/HDDS-2371 > Project: Hadoop Distributed Data Store > Issue Type: Improvement >Reporter: Marton Elek >Assignee: YiSheng Lien >Priority: Major > Labels: newbie > > Ozone components print out the current version during startup: > > {code:java} > STARTUP_MSG: Starting StorageContainerManager > STARTUP_MSG: host = om/10.8.0.145 > STARTUP_MSG: args = [] > STARTUP_MSG: version = 3.2.0 > STARTUP_MSG: build = https://github.com/apache/hadoop.git -r > e97acb3bd8f3befd27418996fa5d4b50bf2e17bf; compiled by 'sunilg' on > 2019-01-{code} > But as is visible, the build / compile information is about Hadoop, not > about hadoop-ozone. > (And personally I prefer to use a GitHub-compatible URL instead of the SVN > style -r. Something like: > {code:java} > STARTUP_MSG: build = > https://github.com/apache/hadoop-ozone/commit/8541c5694efebb58f53cf4665d3e4e6e4a12845c > ; compiled by '' on ...{code}
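One way to realize the request above is to load project-specific build metadata from a resource generated at build time, instead of falling back to Hadoop's `VersionInfo`. The sketch below is an assumption, not Ozone's actual implementation: the resource name `ozone-version-info.properties` and the property keys are illustrative.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

// Hypothetical sketch: build the startup banner from ozone-specific build
// metadata (version, repo URL, commit) kept in a properties resource that
// the build writes, rather than from Hadoop's own version information.
class OzoneVersionBanner {
    static String buildStartupMsg(Properties info, String host) {
        return "STARTUP_MSG: Starting StorageContainerManager\n"
            + "STARTUP_MSG:   host = " + host + "\n"
            + "STARTUP_MSG:   version = " + info.getProperty("version", "unknown") + "\n"
            // GitHub-style commit URL instead of the SVN-style "-r <sha>"
            + "STARTUP_MSG:   build = " + info.getProperty("url", "")
            + "/commit/" + info.getProperty("revision", "");
    }

    static Properties load() throws IOException {
        Properties p = new Properties();
        try (InputStream in = OzoneVersionBanner.class
                .getResourceAsStream("/ozone-version-info.properties")) {
            if (in != null) {
                p.load(in);
            }
        }
        return p;
    }
}
```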
[jira] [Updated] (HDFS-14937) [SBN read] ObserverReadProxyProvider should throw InterruptException
[ https://issues.apache.org/jira/browse/HDFS-14937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xuzq updated HDFS-14937: Attachment: HDFS-14937-trunk-001.patch Status: Patch Available (was: Open) > [SBN read] ObserverReadProxyProvider should throw InterruptException > > > Key: HDFS-14937 > URL: https://issues.apache.org/jira/browse/HDFS-14937 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: xuzq >Assignee: xuzq >Priority: Major > Attachments: HDFS-14937-trunk-001.patch > > > ObserverReadProxyProvider should throw InterruptedException immediately if an > Observer catches InterruptedException during an invocation.
[jira] [Created] (HDDS-2372) Datanode pipeline is failing with NoSuchFileException
Marton Elek created HDDS-2372: - Summary: Datanode pipeline is failing with NoSuchFileException Key: HDDS-2372 URL: https://issues.apache.org/jira/browse/HDDS-2372 Project: Hadoop Distributed Data Store Issue Type: Bug Reporter: Marton Elek Found it on a k8s-based test cluster using a simple 3-node cluster and the HDDS-2327 freon test. After a while the StateMachine became unhealthy after this error: {code:java} datanode-0 datanode java.util.concurrent.ExecutionException: java.util.concurrent.ExecutionException: org.apache.hadoop.hdds.scm.container.common.helpers.StorageContainerException: java.nio.file.NoSuchFileException: /data/storage/hdds/2a77fab9-9dc5-4f73-9501-b5347ac6145c/current/containerDir0/1/chunks/gGYYgiTTeg_testdata_chunk_13931.tmp.2.20830 {code} Can be reproduced.
[jira] [Created] (HDFS-14937) [SBN read] ObserverReadProxyProvider should throw InterruptException
xuzq created HDFS-14937: --- Summary: [SBN read] ObserverReadProxyProvider should throw InterruptException Key: HDFS-14937 URL: https://issues.apache.org/jira/browse/HDFS-14937 Project: Hadoop HDFS Issue Type: Improvement Reporter: xuzq Assignee: xuzq ObserverReadProxyProvider should throw InterruptedException immediately if an Observer catches InterruptedException during an invocation.
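The behavior HDFS-14937 asks for can be sketched as follows. This is not the actual `ObserverReadProxyProvider` code: `ObserverInvoker` and the `Callable`-based observer list are illustrative stand-ins. The point is that an `InterruptedException` aborts the observer loop immediately instead of being swallowed like an ordinary observer failure.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.Callable;

// Hypothetical sketch of the proposed behavior: while trying observers in
// turn, an InterruptedException must propagate immediately rather than
// causing a silent failover to the next observer.
class ObserverInvoker {
    static <T> T invokeOnObservers(List<Callable<T>> observers)
            throws IOException, InterruptedException {
        IOException lastFailure = null;
        for (Callable<T> observer : observers) {
            try {
                return observer.call();
            } catch (InterruptedException e) {
                // Propagate immediately: the caller asked us to stop.
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                // Any other failure: remember it and fail over to the next.
                lastFailure = new IOException("observer failed", e);
            }
        }
        throw lastFailure != null ? lastFailure
            : new IOException("no observers available");
    }
}
```

Re-setting the interrupt flag before rethrowing preserves the interruption status for callers that only inspect the flag.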
[jira] [Created] (HDDS-2371) Print out the ozone version during the startup instead of hadoop version
Marton Elek created HDDS-2371: - Summary: Print out the ozone version during the startup instead of hadoop version Key: HDDS-2371 URL: https://issues.apache.org/jira/browse/HDDS-2371 Project: Hadoop Distributed Data Store Issue Type: Improvement Reporter: Marton Elek Ozone components print out the current version during startup: {code:java} STARTUP_MSG: Starting StorageContainerManager STARTUP_MSG: host = om/10.8.0.145 STARTUP_MSG: args = [] STARTUP_MSG: version = 3.2.0 STARTUP_MSG: build = https://github.com/apache/hadoop.git -r e97acb3bd8f3befd27418996fa5d4b50bf2e17bf; compiled by 'sunilg' on 2019-01-{code} But as is visible, the build / compile information is about Hadoop, not about hadoop-ozone. (And personally I prefer to use a GitHub-compatible URL instead of the SVN style -r. Something like: {code:java} STARTUP_MSG: build = https://github.com/apache/hadoop-ozone/commit/8541c5694efebb58f53cf4665d3e4e6e4a12845c ; compiled by '' on ...{code}
[jira] [Commented] (HDFS-14768) EC : Busy DN replica should be consider in live replica check.
[ https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960879#comment-16960879 ] Hadoop QA commented on HDFS-14768: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 46s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 2 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 19m 42s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 59s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 48s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 6s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 14m 30s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 14s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 12s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 2s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 54s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 54s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 41s{color} | {color:green} hadoop-hdfs-project/hadoop-hdfs: The patch generated 0 new + 201 unchanged - 1 fixed = 201 total (was 202) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 2s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 13m 27s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 19s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 13s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red}103m 39s{color} | {color:red} hadoop-hdfs in the patch failed. 
{color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 33s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}165m 59s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.hdfs.server.namenode.ha.TestPipelinesFailover | | | hadoop.hdfs.server.namenode.ha.TestBootstrapStandby | \\ \\ || Subsystem || Report/Notes || | Docker | Client=19.03.4 Server=19.03.4 Image:yetus/hadoop:104ccca9169 | | JIRA Issue | HDFS-14768 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12984142/HDFS-14768.010.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux 50281e41fa8c 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 05:24:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 7be5508 | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_222 | | findbugs | v3.1.0-RC1 | | unit | https://builds.apache.org/job/PreCommit-HDFS-Build/28188/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt | | Test Results | https://builds.apache.org/job/PreCommit-HDFS-Build/28188/testReport/ | | Max. process+thread count | 2669 (vs. ulimit of 5500) | | modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs | |
[jira] [Reopened] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline
[ https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Li Cheng reopened HDDS-2356: Assignee: (was: Bharat Viswanadham) > Multipart upload report errors while writing to ozone Ratis pipeline > > > Key: HDDS-2356 > URL: https://issues.apache.org/jira/browse/HDDS-2356 > Project: Hadoop Distributed Data Store > Issue Type: Bug > Components: Ozone Manager >Affects Versions: 0.4.1 > Environment: Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM > on a separate VM >Reporter: Li Cheng >Priority: Blocker > Fix For: 0.5.0 > > > Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say > it's VM0. > I use goofys as a FUSE client and enable the Ozone S3 gateway to mount Ozone to a path > on VM0, reading data from VM0's local disk and writing to the mount path. The > dataset has files of various sizes, from 0 bytes to GB-level, and contains > ~50,000 files. > The writing is slow (1GB in ~10 mins) and it stops after around 4GB. As I > look at the hadoop-root-om-VM_50_210_centos.out log, I see OM throwing errors > related to Multipart upload. This error eventually causes the writing to > terminate and OM to shut down. 
> > 2019-10-24 16:01:59,527 [OMDoubleBufferFlushThread] ERROR - Terminating with > exit status 2: OMDoubleBuffer flush > thread OMDoubleBufferFlushThread encountered Throwable error > java.util.ConcurrentModificationException > at java.util.TreeMap.forEach(TreeMap.java:1004) > at > org.apache.hadoop.ozone.om.helpers.OmMultipartKeyInfo.getProto(OmMultipartKeyInfo.java:111) > at > org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:38) > at > org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:31) > at > org.apache.hadoop.hdds.utils.db.CodecRegistry.asRawData(CodecRegistry.java:68) > at > org.apache.hadoop.hdds.utils.db.TypedTable.putWithBatch(TypedTable.java:125) > at > org.apache.hadoop.ozone.om.response.s3.multipart.S3MultipartUploadCommitPartResponse.addToDBBatch(S3MultipartUploadCommitPartResponse.java:112) > at > org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.lambda$flushTransactions$0(OzoneManagerDoubleBuffer.java:137) > at java.util.Iterator.forEachRemaining(Iterator.java:116) > at > org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushTransactions(OzoneManagerDoubleBuffer.java:135) > at java.lang.Thread.run(Thread.java:745) > 2019-10-24 16:01:59,629 [shutdown-hook-0] INFO - SHUTDOWN_MSG:
[jira] [Commented] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline
[ https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16960870#comment-16960870 ] Li Cheng commented on HDDS-2356: I'm using this Jira to track issues that seem to be related to Multipart upload in my testing. Reopening it. > Multipart upload report errors while writing to ozone Ratis pipeline > > > Key: HDDS-2356 > URL: https://issues.apache.org/jira/browse/HDDS-2356 > Project: Hadoop Distributed Data Store > Issue Type: Bug > Components: Ozone Manager >Affects Versions: 0.4.1 > Environment: Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM > on a separate VM >Reporter: Li Cheng >Assignee: Bharat Viswanadham >Priority: Blocker > Fix For: 0.5.0 > > > Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say > it's VM0. > I use goofys as a FUSE client and enable the Ozone S3 gateway to mount Ozone to a path > on VM0, reading data from VM0's local disk and writing to the mount path. The > dataset has files of various sizes, from 0 bytes to GB-level, and contains > ~50,000 files. > The writing is slow (1GB in ~10 mins) and it stops after around 4GB. As I > look at the hadoop-root-om-VM_50_210_centos.out log, I see OM throwing errors > related to Multipart upload. This error eventually causes the writing to > terminate and OM to shut down. 
> > 2019-10-24 16:01:59,527 [OMDoubleBufferFlushThread] ERROR - Terminating with > exit status 2: OMDoubleBuffer flush > thread OMDoubleBufferFlushThread encountered Throwable error > java.util.ConcurrentModificationException > at java.util.TreeMap.forEach(TreeMap.java:1004) > at > org.apache.hadoop.ozone.om.helpers.OmMultipartKeyInfo.getProto(OmMultipartKeyInfo.java:111) > at > org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:38) > at > org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:31) > at > org.apache.hadoop.hdds.utils.db.CodecRegistry.asRawData(CodecRegistry.java:68) > at > org.apache.hadoop.hdds.utils.db.TypedTable.putWithBatch(TypedTable.java:125) > at > org.apache.hadoop.ozone.om.response.s3.multipart.S3MultipartUploadCommitPartResponse.addToDBBatch(S3MultipartUploadCommitPartResponse.java:112) > at > org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.lambda$flushTransactions$0(OzoneManagerDoubleBuffer.java:137) > at java.util.Iterator.forEachRemaining(Iterator.java:116) > at > org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushTransactions(OzoneManagerDoubleBuffer.java:135) > at java.lang.Thread.run(Thread.java:745) > 2019-10-24 16:01:59,629 [shutdown-hook-0] INFO - SHUTDOWN_MSG:
[jira] [Commented] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline
[ https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16960866#comment-16960866 ] Li Cheng commented on HDDS-2356: MISMATCH_MULTIPART_LIST seems to be a recurring error. The write is never able to finish. > Multipart upload report errors while writing to ozone Ratis pipeline > > > Key: HDDS-2356 > URL: https://issues.apache.org/jira/browse/HDDS-2356 > Project: Hadoop Distributed Data Store > Issue Type: Bug > Components: Ozone Manager >Affects Versions: 0.4.1 > Environment: Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM > on a separate VM >Reporter: Li Cheng >Assignee: Bharat Viswanadham >Priority: Blocker > Fix For: 0.5.0 > > > Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say > it's VM0. > I use goofys as a FUSE client and enable the Ozone S3 gateway to mount Ozone to a path > on VM0, reading data from VM0's local disk and writing to the mount path. The > dataset has files of various sizes, from 0 bytes to GB-level, and contains > ~50,000 files. > The writing is slow (1GB in ~10 mins) and it stops after around 4GB. As I > look at the hadoop-root-om-VM_50_210_centos.out log, I see OM throwing errors > related to Multipart upload. This error eventually causes the writing to > terminate and OM to shut down. 
> > 2019-10-24 16:01:59,527 [OMDoubleBufferFlushThread] ERROR - Terminating with > exit status 2: OMDoubleBuffer flush > thread OMDoubleBufferFlushThread encountered Throwable error > java.util.ConcurrentModificationException > at java.util.TreeMap.forEach(TreeMap.java:1004) > at > org.apache.hadoop.ozone.om.helpers.OmMultipartKeyInfo.getProto(OmMultipartKeyInfo.java:111) > at > org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:38) > at > org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:31) > at > org.apache.hadoop.hdds.utils.db.CodecRegistry.asRawData(CodecRegistry.java:68) > at > org.apache.hadoop.hdds.utils.db.TypedTable.putWithBatch(TypedTable.java:125) > at > org.apache.hadoop.ozone.om.response.s3.multipart.S3MultipartUploadCommitPartResponse.addToDBBatch(S3MultipartUploadCommitPartResponse.java:112) > at > org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.lambda$flushTransactions$0(OzoneManagerDoubleBuffer.java:137) > at java.util.Iterator.forEachRemaining(Iterator.java:116) > at > org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushTransactions(OzoneManagerDoubleBuffer.java:135) > at java.lang.Thread.run(Thread.java:745) > 2019-10-24 16:01:59,629 [shutdown-hook-0] INFO - SHUTDOWN_MSG:
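The `ConcurrentModificationException` in the stack trace above comes from one thread iterating a `TreeMap` (`TreeMap.forEach` inside `OmMultipartKeyInfo.getProto`) while another thread mutates it. `TreeMap` iteration is fail-fast, so one common remedy is to guard both the mutation and a snapshot copy with the same lock and serialize from the snapshot. The sketch below is illustrative: the class and method names are stand-ins, not the actual `OmMultipartKeyInfo` API, and the real fix in Ozone may differ.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: the flush thread iterates the part list while the
// handler thread may still be adding parts, which throws
// ConcurrentModificationException. Taking a snapshot under a lock gives
// the flush thread a private copy that is safe to iterate.
class MultipartKeyInfo {
    private final TreeMap<Integer, String> partList = new TreeMap<>();

    synchronized void addPart(int partNumber, String partKeyInfo) {
        partList.put(partNumber, partKeyInfo);
    }

    /** Snapshot taken under the lock; safe to iterate without the lock. */
    synchronized Map<Integer, String> snapshotParts() {
        return new TreeMap<>(partList);
    }
}
```

The flush path would then serialize from `snapshotParts()` instead of touching the live map.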
[jira] [Updated] (HDFS-14936) Add getNumOfChildren() for interface InnerNode
[ https://issues.apache.org/jira/browse/HDFS-14936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lisheng Sun updated HDFS-14936: --- Attachment: HDFS-14936.001.patch Status: Patch Available (was: Open) > Add getNumOfChildren() for interface InnerNode > -- > > Key: HDFS-14936 > URL: https://issues.apache.org/jira/browse/HDFS-14936 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Lisheng Sun >Priority: Minor > Attachments: HDFS-14936.001.patch > > > In the current code, the InnerNode subclasses InnerNodeImpl and DFSTopologyNodeImpl both > have getNumOfChildren(), > so add getNumOfChildren() to the InnerNode interface and remove the unnecessary > getNumOfChildren() in DFSTopologyNodeImpl.
[jira] [Created] (HDFS-14936) Add getNumOfChildren() for interface InnerNode
Lisheng Sun created HDFS-14936: -- Summary: Add getNumOfChildren() for interface InnerNode Key: HDFS-14936 URL: https://issues.apache.org/jira/browse/HDFS-14936 Project: Hadoop HDFS Issue Type: Improvement Reporter: Lisheng Sun In the current code, the InnerNode subclasses InnerNodeImpl and DFSTopologyNodeImpl both have getNumOfChildren(), so add getNumOfChildren() to the InnerNode interface and remove the unnecessary getNumOfChildren() in DFSTopologyNodeImpl.
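The refactoring proposed above can be sketched as follows. The class bodies are simplified stand-ins for the real HDFS `InnerNodeImpl`/`DFSTopologyNodeImpl`; only the shape of the change (the method declared once on the interface, the duplicate removed from the subclass) reflects the proposal.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: declare getNumOfChildren() on the InnerNode interface so all
// subclasses share one contract and one implementation.
interface InnerNode {
    int getNumOfChildren();
}

class InnerNodeImpl implements InnerNode {
    protected final List<Object> children = new ArrayList<>();

    void add(Object child) {
        children.add(child);
    }

    @Override
    public int getNumOfChildren() {
        return children.size();
    }
}

// After the change, DFSTopologyNodeImpl no longer carries its own copy of
// getNumOfChildren(); it inherits the single implementation above.
class DFSTopologyNodeImpl extends InnerNodeImpl {
}
```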
[jira] [Commented] (HDFS-14768) EC : Busy DN replica should be consider in live replica check.
[ https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16960850#comment-16960850 ] guojh commented on HDFS-14768: -- [~surendrasingh] Thanks for your UT. I updated the code and fixed the checkstyle error. Please review it. Thank you very much! > EC : Busy DN replica should be consider in live replica check. > -- > > Key: HDFS-14768 > URL: https://issues.apache.org/jira/browse/HDFS-14768 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode, erasure-coding, hdfs, namenode >Affects Versions: 3.0.2 >Reporter: guojh >Assignee: guojh >Priority: Major > Labels: patch > Attachments: 1568275810244.jpg, 1568276338275.jpg, 1568771471942.jpg, > HDFS-14768.000.patch, HDFS-14768.001.patch, HDFS-14768.002.patch, > HDFS-14768.003.patch, HDFS-14768.004.patch, HDFS-14768.005.patch, > HDFS-14768.006.patch, HDFS-14768.007.patch, HDFS-14768.008.patch, > HDFS-14768.009.patch, HDFS-14768.010.patch, HDFS-14768.jpg, > guojh_UT_after_deomission.txt, guojh_UT_before_deomission.txt, > zhaoyiming_UT_after_deomission.txt, zhaoyiming_UT_beofre_deomission.txt > > > Policy is RS-6-3-1024K, version is hadoop 3.0.2; > We suppose a file's block Index is [0,1,2,3,4,5,6,7,8], and we decommission > index [3,4] and increase the index-6 datanode's > pendingReplicationWithoutTargets so that it is larger than > replicationStreamsHardLimit (we set 14). Then, after the method > chooseSourceDatanodes of BlockManager, the liveBlockIndices is > [0,1,2,3,4,5,7,8] and the block counters are Live: 7, Decommission: 2. > In method scheduleReconstruction of BlockManager, the additionalReplRequired > is 9 - 7 = 2. After the Namenode chooses two target Datanodes, it will assign an > erasure coding task to the target datanodes. > When the datanode gets the task, it will build targetIndices from liveBlockIndices > and the target length. The code is below. 
> {code:java} > // code placeholder > targetIndices = new short[targets.length]; > private void initTargetIndices() { > BitSet bitset = reconstructor.getLiveBitSet(); > int m = 0; hasValidTargets = false; > for (int i = 0; i < dataBlkNum + parityBlkNum; i++) { > if (!bitset.get(i)) { > if (reconstructor.getBlockLen(i) > 0) { > if (m < targets.length) { > targetIndices[m++] = (short)i; > hasValidTargets = true; > } > } > } > } > {code} > targetIndices[0]=6, and targetIndices[1] is always 0 from its initial value. > The StripedReader always creates readers from the first 6 block indices, i.e. > [0,1,2,3,4,5]. > Using the indices [0,1,2,3,4,5] to build target index [6,0] will trigger the ISA-L > bug: block index 6's data is corrupted (all data is zero). > I wrote a unit test that can stably reproduce it. > {code:java} > // code placeholder > private int replicationStreamsHardLimit = > DFSConfigKeys.DFS_NAMENODE_REPLICATION_STREAMS_HARD_LIMIT_DEFAULT; > numDNs = dataBlocks + parityBlocks + 10; > @Test(timeout = 24) > public void testFileDecommission() throws Exception { > LOG.info("Starting test testFileDecommission"); > final Path ecFile = new Path(ecDir, "testFileDecommission"); > int writeBytes = cellSize * dataBlocks; > writeStripedFile(dfs, ecFile, writeBytes); > Assert.assertEquals(0, bm.numOfUnderReplicatedBlocks()); > FileChecksum fileChecksum1 = dfs.getFileChecksum(ecFile, writeBytes); > final INodeFile fileNode = cluster.getNamesystem().getFSDirectory() > .getINode4Write(ecFile.toString()).asFile(); > LocatedBlocks locatedBlocks = > StripedFileTestUtil.getLocatedBlocks(ecFile, dfs); > LocatedBlock lb = dfs.getClient().getLocatedBlocks(ecFile.toString(), 0) > .get(0); > DatanodeInfo[] dnLocs = lb.getLocations(); > LocatedStripedBlock lastBlock = > (LocatedStripedBlock)locatedBlocks.getLastLocatedBlock(); > DatanodeInfo[] storageInfos = lastBlock.getLocations(); > // > DatanodeDescriptor datanodeDescriptor = > cluster.getNameNode().getNamesystem() > > 
.getBlockManager().getDatanodeManager().getDatanode(storageInfos[6].getDatanodeUuid()); > BlockInfo firstBlock = fileNode.getBlocks()[0]; > DatanodeStorageInfo[] dStorageInfos = bm.getStorages(firstBlock); > // the first heartbeat will consume 3 replica tasks > for (int i = 0; i <= replicationStreamsHardLimit + 3; i++) { > BlockManagerTestUtil.addBlockToBeReplicated(datanodeDescriptor, new > Block(i), > new DatanodeStorageInfo[]{dStorageInfos[0]}); > } > assertEquals(dataBlocks + parityBlocks, dnLocs.length); > int[] decommNodeIndex = {3, 4}; > final List decommisionNodes = new ArrayList(); > // add the node which will be decommissioning > decommisionNodes.add(dnLocs[decommNodeIndex[0]]); > decommisionNodes.add(dnLocs[decommNodeIndex[1]]); >
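The failure mode described above can be illustrated in isolation: when the "live" index set wrongly accounts for a busy replica, the missing-index loop finds fewer entries than `targets.length`, and a fixed-size `short[]` leaves trailing slots at their default value 0, so block index 0 gets "reconstructed" over a healthy block. The sketch below is not the HDFS `StripedReconstructor` code; it only demonstrates why collecting into a list sized by what was actually found avoids the stale-zero problem.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Sketch: compute the indices that genuinely need reconstruction. Unlike a
// pre-sized short[] (whose unfilled slots stay 0 and alias block index 0),
// a list can never report an index that was not actually missing.
class TargetIndexCalc {
    static List<Integer> missingIndices(BitSet liveIndices, int totalBlocks,
                                        int maxTargets) {
        List<Integer> targets = new ArrayList<>();
        for (int i = 0; i < totalBlocks && targets.size() < maxTargets; i++) {
            if (!liveIndices.get(i)) {
                targets.add(i);  // only genuinely missing indices
            }
        }
        return targets;
    }
}
```

In the RS-6-3 scenario from the report (9 block indices, only index 6 missing, 2 targets requested), this yields `[6]` rather than `[6, 0]`.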
[jira] [Comment Edited] (HDFS-14907) [Dynamometer] DataNode can't find junit jar when using Hadoop-3 binary
[ https://issues.apache.org/jira/browse/HDFS-14907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16960845#comment-16960845 ] Takanobu Asanuma edited comment on HDFS-14907 at 10/28/19 8:07 AM: --- Thanks for your advice, [~xkrogen]. start-component.sh looks like a good place. I sent a PR. I've confirmed that dyno-datanodes run successfully with the PR. was (Author: tasanuma0829): Thanks for your advice, [~xkrogen]. start-component.sh looks like a good place. I sent a PR. > [Dynamometer] DataNode can't find junit jar when using Hadoop-3 binary > -- > > Key: HDFS-14907 > URL: https://issues.apache.org/jira/browse/HDFS-14907 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Takanobu Asanuma >Assignee: Takanobu Asanuma >Priority: Major > > When executing {{start-dynamometer-cluster.sh}} with Hadoop-3 binary, > datanodes fail to run with the following log and > {{start-dynamometer-cluster.sh}} fails. > {noformat} > LogType:stderr > LogLastModifiedTime:Wed Oct 09 15:03:09 +0900 2019 > LogLength:1386 > LogContents: > Exception in thread "main" java.lang.NoClassDefFoundError: org/junit/Assert > at > org.apache.hadoop.test.GenericTestUtils.assertExists(GenericTestUtils.java:299) > at > org.apache.hadoop.test.GenericTestUtils.getTestDir(GenericTestUtils.java:243) > at > org.apache.hadoop.test.GenericTestUtils.getTestDir(GenericTestUtils.java:252) > at > org.apache.hadoop.hdfs.MiniDFSCluster.getBaseDirectory(MiniDFSCluster.java:2982) > at > org.apache.hadoop.hdfs.MiniDFSCluster.determineDfsBaseDir(MiniDFSCluster.java:2972) > at > org.apache.hadoop.hdfs.MiniDFSCluster.formatDataNodeDirs(MiniDFSCluster.java:2834) > at > org.apache.hadoop.tools.dynamometer.SimulatedDataNodes.run(SimulatedDataNodes.java:123) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) > at > org.apache.hadoop.tools.dynamometer.SimulatedDataNodes.main(SimulatedDataNodes.java:88) > Caused by: java.lang.ClassNotFoundException: org.junit.Assert > at 
java.net.URLClassLoader.findClass(URLClassLoader.java:382) > at java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349) > at java.lang.ClassLoader.loadClass(ClassLoader.java:357) > ... 9 more > ./start-component.sh: line 317: kill: (2261) - No such process > {noformat}
[jira] [Updated] (HDFS-14907) [Dynamometer] DataNode can't find junit jar when using Hadoop-3 binary
[ https://issues.apache.org/jira/browse/HDFS-14907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takanobu Asanuma updated HDFS-14907: Status: Patch Available (was: Open) > [Dynamometer] DataNode can't find junit jar when using Hadoop-3 binary > -- > > Key: HDFS-14907 > URL: https://issues.apache.org/jira/browse/HDFS-14907 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Takanobu Asanuma >Assignee: Takanobu Asanuma >Priority: Major > > When executing {{start-dynamometer-cluster.sh}} with Hadoop-3 binary, > datanodes fail to run with the following log and > {{start-dynamometer-cluster.sh}} fails. > {noformat} > LogType:stderr > LogLastModifiedTime:Wed Oct 09 15:03:09 +0900 2019 > LogLength:1386 > LogContents: > Exception in thread "main" java.lang.NoClassDefFoundError: org/junit/Assert > at > org.apache.hadoop.test.GenericTestUtils.assertExists(GenericTestUtils.java:299) > at > org.apache.hadoop.test.GenericTestUtils.getTestDir(GenericTestUtils.java:243) > at > org.apache.hadoop.test.GenericTestUtils.getTestDir(GenericTestUtils.java:252) > at > org.apache.hadoop.hdfs.MiniDFSCluster.getBaseDirectory(MiniDFSCluster.java:2982) > at > org.apache.hadoop.hdfs.MiniDFSCluster.determineDfsBaseDir(MiniDFSCluster.java:2972) > at > org.apache.hadoop.hdfs.MiniDFSCluster.formatDataNodeDirs(MiniDFSCluster.java:2834) > at > org.apache.hadoop.tools.dynamometer.SimulatedDataNodes.run(SimulatedDataNodes.java:123) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) > at > org.apache.hadoop.tools.dynamometer.SimulatedDataNodes.main(SimulatedDataNodes.java:88) > Caused by: java.lang.ClassNotFoundException: org.junit.Assert > at java.net.URLClassLoader.findClass(URLClassLoader.java:382) > at java.lang.ClassLoader.loadClass(ClassLoader.java:424) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349) > at java.lang.ClassLoader.loadClass(ClassLoader.java:357) > ... 
9 more > ./start-component.sh: line 317: kill: (2261) - No such process > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-14907) [Dynamometer] DataNode can't find junit jar when using Hadoop-3 binary
[ https://issues.apache.org/jira/browse/HDFS-14907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960845#comment-16960845 ] Takanobu Asanuma commented on HDFS-14907: - Thanks for your advice, [~xkrogen]. start-component.sh looks like a good place. I sent a PR.
[jira] [Commented] (HDDS-2370) Remove classpath in RunningWithHDFS.md ozone-hdfs/docker-compose as dir 'ozoneplugin' is not exist anymore
[ https://issues.apache.org/jira/browse/HDDS-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960844#comment-16960844 ] luhuachao commented on HDDS-2370: - [~adoroszlai] Thanks for the reply. I would like to work on this. > Remove classpath in RunningWithHDFS.md ozone-hdfs/docker-compose as dir > 'ozoneplugin' does not exist anymore > -- > > Key: HDDS-2370 > URL: https://issues.apache.org/jira/browse/HDDS-2370 > Project: Hadoop Distributed Data Store > Issue Type: Task > Components: documentation >Reporter: luhuachao >Priority: Major > Attachments: HDDS-2370.1.patch > > > In RunningWithHDFS.md: > {code:java} > export > HADOOP_CLASSPATH=/opt/ozone/share/hadoop/ozoneplugin/hadoop-ozone-datanode-plugin.jar{code} > In ozone-hdfs/docker-compose.yaml: > {code:java} > environment: > HADOOP_CLASSPATH: /opt/ozone/share/hadoop/ozoneplugin/*.jar > {code} > When I run HddsDatanodeService as a plugin in the HDFS datanode, it fails with > the error below; there is no constructor without parameters. > {code:java} > 2019-10-21 21:38:56,391 ERROR datanode.DataNode > (DataNode.java:startPlugins(972)) - Unable to load DataNode plugins. > Specified list of plugins: org.apache.hadoop.ozone.HddsDatanodeService > java.lang.RuntimeException: java.lang.NoSuchMethodException: > org.apache.hadoop.ozone.HddsDatanodeService.() > {code} > What I suspect is that ozone-0.5 no longer supports running as a plugin in the HDFS > datanode. If so, why don't we remove the doc RunningWithHDFS.md?
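The NoSuchMethodException above comes from reflective plugin instantiation: the DataNode's plugin loader (roughly speaking) creates each configured class through its no-argument constructor, so a service class without one cannot be loaded this way. The loader below is an illustrative sketch of that failure mode, not the actual Hadoop code:

```java
// Illustrative sketch of reflective plugin loading, assuming (as plugin
// mechanisms like Hadoop's typically do) a no-arg constructor per class.
public class PluginLoaderSketch {
    static Object load(String className) throws Exception {
        Class<?> clazz = Class.forName(className);
        // getDeclaredConstructor() with no arguments throws
        // NoSuchMethodException when the class lacks a no-arg constructor,
        // matching the error reported for HddsDatanodeService.
        return clazz.getDeclaredConstructor().newInstance();
    }

    public static void main(String[] args) throws Exception {
        // StringBuilder has a no-arg constructor: loads fine.
        System.out.println(load("java.lang.StringBuilder").getClass().getName());
        try {
            // Integer has no no-arg constructor: fails like the plugin above.
            load("java.lang.Integer");
        } catch (NoSuchMethodException e) {
            System.out.println("NoSuchMethodException as expected");
        }
    }
}
```

This is why the comment suggests fixing the plugin class itself (or its loading) rather than only editing the docs.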
[jira] [Assigned] (HDFS-14907) [Dynamometer] DataNode can't find junit jar when using Hadoop-3 binary
[ https://issues.apache.org/jira/browse/HDFS-14907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takanobu Asanuma reassigned HDFS-14907: --- Assignee: Takanobu Asanuma
[jira] [Comment Edited] (HDFS-14920) Erasure Coding: Decommission may hang If one or more datanodes are out of service during decommission
[ https://issues.apache.org/jira/browse/HDFS-14920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960829#comment-16960829 ] Fei Hui edited comment on HDFS-14920 at 10/28/19 7:58 AM: -- {quote} other storages contains this internal block should be decommissioning {quote} This comment was wrong; I have fixed it. The function *countReplicasForStripedBlock* recomputes the LIVE replica count for the same internal block. One caller uses it as follows: {code} // Count replicas on decommissioning nodes, as these will not be // decommissioned unless recovery/completing last block has finished NumberReplicas numReplicas = countNodes(lastBlock); int numUsableReplicas = numReplicas.liveReplicas() + numReplicas.decommissioning() + numReplicas.liveEnteringMaintenanceReplicas(); {code} If the same internal block were counted both in liveReplicas and in decommissioning replicas, the sum numReplicas.liveReplicas() + numReplicas.decommissioning() would not make sense. So the same internal block should be either in liveReplicas or in decommissioning replicas, but not both. was (Author: ferhui): {quote} other storages contains this internal block should be decommissioning {quote} This comment is error, have modified it. The function *countReplicasForStripedBlock* is used for recomputing the LIVE replica for the same internal block. One case use it. {code} // Count replicas on decommissioning nodes, as these will not be // decommissioned unless recovery/completing last block has finished NumberReplicas numReplicas = countNodes(lastBlock); int numUsableReplicas = numReplicas.liveReplicas() + numReplicas.decommissioning() + numReplicas.liveEnteringMaintenanceReplicas(); {code} I think if the same internal block is contains liveReplicas, and it is also contains decommissioning replicas. numReplicas.liveReplicas() + numReplicas.decommissioning() will not make sense. 
So I think the same internal block is ether in liveReplicas or in decommissioning replicas, but not both. > Erasure Coding: Decommission may hang If one or more datanodes are out of > service during decommission > --- > > Key: HDFS-14920 > URL: https://issues.apache.org/jira/browse/HDFS-14920 > Project: Hadoop HDFS > Issue Type: Bug > Components: ec >Affects Versions: 3.0.3, 3.2.1, 3.1.3 >Reporter: Fei Hui >Assignee: Fei Hui >Priority: Major > Attachments: HDFS-14920.001.patch, HDFS-14920.002.patch > > > Decommission test hangs in our clusters. > Have seen the messages as follow > {quote} > 2019-10-22 15:58:51,514 TRACE > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager: Block > blk_-9223372035600425840_372987973 numExpected=9, numLive=5 > 2019-10-22 15:58:51,514 INFO BlockStateChange: Block: > blk_-9223372035600425840_372987973, Expected Replicas: 9, live replicas: 5, > corrupt replicas: 0, decommissioned replicas: 0, decommissioning replicas: 4, > maintenance replicas: 0, live entering maintenance replicas: 0, excess > replicas: 0, Is Open File: false, Datanodes having this block: > 10.255.43.57:50010 10.255.53.12:50010 10.255.63.12:50010 10.255.62.39:50010 > 10.255.37.36:50010 10.255.33.15:50010 10.255.69.29:50010 10.255.51.13:50010 > 10.255.64.15:50010 , Current Datanode: 10.255.69.29:50010, Is current > datanode decommissioning: true, Is current datanode entering maintenance: > false > 2019-10-22 15:58:51,514 DEBUG > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager: Node > 10.255.69.29:50010 still has 1 blocks to replicate before it is a candidate > to finish Decommission In Progress > {quote} > After digging the source code and cluster log, guess it happens as follow > steps. > # Storage strategy is RS-6-3-1024k. 
> # EC block b consists of b0, b1, b2, b3, b4, b5, b6, b7, b8; b0 is from > datanode dn0, b1 is from datanode dn1, ...etc > # At the beginning dn0 is in decommission progress; b0 is replicated > successfully, and dn0 is still in decommission progress. > # Later b1, b2, b3 are in decommission progress, and dn4 containing b4 is out of > service, so reconstruction is needed and an ErasureCodingWork is created to do it; in > the ErasureCodingWork, additionalReplRequired is 4. > # Because hasAllInternalBlocks is false, it will call > ErasureCodingWork#addTaskToDatanode -> > DatanodeDescriptor#addBlockToBeErasureCoded, and send a > BlockECReconstructionInfo task to the Datanode. > # The DataNode cannot reconstruct the block because targets is 4, greater > than 3 (the parity number). > There is a problem as follows, from BlockManager.java#scheduleReconstruction > {code} > // should reconstruct all the internal blocks before scheduling > //
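The live/decommissioning exclusivity argued in the comment above can be sketched with a small counting routine. This is an illustrative model of the fix idea, not the actual countReplicasForStripedBlock patch; the method name and input shape are assumptions:

```java
import java.util.BitSet;

// Sketch: count LIVE and DECOMMISSIONING replicas of a striped block so that
// a block index already seen as LIVE is never also counted as DECOMMISSIONING.
public class StripedReplicaCount {
    // blockIndices[i] is the internal-block index of replica i;
    // live[i] is true if that replica is on a live (non-decommissioning) node.
    static int[] count(int[] blockIndices, boolean[] live) {
        BitSet liveBitSet = new BitSet();
        BitSet decommissioningBitSet = new BitSet();
        int liveCount = 0, decommissioning = 0;
        for (int i = 0; i < blockIndices.length; i++) {
            int idx = blockIndices[i];
            if (live[i]) {
                liveCount++;
                // If this index was already counted as decommissioning, undo it.
                if (decommissioningBitSet.get(idx)) {
                    decommissioning--;
                    decommissioningBitSet.clear(idx);
                }
                liveBitSet.set(idx);
            } else if (!liveBitSet.get(idx) && !decommissioningBitSet.get(idx)) {
                // Count each decommissioning index once, and only if not live.
                decommissioning++;
                decommissioningBitSet.set(idx);
            }
        }
        return new int[] { liveCount, decommissioning };
    }

    public static void main(String[] args) {
        // Index 0 exists both as a live replica and on a decommissioning node,
        // so decommissioning should count only indices 1, 2, 3.
        int[] indices = { 0, 0, 1, 2, 3 };
        boolean[] live = { true, false, false, false, false };
        int[] c = count(indices, live);
        System.out.println("live=" + c[0] + " decommissioning=" + c[1]);
    }
}
```

With this counting, the scenario in the report yields a decommissioning count of 3 rather than 4, which is exactly the correction the patch discussion is after.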
[jira] [Commented] (HDFS-14920) Erasure Coding: Decommission may hang If one or more datanodes are out of service during decommission
[ https://issues.apache.org/jira/browse/HDFS-14920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960829#comment-16960829 ] Fei Hui commented on HDFS-14920: {quote} other storages contains this internal block should be decommissioning {quote} This comment was wrong; I have fixed it. The function *countReplicasForStripedBlock* recomputes the LIVE replica count for the same internal block. One caller uses it as follows: {code} // Count replicas on decommissioning nodes, as these will not be // decommissioned unless recovery/completing last block has finished NumberReplicas numReplicas = countNodes(lastBlock); int numUsableReplicas = numReplicas.liveReplicas() + numReplicas.decommissioning() + numReplicas.liveEnteringMaintenanceReplicas(); {code} If the same internal block were counted both in liveReplicas and in decommissioning replicas, the sum numReplicas.liveReplicas() + numReplicas.decommissioning() would not make sense. So the same internal block should be either in liveReplicas or in decommissioning replicas, but not both. > Erasure Coding: Decommission may hang If one or more datanodes are out of > service during decommission > --- > > Key: HDFS-14920 > URL: https://issues.apache.org/jira/browse/HDFS-14920 > Project: Hadoop HDFS > Issue Type: Bug > Components: ec >Affects Versions: 3.0.3, 3.2.1, 3.1.3 >Reporter: Fei Hui >Assignee: Fei Hui >Priority: Major > Attachments: HDFS-14920.001.patch, HDFS-14920.002.patch > > > Decommission test hangs in our clusters. 
> Have seen the messages as follow > {quote} > 2019-10-22 15:58:51,514 TRACE > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager: Block > blk_-9223372035600425840_372987973 numExpected=9, numLive=5 > 2019-10-22 15:58:51,514 INFO BlockStateChange: Block: > blk_-9223372035600425840_372987973, Expected Replicas: 9, live replicas: 5, > corrupt replicas: 0, decommissioned replicas: 0, decommissioning replicas: 4, > maintenance replicas: 0, live entering maintenance replicas: 0, excess > replicas: 0, Is Open File: false, Datanodes having this block: > 10.255.43.57:50010 10.255.53.12:50010 10.255.63.12:50010 10.255.62.39:50010 > 10.255.37.36:50010 10.255.33.15:50010 10.255.69.29:50010 10.255.51.13:50010 > 10.255.64.15:50010 , Current Datanode: 10.255.69.29:50010, Is current > datanode decommissioning: true, Is current datanode entering maintenance: > false > 2019-10-22 15:58:51,514 DEBUG > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager: Node > 10.255.69.29:50010 still has 1 blocks to replicate before it is a candidate > to finish Decommission In Progress > {quote} > After digging the source code and cluster log, guess it happens as follow > steps. > # Storage strategy is RS-6-3-1024k. > # EC block b consists of b0, b1, b2, b3, b4, b5, b6, b7, b8, b0 is from > datanode dn0, b1 is from datanode dn1, ...etc > # At the beginning dn0 is in decommission progress, b0 is replicated > successfully, and dn0 is staill in decommission progress. > # Later b1, b2, b3 in decommission progress, and dn4 containing b4 is out of > service, so need to reconstruct, and create ErasureCodingWork to do it, in > the ErasureCodingWork, additionalReplRequired is 4 > # Because hasAllInternalBlocks is false, Will call > ErasureCodingWork#addTaskToDatanode -> > DatanodeDescriptor#addBlockToBeErasureCoded, and send > BlockECReconstructionInfo task to Datanode > # DataNode can not reconstruction the block because targets is 4, greater > than 3( parity number). 
> There is a problem as follow, from BlockManager.java#scheduleReconstruction > {code} > // should reconstruct all the internal blocks before scheduling > // replication task for decommissioning node(s). > if (additionalReplRequired - numReplicas.decommissioning() - > numReplicas.liveEnteringMaintenanceReplicas() > 0) { > additionalReplRequired = additionalReplRequired - > numReplicas.decommissioning() - > numReplicas.liveEnteringMaintenanceReplicas(); > } > {code} > Should reconstruction firstly and then replicate for decommissioning. Because > numReplicas.decommissioning() is 4, and additionalReplRequired is 4, that's > wrong, > numReplicas.decommissioning() should be 3, it should exclude live replica. > If so, additionalReplRequired will be 1, reconstruction will schedule as > expected. After that, decommission goes on. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail:
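The arithmetic in the report can be checked with a tiny model of the additionalReplRequired adjustment quoted above. The helper below is illustrative; only the subtraction mirrors the snippet from BlockManager#scheduleReconstruction, and the numbers come from the RS-6-3 scenario described:

```java
// Model of the quoted adjustment: additionalReplRequired is reduced by the
// decommissioning and entering-maintenance counts, but only when the result
// stays positive (otherwise the original value is kept, as in the quoted code).
public class ReplWorkMath {
    static int additionalReplRequired(int required, int decommissioning,
                                      int enteringMaintenance) {
        int remaining = required - decommissioning - enteringMaintenance;
        return remaining > 0 ? remaining : required;
    }

    public static void main(String[] args) {
        // Buggy counting: decommissioning = 4, so 4 - 4 = 0 and required
        // stays 4; the DataNode is asked for 4 targets (> 3 parity blocks)
        // and reconstruction fails.
        System.out.println(additionalReplRequired(4, 4, 0));
        // Corrected counting excludes the live index: decommissioning = 3,
        // so 4 - 3 = 1 and reconstruction can be scheduled as expected.
        System.out.println(additionalReplRequired(4, 3, 0));
    }
}
```

This matches the conclusion in the description: with decommissioning counted as 3 instead of 4, additionalReplRequired becomes 1 and decommission can proceed.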
[jira] [Commented] (HDFS-14920) Erasure Coding: Decommission may hang If one or more datanodes are out of service during decommission
[ https://issues.apache.org/jira/browse/HDFS-14920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960836#comment-16960836 ] Fei Hui commented on HDFS-14920: [~gjhkael] Yes. You got it!
[jira] [Commented] (HDFS-14920) Erasure Coding: Decommission may hang If one or more datanodes are out of service during decommission
[ https://issues.apache.org/jira/browse/HDFS-14920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960817#comment-16960817 ] guojh commented on HDFS-14920: -- Thanks [~ferhui], I am clear now. You just want to correct the decommissioning counter: if indices [0, 1, 2, 3] are decommissioning, and another replica of index 0 has been replicated successfully, the decommissioning counter should be 3, not 4.
[jira] [Comment Edited] (HDFS-14920) Erasure Coding: Decommission may hang If one or more datanodes are out of service during decommission
[ https://issues.apache.org/jira/browse/HDFS-14920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960795#comment-16960795 ] Fei Hui edited comment on HDFS-14920 at 10/28/19 7:23 AM: -- [~ayushtkn] Thanks for your review. {code} // Sub decommissioning because the index replica is live. if (decommissioningBitSet.get(blockIndex)) { counters.subtract(StoredReplicaState.DECOMMISSIONING, 1); } else { decommissioningBitSet.set(blockIndex); } {code} We set the *blockIndex* bit because we have entered the if clause below: {code} if (state == StoredReplicaState.LIVE) { {code} If the *blockIndex* internal block is in the LIVE state, this block on other storages should not be counted as decommissioning while we compute live and decommissioning replicas. The *blockIndex* internal block will be either live or decommissioning; it cannot be both. was (Author: ferhui): [~ayushtkn] Thanks for your review {code} // Sub decommissioning because the index replica is live. if (decommissioningBitSet.get(blockIndex)) { counters.subtract(StoredReplicaState.DECOMMISSIONING, 1); } else { decommissioningBitSet.set(blockIndex); } {code} We set the *blockIndex* internal block. Because having enter if clause as bellow {code} if (state == StoredReplicaState.LIVE) { {code} If the *blockIndex* internal block is in live state, other storages contains this internal block should be decommissioning while we compute live and decommissioning replicas. The *blockIndex* internal block will be live or decommissioning, it could not be both live and decommissioning. 
[jira] [Commented] (HDFS-14920) Erasure Coding: Decommission may hang If one or more datanodes are out of service during decommission
[ https://issues.apache.org/jira/browse/HDFS-14920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960813#comment-16960813 ] Ayush Saxena commented on HDFS-14920: - Where are you using this {{decommissioningBitSet.set(blockIndex);}} from the else part? {quote}other storages contains this internal block should be decommissioning {quote} There is a *"should be"*; we aren't sure of all the cases. Maybe theoretically it should be, but if there is a bug it might not be, which could lead to some very abnormal scenarios. I don't think putting it in the decommissioning bit set, without confirming it is in the decommissioning state, is a good idea. [~gjhkael] had some questions too; can you clarify his doubts?
[jira] [Commented] (HDDS-2370) Remove classpath in RunningWithHDFS.md ozone-hdfs/docker-compose as dir 'ozoneplugin' is not exist anymore
[ https://issues.apache.org/jira/browse/HDDS-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960809#comment-16960809 ] Attila Doroszlai commented on HDDS-2370: Thanks for reporting this problem [~Huachao]. I think the datanode plugin behavior should be fixed instead of removing it from the documentation. Would you like to work on it, or may I? > Remove classpath in RunningWithHDFS.md ozone-hdfs/docker-compose as dir > 'ozoneplugin' is not exist anymore > -- > > Key: HDDS-2370 > URL: https://issues.apache.org/jira/browse/HDDS-2370 > Project: Hadoop Distributed Data Store > Issue Type: Task > Components: documentation >Reporter: luhuachao >Priority: Major > Attachments: HDDS-2370.1.patch > > > In RunningWithHDFS.md > {code:java} > export > HADOOP_CLASSPATH=/opt/ozone/share/hadoop/ozoneplugin/hadoop-ozone-datanode-plugin.jar{code} > ozone-hdfs/docker-compose.yaml > > {code:java} > environment: > HADOOP_CLASSPATH: /opt/ozone/share/hadoop/ozoneplugin/*.jar > {code} > When I run HddsDatanodeService as a plugin in the HDFS datanode, it fails with the error below; there is no constructor without parameters. > > > {code:java} > 2019-10-21 21:38:56,391 ERROR datanode.DataNode > (DataNode.java:startPlugins(972)) - Unable to load DataNode plugins. > Specified list of plugins: org.apache.hadoop.ozone.HddsDatanodeService > java.lang.RuntimeException: java.lang.NoSuchMethodException: > org.apache.hadoop.ozone.HddsDatanodeService.() > {code} > What I suspect is: does ozone-0.5 no longer support running as a plugin in the HDFS datanode? If so, why don't we remove the doc RunningWithHDFS.md? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
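The NoSuchMethodException above is the classic failure mode of reflective plugin loading: the DataNode instantiates each configured plugin class through a no-arg constructor, so a class that only declares parameterized constructors cannot be loaded. A minimal illustration (the class names here are invented for the example, not Ozone code):

```java
public class PluginLoadDemo {
    // A plugin class with only a parameterized constructor: reflective
    // no-arg instantiation fails, as in the quoted DataNode error.
    public static class BadPlugin {
        public BadPlugin(String conf) { }
    }

    // A plugin class with a public no-arg constructor loads fine.
    public static class GoodPlugin {
        public GoodPlugin() { }
    }

    // Mimics reflective plugin instantiation: true if the class exposes
    // a usable no-arg constructor.
    public static boolean canInstantiate(Class<?> cls) {
        try {
            cls.getDeclaredConstructor().newInstance();
            return true;
        } catch (ReflectiveOperationException e) {
            // NoSuchMethodException lands here when no no-arg ctor exists
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(canInstantiate(GoodPlugin.class)); // prints true
        System.out.println(canInstantiate(BadPlugin.class));  // prints false
    }
}
```

So the fix on the Ozone side would presumably be to restore a no-arg constructor path for HddsDatanodeService rather than to delete the documentation.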
[jira] [Commented] (HDFS-14920) Erasure Coding: Decommission may hang If one or more datanodes are out of service during decommission
[ https://issues.apache.org/jira/browse/HDFS-14920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960803#comment-16960803 ] Fei Hui commented on HDFS-14920: [~gjhkael] Thanks for the review. We should consider this issue as a whole, find the root cause, and then fix it. I think the uploaded patch is also simple, and I have considered your approach. Maybe we should consider the comment below: {quote} // should reconstruct all the internal blocks before scheduling // replication task for decommissioning node(s). {quote} The patch can resolve this issue. With the patch, targets is 1 rather than 4, and reconstruction will succeed. > Erasure Coding: Decommission may hang If one or more datanodes are out of > service during decommission > --- > > Key: HDFS-14920 > URL: https://issues.apache.org/jira/browse/HDFS-14920 > Project: Hadoop HDFS > Issue Type: Bug > Components: ec >Affects Versions: 3.0.3, 3.2.1, 3.1.3 >Reporter: Fei Hui >Assignee: Fei Hui >Priority: Major > Attachments: HDFS-14920.001.patch, HDFS-14920.002.patch > > > Decommission test hangs in our clusters. 
[jira] [Commented] (HDDS-2322) DoubleBuffer flush termination and OM shutdown's after that.
[ https://issues.apache.org/jira/browse/HDDS-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960802#comment-16960802 ] Li Cheng commented on HDDS-2322: https://issues.apache.org/jira/browse/HDDS-2356 is still having issues. Do you mean to track here? [~bharat] > DoubleBuffer flush termination and OM shutdown's after that. > > > Key: HDDS-2322 > URL: https://issues.apache.org/jira/browse/HDDS-2322 > Project: Hadoop Distributed Data Store > Issue Type: Task >Reporter: Bharat Viswanadham >Assignee: Bharat Viswanadham >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > om1_1 | 2019-10-18 00:34:45,317 [OMDoubleBufferFlushThread] ERROR > - Terminating with exit status 2: OMDoubleBuffer flush > threadOMDoubleBufferFlushThreadencountered Throwable error > om1_1 | java.util.ConcurrentModificationException > om1_1 | at > java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1660) > om1_1 | at > java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484) > om1_1 | at > java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474) > om1_1 | at > java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913) > om1_1 | at > java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) > om1_1 | at > java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578) > om1_1 | at > org.apache.hadoop.ozone.om.helpers.OmKeyLocationInfoGroup.getProtobuf(OmKeyLocationInfoGroup.java:65) > om1_1 | at > java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195) > om1_1 | at > java.base/java.util.Collections$2.tryAdvance(Collections.java:4745) > om1_1 | at > java.base/java.util.Collections$2.forEachRemaining(Collections.java:4753) > om1_1 | at > java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484) > om1_1 | at > 
java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474) > om1_1 | at > java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913) > om1_1 | at > java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) > om1_1 | at > java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578) > om1_1 | at > org.apache.hadoop.ozone.om.helpers.OmKeyInfo.getProtobuf(OmKeyInfo.java:362) > om1_1 | at > org.apache.hadoop.ozone.om.codec.OmKeyInfoCodec.toPersistedFormat(OmKeyInfoCodec.java:37) > om1_1 | at > org.apache.hadoop.ozone.om.codec.OmKeyInfoCodec.toPersistedFormat(OmKeyInfoCodec.java:31) > om1_1 | at > org.apache.hadoop.hdds.utils.db.CodecRegistry.asRawData(CodecRegistry.java:68) > om1_1 | at > org.apache.hadoop.hdds.utils.db.TypedTable.putWithBatch(TypedTable.java:125) > om1_1 | at > org.apache.hadoop.ozone.om.response.key.OMKeyCreateResponse.addToDBBatch(OMKeyCreateResponse.java:58) > om1_1 | at > org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.lambda$flushTransactions$0(OzoneManagerDoubleBuffer.java:139) > om1_1 | at > java.base/java.util.Iterator.forEachRemaining(Iterator.java:133) > om1_1 | at > org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushTransactions(OzoneManagerDoubleBuffer.java:137) > om1_1 | at java.base/java.lang.Thread.run(Thread.java:834) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
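The stack trace bottoms out in ArrayListSpliterator.forEachRemaining, which is where ArrayList's fail-fast spliterator detects that the list was structurally modified while a stream was iterating it. The failure is easy to reproduce in isolation; this sketch (not Ozone code; method names are invented) shows both the crash and the common defensive-copy remedy:

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.List;

public class StreamCmeDemo {
    // Mutating the source list while a stream iterates it: the ArrayList
    // spliterator notices the modCount change and throws CME, just as in
    // the OMDoubleBuffer stack trace above.
    public static boolean mutateDuringStream(List<Integer> list) {
        try {
            list.stream().forEach(x -> {
                if (x == 1) {
                    list.add(99); // structural modification mid-iteration
                }
            });
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    // Iterating a defensive copy tolerates concurrent writers to the
    // original list; one common remedy when a flush thread serializes
    // state that request handlers may still be mutating. Whether this is
    // the right fix for OzoneManagerDoubleBuffer is a separate question.
    public static boolean mutateDuringCopiedStream(List<Integer> list) {
        try {
            new ArrayList<>(list).stream().forEach(x -> {
                if (x == 1) {
                    list.add(99); // original list mutated, copy unaffected
                }
            });
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }
}
```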
[jira] [Commented] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline
[ https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960799#comment-16960799 ] Li Cheng commented on HDDS-2356: [~bharat] In terms of reproduction, I have a dataset which includes small files as well as big files, and I'm using the s3 gateway from ozone and mounting the ozone cluster to a local path via goofys. All the data is recursively written to the mount path, which essentially leads to the ozone cluster. The ozone cluster is deployed on a 3-node VM env and each VM has only 1 disk for ozone data writing. I think it's a pretty simple scenario to reproduce. The sole operation is writing to the ozone cluster through FUSE. > Multipart upload report errors while writing to ozone Ratis pipeline > > > Key: HDDS-2356 > URL: https://issues.apache.org/jira/browse/HDDS-2356 > Project: Hadoop Distributed Data Store > Issue Type: Bug > Components: Ozone Manager >Affects Versions: 0.4.1 > Environment: Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM > on a separate VM >Reporter: Li Cheng >Assignee: Bharat Viswanadham >Priority: Blocker > Fix For: 0.5.0 > > > Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say > it's VM0. > I use goofys as a FUSE and enable the ozone S3 gateway to mount ozone to a path > on VM0, while reading data from VM0's local disk and writing to the mount path. The > dataset has files of various sizes from 0 bytes to GB-level and it has > about 50,000 files. > The writing is slow (1GB for ~10 mins) and it stops after around 4GB. As I > look at the hadoop-root-om-VM_50_210_centos.out log, I see OM throwing errors > related to multipart upload. This error eventually causes the writing to > terminate and OM to be closed. 
> > 2019-10-24 16:01:59,527 [OMDoubleBufferFlushThread] ERROR - Terminating with > exit status 2: OMDoubleBuffer flush > threadOMDoubleBufferFlushThreadencountered Throwable error > java.util.ConcurrentModificationException > at java.util.TreeMap.forEach(TreeMap.java:1004) > at > org.apache.hadoop.ozone.om.helpers.OmMultipartKeyInfo.getProto(OmMultipartKeyInfo.java:111) > at > org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:38) > at > org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:31) > at > org.apache.hadoop.hdds.utils.db.CodecRegistry.asRawData(CodecRegistry.java:68) > at > org.apache.hadoop.hdds.utils.db.TypedTable.putWithBatch(TypedTable.java:125) > at > org.apache.hadoop.ozone.om.response.s3.multipart.S3MultipartUploadCommitPartResponse.addToDBBatch(S3MultipartUploadCommitPartResponse.java:112) > at > org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.lambda$flushTransactions$0(OzoneManagerDoubleBuffer.java:137) > at java.util.Iterator.forEachRemaining(Iterator.java:116) > at > org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushTransactions(OzoneManagerDoubleBuffer.java:135) > at java.lang.Thread.run(Thread.java:745) > 2019-10-24 16:01:59,629 [shutdown-hook-0] INFO - SHUTDOWN_MSG: -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
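Here the CME comes from TreeMap.forEach: OmMultipartKeyInfo.getProto iterates the part-key map while, apparently, another request commits a new part into it. TreeMap.forEach checks its modCount after every callback, so the insert is detected immediately. A minimal reproduction (names invented for the example, not Ozone code):

```java
import java.util.ConcurrentModificationException;
import java.util.TreeMap;

public class TreeMapCmeDemo {
    // Inserting a new entry from inside the map's own forEach throws CME
    // at once; this is the failure mode behind the getProto stack trace.
    public static boolean commitPartWhileSerializing(TreeMap<Integer, String> partMap) {
        try {
            partMap.forEach((partNumber, partName) -> {
                if (partNumber == 1) {
                    partMap.put(99, "late-arriving-part"); // concurrent commit
                }
            });
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    // Serializing a snapshot copy instead avoids the clash; the actual fix
    // in Ozone may differ (e.g. synchronizing part commits with the flush).
    public static boolean commitPartWhileSerializingCopy(TreeMap<Integer, String> partMap) {
        try {
            new TreeMap<>(partMap).forEach((partNumber, partName) -> {
                if (partNumber == 1) {
                    partMap.put(99, "late-arriving-part"); // copy unaffected
                }
            });
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }
}
```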
[jira] [Commented] (HDFS-13736) BlockPlacementPolicyDefault can not choose favored nodes when 'dfs.namenode.block-placement-policy.default.prefer-local-node' set to false
[ https://issues.apache.org/jira/browse/HDFS-13736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960798#comment-16960798 ] Ayush Saxena commented on HDFS-13736: - Thanx [~xiaodong.hu] for the patch. Seems fair enough. There is a checkstyle warning; I think we can't do anything about it, we can live with it. [~hexiaoqiao] if you get a chance, can you also give it a check? Plan to push this maybe by tomorrow, if no comments!!! > BlockPlacementPolicyDefault can not choose favored nodes when > 'dfs.namenode.block-placement-policy.default.prefer-local-node' set to false > -- > > Key: HDFS-13736 > URL: https://issues.apache.org/jira/browse/HDFS-13736 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.2.0 >Reporter: hu xiaodong >Assignee: hu xiaodong >Priority: Major > Attachments: HDFS-13736.001.patch, HDFS-13736.002.patch, > HDFS-13736.003.patch, HDFS-13736.004.patch, HDFS-13736.005.patch, > HDFS-13736.006.patch > > > BlockPlacementPolicyDefault can not choose favored nodes when > 'dfs.namenode.block-placement-policy.default.prefer-local-node' set to false. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline
[ https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960796#comment-16960796 ] Li Cheng commented on HDDS-2356: Also it prints the same pipeline id in s3g logs like crazy. Wonder if that's expected. [~bharat] 2019-10-28 11:43:08,912 [qtp1383524016-24] INFO - Allocating block with ExcludeList \{datanodes = [], containerIds = [], pipelineIds = []} ... PipelineID=3c94d3f5-3c0e-4994-9c63-dc487071be1a, PipelineID=3c94d3f5-3c0e-4994-9c63-dc487071be1a, PipelineID=3c94d3f5-3c0e-4994-9c63-dc487071be1a, ... (the same PipelineID repeated over a hundred more times in the log excerpt) > Multipart upload report errors while writing to ozone Ratis pipeline > > > Key: HDDS-2356 > URL: https://issues.apache.org/jira/browse/HDDS-2356 > Project: Hadoop Distributed Data Store > Issue Type: Bug > Components: Ozone Manager >
[jira] [Comment Edited] (HDFS-14920) Erasure Coding: Decommission may hang If one or more datanodes are out of service during decommission
[ https://issues.apache.org/jira/browse/HDFS-14920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960795#comment-16960795 ] Fei Hui edited comment on HDFS-14920 at 10/28/19 6:39 AM: -- [~ayushtkn] Thanks for your review. {code} // Sub decommissioning because the index replica is live. if (decommissioningBitSet.get(blockIndex)) { counters.subtract(StoredReplicaState.DECOMMISSIONING, 1); } else { decommissioningBitSet.set(blockIndex); } {code} We set the *blockIndex* internal block because we have entered the if clause below: {code} if (state == StoredReplicaState.LIVE) { {code} If the *blockIndex* internal block is in the live state, other storages containing this internal block should be decommissioning while we compute live and decommissioning replicas. The *blockIndex* internal block will be either live or decommissioning; it cannot be both. was (Author: ferhui): [~ayushtkn] Thanks for your review {code} // Sub decommissioning because the index replica is live. if (decommissioningBitSet.get(blockIndex)) { counters.subtract(StoredReplicaState.DECOMMISSIONING, 1); } else { decommissioningBitSet.set(blockIndex); } {code} We set the *blockIndex* internal block. Because having enter if clause as bellow {code} if (state == StoredReplicaState.LIVE) { {code} It the *blockIndex* internal block is in live state, other storages contains this internal block should be decommissioning while we compute live and decommissioning replicas. The *blockIndex* internal block will be live or decommissioning, it could not be both live and decommissioning. 
[jira] [Commented] (HDFS-14920) Erasure Coding: Decommission may hang If one or more datanodes are out of service during decommission
[ https://issues.apache.org/jira/browse/HDFS-14920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960795#comment-16960795 ] Fei Hui commented on HDFS-14920: [~ayushtkn] Thanks for your review. {code} // Sub decommissioning because the index replica is live. if (decommissioningBitSet.get(blockIndex)) { counters.subtract(StoredReplicaState.DECOMMISSIONING, 1); } else { decommissioningBitSet.set(blockIndex); } {code} We set the *blockIndex* internal block because we have entered the if clause below: {code} if (state == StoredReplicaState.LIVE) { {code} If the *blockIndex* internal block is in the live state, other storages containing this internal block should be decommissioning while we compute live and decommissioning replicas. The *blockIndex* internal block will be either live or decommissioning; it cannot be both. > Erasure Coding: Decommission may hang If one or more datanodes are out of > service during decommission > --- > > Key: HDFS-14920 > URL: https://issues.apache.org/jira/browse/HDFS-14920 > Project: Hadoop HDFS > Issue Type: Bug > Components: ec >Affects Versions: 3.0.3, 3.2.1, 3.1.3 >Reporter: Fei Hui >Assignee: Fei Hui >Priority: Major > Attachments: HDFS-14920.001.patch, HDFS-14920.002.patch > > > Decommission test hangs in our clusters. 
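The counting idea in the patch can be modeled in isolation. This hedged sketch (not the actual NumberReplicas/BitSet code from the patch; a simplified two-pass equivalent) counts DECOMMISSIONING replicas for a striped block while excluding any block index that already has a LIVE copy, which turns the 4 from the issue description into the expected 3:

```java
import java.util.BitSet;

public class StripedDecomCount {
    // replicas[i] = {blockIndex, isLive}; isLive == 1 means this storage
    // holds a LIVE replica of that internal block, 0 means DECOMMISSIONING.
    public static int decommissioningExcludingLive(int[][] replicas) {
        BitSet liveIndices = new BitSet();
        for (int[] r : replicas) {
            if (r[1] == 1) {
                liveIndices.set(r[0]); // remember indices with a live copy
            }
        }
        int decommissioning = 0;
        for (int[] r : replicas) {
            // count a decommissioning replica only when no live copy of
            // the same internal block exists
            if (r[1] == 0 && !liveIndices.get(r[0])) {
                decommissioning++;
            }
        }
        return decommissioning;
    }
}
```

In the scenario above, b0 was already re-replicated, so index 0 has both a live and a decommissioning replica; indices 1, 2, 3 are decommissioning only. The count is 3 instead of 4, leaving additionalReplRequired at 1.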
[jira] [Commented] (HDFS-14284) RBF: Log Router identifier when reporting exceptions
[ https://issues.apache.org/jira/browse/HDFS-14284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960790#comment-16960790 ] Ayush Saxena commented on HDFS-14284: - Thanx [~hemanthboyina] for the patch. Couple of comments: {code:java} -throw new NoNamenodesAvailableException(nsId, ioe); +throw new NoNamenodesAvailableException( +nsId + " from router " + router.getRouterId(), ioe); {code} No need to append the text here to the nsId variable; it doesn't make sense to have a message in a variable which is intended to store the NsId. Add a param for the Router ID, and do the message appending and all inside the Exception method, to make the actual code flow look clean. {code:java} - throw new IOException("No namenodes to invoke " + method.getName() + - " with params " + Arrays.deepToString(params) + " from " - + router.getRouterId()); + throw new RouterIOException("No namenodes to invoke " + method.getName() + + " with params " + Arrays.deepToString(params), + router.getRouterId()); {code} If I see, earlier the text was "from ROUTERID", not "from router ROUTERID" {code:java} .append(" from router ") {code} So, better we keep the text the same; don't add "router" here. Somebody parsing the string would fail if we tweak the text. For the test: * Derive the RouterIOException from the RemoteException, and check that {{getMessage}} and {{getRouterID}} give the correct stuff. * No need to remove the NoNamenodeException test; we are changing that too, better keep it too. If both exceptions share some code flow, put them in the same test and name the test a little generic; otherwise separate them, but try to reuse the code if possible, by refactoring into a method, if you can't keep it in one test. 
> RBF: Log Router identifier when reporting exceptions > > > Key: HDFS-14284 > URL: https://issues.apache.org/jira/browse/HDFS-14284 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Íñigo Goiri >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-14284.001.patch, HDFS-14284.002.patch, > HDFS-14284.003.patch, HDFS-14284.004.patch, HDFS-14284.005.patch, > HDFS-14284.006.patch, HDFS-14284.007.patch > > > The typical setup is to use multiple Routers through > ConfiguredFailoverProxyProvider. > In a regular HA Namenode setup, it is easy to know which NN was used. > However, in RBF, any Router can be the one reporting the exception and it is > hard to know which was the one. > We should have a way to identify which Router/Namenode was the one triggering > the exception. > This would also apply with Observer Namenodes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
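The review asks for an exception type that carries the router ID as its own parameter and builds the message internally, keeping the historical "... from ROUTERID" text stable. A hedged sketch of that shape (the class and accessor names follow the review's wording, not committed HDFS code):

```java
import java.io.IOException;

public class RouterIOExceptionSketch {
    // Exception that appends " from <routerId>" itself, keeping call
    // sites clean and the message text stable for anyone parsing it.
    public static class RouterIOException extends IOException {
        private final String routerId;

        public RouterIOException(String message, String routerId) {
            super(message + " from " + routerId); // preserves old text shape
            this.routerId = routerId;
        }

        public String getRouterId() {
            return routerId;
        }
    }
}
```

A call site then reads `throw new RouterIOException("No namenodes to invoke " + method.getName(), router.getRouterId())`, with no message assembly spread across callers.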
[jira] [Comment Edited] (HDFS-14284) RBF: Log Router identifier when reporting exceptions
[ https://issues.apache.org/jira/browse/HDFS-14284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960790#comment-16960790 ] Ayush Saxena edited comment on HDFS-14284 at 10/28/19 6:30 AM: --- Thanx [~hemanthboyina] for the patch. Couple of comments: {code:java} -throw new NoNamenodesAvailableException(nsId, ioe); +throw new NoNamenodesAvailableException( +nsId + " from router " + router.getRouterId(), ioe); {code} No need to append the text here to the nsId variable; it doesn't make sense to have a message in a variable which is intended to store the NsId. Add a param for the Router ID, and do the message appending and all inside the Exception method, to make the actual code flow look clean. {code:java} - throw new IOException("No namenodes to invoke " + method.getName() + - " with params " + Arrays.deepToString(params) + " from " - + router.getRouterId()); + throw new RouterIOException("No namenodes to invoke " + method.getName() + + " with params " + Arrays.deepToString(params), + router.getRouterId()); {code} If I see, earlier the text was "from ROUTERID", not "from router ROUTERID" {code:java} .append(" from router ") {code} So, better we keep the text the same; don't add "router" here. Somebody parsing the string would fail if we tweak the text. For the test: * Derive the RouterIOException from the RemoteException, and check that {{getMessage}} and {{getRouterID}} give the correct stuff. * No need to remove the NoNamenodeException test; we are changing that too, better keep it too. If both exceptions share some code flow, put them in the same test and name the test a little generic; otherwise separate them, but try to reuse the code if possible, by refactoring into a method, if you can't keep it in one test. was (Author: ayushtkn): Thanx [~hemanthboyina] for the patch. 
Couple of comments : {code:java} -throw new NoNamenodesAvailableException(nsId, ioe); +throw new NoNamenodesAvailableException( +nsId + " from router " + router.getRouterId(), ioe); {code} No need to append the text here in the nsId variable, Doesn't make sense to have message for a variable which intends to store NsId, Add a param for Router ID, and do the message appending part and all inside the Exception method, To make the actual code flow look clean. {code:java} - throw new IOException("No namenodes to invoke " + method.getName() + - " with params " + Arrays.deepToString(params) + " from " - + router.getRouterId()); + throw new RouterIOException("No namenodes to invoke " + method.getName() + + " with params " + Arrays.deepToString(params), + router.getRouterId()); {code} If I see earlier the text was from ROUTERID not from router ROUTERID {code:java} .append(" from router ") {code} So, better we keep the text same, don't add router here, Somebody parsing the string would fail, if we tweak the text. For the test : * Derrive the RouterIOException from the RemoteException, And check the {{getMessage}} and {{getRouterID}} are giving correct stuff. * No need to remove the NoNamenodeException Test, We are changing that too, Better keep that too, If the both exceptions are using some code flow, put them in same test and name the test a little genreic, otherwise sepaerate them, But try to reuse the code if possible, by refactoring into a method, if you can't keep in one test. 
* > RBF: Log Router identifier when reporting exceptions > > > Key: HDFS-14284 > URL: https://issues.apache.org/jira/browse/HDFS-14284 > Project: Hadoop HDFS > Issue Type: Sub-task >Reporter: Íñigo Goiri >Assignee: hemanthboyina >Priority: Major > Attachments: HDFS-14284.001.patch, HDFS-14284.002.patch, > HDFS-14284.003.patch, HDFS-14284.004.patch, HDFS-14284.005.patch, > HDFS-14284.006.patch, HDFS-14284.007.patch > > > The typical setup is to use multiple Routers through > ConfiguredFailoverProxyProvider. > In a regular HA Namenode setup, it is easy to know which NN was used. > However, in RBF, any Router can be the one reporting the exception and it is > hard to know which was the one. > We should have a way to identify which Router/Namenode was the one triggering > the exception. > This would also apply with Observer Namenodes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14768) EC : Busy DN replica should be consider in live replica check.
[ https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] guojh updated HDFS-14768: - Attachment: HDFS-14768.010.patch > EC : Busy DN replica should be consider in live replica check. > -- > > Key: HDFS-14768 > URL: https://issues.apache.org/jira/browse/HDFS-14768 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode, erasure-coding, hdfs, namenode >Affects Versions: 3.0.2 >Reporter: guojh >Assignee: guojh >Priority: Major > Labels: patch > Attachments: 1568275810244.jpg, 1568276338275.jpg, 1568771471942.jpg, > HDFS-14768.000.patch, HDFS-14768.001.patch, HDFS-14768.002.patch, > HDFS-14768.003.patch, HDFS-14768.004.patch, HDFS-14768.005.patch, > HDFS-14768.006.patch, HDFS-14768.007.patch, HDFS-14768.008.patch, > HDFS-14768.009.patch, HDFS-14768.010.patch, HDFS-14768.jpg, > guojh_UT_after_deomission.txt, guojh_UT_before_deomission.txt, > zhaoyiming_UT_after_deomission.txt, zhaoyiming_UT_beofre_deomission.txt > > > Policy is RS-6-3-1024K, version is hadoop 3.0.2; > We suppose a file's block Index is [0,1,2,3,4,5,6,7,8], And decommission > index[3,4], increase the index 6 datanode's > pendingReplicationWithoutTargets that make it large than > replicationStreamsHardLimit(we set 14). Then, After the method > chooseSourceDatanodes of BlockMananger, the liveBlockIndices is > [0,1,2,3,4,5,7,8], Block Counter is, Live:7, Decommission:2. > In method scheduleReconstruction of BlockManager, the additionalReplRequired > is 9 - 7 = 2. After Namenode choose two target Datanode, will assign a > erasureCode task to target datanode. > When datanode get the task will build targetIndices from liveBlockIndices > and target length. the code is blow. 
> {code:java} > // code placeholder > targetIndices = new short[targets.length]; > private void initTargetIndices() { > BitSet bitset = reconstructor.getLiveBitSet(); > int m = 0; hasValidTargets = false; > for (int i = 0; i < dataBlkNum + parityBlkNum; i++) { > if (!bitset.get) { > if (reconstructor.getBlockLen > 0) { > if (m < targets.length) { > targetIndices[m++] = (short)i; > hasValidTargets = true; > } > } > } > } > {code} > targetIndices[0]=6, and targetIndices[1] is aways 0 from initial value. > The StripedReader is aways create reader from first 6 index block, and is > [0,1,2,3,4,5] > Use the index [0,1,2,3,4,5] to build target index[6,0] will trigger the isal > bug. the block index6's data is corruption(all data is zero). > I write a unit test can stabilize repreduce. > {code:java} > // code placeholder > private int replicationStreamsHardLimit = > DFSConfigKeys.DFS_NAMENODE_REPLICATION_STREAMS_HARD_LIMIT_DEFAULT; > numDNs = dataBlocks + parityBlocks + 10; > @Test(timeout = 24) > public void testFileDecommission() throws Exception { > LOG.info("Starting test testFileDecommission"); > final Path ecFile = new Path(ecDir, "testFileDecommission"); > int writeBytes = cellSize * dataBlocks; > writeStripedFile(dfs, ecFile, writeBytes); > Assert.assertEquals(0, bm.numOfUnderReplicatedBlocks()); > FileChecksum fileChecksum1 = dfs.getFileChecksum(ecFile, writeBytes); > final INodeFile fileNode = cluster.getNamesystem().getFSDirectory() > .getINode4Write(ecFile.toString()).asFile(); > LocatedBlocks locatedBlocks = > StripedFileTestUtil.getLocatedBlocks(ecFile, dfs); > LocatedBlock lb = dfs.getClient().getLocatedBlocks(ecFile.toString(), 0) > .get(0); > DatanodeInfo[] dnLocs = lb.getLocations(); > LocatedStripedBlock lastBlock = > (LocatedStripedBlock)locatedBlocks.getLastLocatedBlock(); > DatanodeInfo[] storageInfos = lastBlock.getLocations(); > // > DatanodeDescriptor datanodeDescriptor = > cluster.getNameNode().getNamesystem() > > 
.getBlockManager().getDatanodeManager().getDatanode(storageInfos[6].getDatanodeUuid()); > BlockInfo firstBlock = fileNode.getBlocks()[0]; > DatanodeStorageInfo[] dStorageInfos = bm.getStorages(firstBlock); > // the first heartbeat will consume 3 replica tasks > for (int i = 0; i <= replicationStreamsHardLimit + 3; i++) { > BlockManagerTestUtil.addBlockToBeReplicated(datanodeDescriptor, new > Block(i), > new DatanodeStorageInfo[]{dStorageInfos[0]}); > } > assertEquals(dataBlocks + parityBlocks, dnLocs.length); > int[] decommNodeIndex = {3, 4}; > final List decommisionNodes = new ArrayList(); > // add the node which will be decommissioning > decommisionNodes.add(dnLocs[decommNodeIndex[0]]); > decommisionNodes.add(dnLocs[decommNodeIndex[1]]); > decommissionNode(0, decommisionNodes, AdminStates.DECOMMISSIONED); > assertEquals(decommisionNodes.size(), fsn.getNumDecomLiveDataNodes()); >
[jira] [Commented] (HDFS-14768) EC : Busy DN replica should be consider in live replica check.
[ https://issues.apache.org/jira/browse/HDFS-14768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960786#comment-16960786 ] Hadoop QA commented on HDFS-14768: -- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 1m 4s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 2 new or modified test files. {color} | || || || || {color:brown} trunk Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 20m 56s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 2s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 45s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 8s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 15m 39s{color} | {color:green} branch has no errors when building and testing our client artifacts. 
{color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 35s{color} | {color:green} trunk passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 20s{color} | {color:green} trunk passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 1m 12s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 1m 15s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 1m 15s{color} | {color:green} the patch passed {color} | | {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 47s{color} | {color:orange} hadoop-hdfs-project/hadoop-hdfs: The patch generated 2 new + 201 unchanged - 1 fixed = 203 total (was 202) {color} | | {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 1m 14s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 15m 34s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} | | {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 2m 45s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 1m 18s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:red}-1{color} | {color:red} unit {color} | {color:red}115m 53s{color} | {color:red} hadoop-hdfs in the patch failed. 
{color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 35s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}185m 24s{color} | {color:black} {color} | \\ \\ || Reason || Tests || | Failed junit tests | hadoop.hdfs.TestCrcCorruption | | | hadoop.hdfs.server.namenode.TestNameNodeMXBean | | | hadoop.hdfs.TestReconstructStripedFileWithRandomECPolicy | | | hadoop.hdfs.TestErasureCodingPoliciesWithRandomECPolicy | | | hadoop.hdfs.qjournal.server.TestJournalNodeSync | \\ \\ || Subsystem || Report/Notes || | Docker | Client=19.03.4 Server=19.03.4 Image:yetus/hadoop:104ccca9169 | | JIRA Issue | HDFS-14768 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12984136/HDFS-14768.009.patch | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle | | uname | Linux f3e1018a1ea1 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 05:24:09 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | /testptch/patchprocess/precommit/personality/provided.sh | | git revision | trunk / 7be5508 | | maven | version: Apache Maven 3.3.9 | | Default Java | 1.8.0_222 | | findbugs | v3.1.0-RC1 | | checkstyle | https://builds.apache.org/job/PreCommit-HDFS-Build/28187/artifact/out/diff-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt | | unit |
[jira] [Commented] (HDFS-14920) Erasure Coding: Decommission may hang If one or more datanodes are out of service during decommission
[ https://issues.apache.org/jira/browse/HDFS-14920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960783#comment-16960783 ] Ayush Saxena commented on HDFS-14920: - Thanx [~ferhui] for the patch. Couple of comments : {code:java} +BitSet liveBitSet = isStriped ? +new BitSet(((BlockInfoStriped) block).getTotalBlockNum()) : null; +BitSet decommissioningBitSet = isStriped ? new BitSet(((BlockInfoStriped) block).getTotalBlockNum()) : null;{code} No need to do the same check isStriped and do the same stuff again for both {{liveBitset}} and {{decommissioningBitSet}}, either assign {{liveBitset}} directly to {{decommissioningBitSet}}, or pull a common if check above, I prefer the first one though, if you think readability is getting compromised, may be put a one line comment if required. I have a doubt here : {code:java} if (state == StoredReplicaState.LIVE) { if (!liveBitSet.get(blockIndex)) { liveBitSet.set(blockIndex); // Sub decommissioning because the index replica is live. if (decommissioningBitSet.get(blockIndex)) { counters.subtract(StoredReplicaState.DECOMMISSIONING, 1); } else { decommissioningBitSet.set(blockIndex); } } else { counters.subtract(StoredReplicaState.LIVE, 1); counters.add(StoredReplicaState.REDUNDANT, 1); } } {code} If state for first is live, you are setting in the liveBitset because it is live, fair enough, then again if check is there if that is there in the decommissioning bitset too, you reduce the counter, thats Ok, but if it isn't there in the decommissioning bitset in the else part, you are adding it in the decommissioning Bitset why? How do you say it is decommissioning? Here : {code:java} // Sub decommissioning because the index replica is live. if (decommissioningBitSet.get(blockIndex)) { counters.subtract(StoredReplicaState.DECOMMISSIONING, 1); } else { decommissioningBitSet.set(blockIndex); } {code} Can you explain. 
In the tests, you can use, Lambda's Something like this : {code:java} GenericTestUtils.waitFor(new Supplier() { @Override public Boolean get() { return bm.countNodes(blockInfo).decommissioning() == numDecommission; } }, 100, 1); {code} To : {code:java} GenericTestUtils.waitFor( () -> bm.countNodes(blockInfo).decommissioning() == numDecommission, 100, 1); {code} Similarly for others. > Erasure Coding: Decommission may hang If one or more datanodes are out of > service during decommission > --- > > Key: HDFS-14920 > URL: https://issues.apache.org/jira/browse/HDFS-14920 > Project: Hadoop HDFS > Issue Type: Bug > Components: ec >Affects Versions: 3.0.3, 3.2.1, 3.1.3 >Reporter: Fei Hui >Assignee: Fei Hui >Priority: Major > Attachments: HDFS-14920.001.patch, HDFS-14920.002.patch > > > Decommission test hangs in our clusters. > Have seen the messages as follow > {quote} > 2019-10-22 15:58:51,514 TRACE > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager: Block > blk_-9223372035600425840_372987973 numExpected=9, numLive=5 > 2019-10-22 15:58:51,514 INFO BlockStateChange: Block: > blk_-9223372035600425840_372987973, Expected Replicas: 9, live replicas: 5, > corrupt replicas: 0, decommissioned replicas: 0, decommissioning replicas: 4, > maintenance replicas: 0, live entering maintenance replicas: 0, excess > replicas: 0, Is Open File: false, Datanodes having this block: > 10.255.43.57:50010 10.255.53.12:50010 10.255.63.12:50010 10.255.62.39:50010 > 10.255.37.36:50010 10.255.33.15:50010 10.255.69.29:50010 10.255.51.13:50010 > 10.255.64.15:50010 , Current Datanode: 10.255.69.29:50010, Is current > datanode decommissioning: true, Is current datanode entering maintenance: > false > 2019-10-22 15:58:51,514 DEBUG > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager: Node > 10.255.69.29:50010 still has 1 blocks to replicate before it is a candidate > to finish Decommission In Progress > {quote} > After digging the source code and cluster log, guess it 
happens as follow > steps. > # Storage strategy is RS-6-3-1024k. > # EC block b consists of b0, b1, b2, b3, b4, b5, b6, b7, b8, b0 is from > datanode dn0, b1 is from datanode dn1, ...etc > # At the beginning dn0 is in decommission progress, b0 is replicated > successfully, and dn0 is staill in decommission progress. > # Later b1, b2, b3 in decommission progress, and dn4 containing b4 is out of > service, so need to reconstruct, and create ErasureCodingWork to do it, in > the ErasureCodingWork, additionalReplRequired is 4 > # Because
[jira] [Commented] (HDFS-14920) Erasure Coding: Decommission may hang If one or more datanodes are out of service during decommission
[ https://issues.apache.org/jira/browse/HDFS-14920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960772#comment-16960772 ] guojh commented on HDFS-14920: -- [~ferhui] If the block group not has all internal blocks, the additionalReplRequired should compute by liveBlockIndicies, just according the block's real total block and live block indicies. Is it more sample? And you patch seems not solved this issue? Have I missing anything? > Erasure Coding: Decommission may hang If one or more datanodes are out of > service during decommission > --- > > Key: HDFS-14920 > URL: https://issues.apache.org/jira/browse/HDFS-14920 > Project: Hadoop HDFS > Issue Type: Bug > Components: ec >Affects Versions: 3.0.3, 3.2.1, 3.1.3 >Reporter: Fei Hui >Assignee: Fei Hui >Priority: Major > Attachments: HDFS-14920.001.patch, HDFS-14920.002.patch > > > Decommission test hangs in our clusters. > Have seen the messages as follow > {quote} > 2019-10-22 15:58:51,514 TRACE > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager: Block > blk_-9223372035600425840_372987973 numExpected=9, numLive=5 > 2019-10-22 15:58:51,514 INFO BlockStateChange: Block: > blk_-9223372035600425840_372987973, Expected Replicas: 9, live replicas: 5, > corrupt replicas: 0, decommissioned replicas: 0, decommissioning replicas: 4, > maintenance replicas: 0, live entering maintenance replicas: 0, excess > replicas: 0, Is Open File: false, Datanodes having this block: > 10.255.43.57:50010 10.255.53.12:50010 10.255.63.12:50010 10.255.62.39:50010 > 10.255.37.36:50010 10.255.33.15:50010 10.255.69.29:50010 10.255.51.13:50010 > 10.255.64.15:50010 , Current Datanode: 10.255.69.29:50010, Is current > datanode decommissioning: true, Is current datanode entering maintenance: > false > 2019-10-22 15:58:51,514 DEBUG > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeAdminManager: Node > 10.255.69.29:50010 still has 1 blocks to replicate before it is a candidate > to finish Decommission In 
Progress > {quote} > After digging the source code and cluster log, guess it happens as follow > steps. > # Storage strategy is RS-6-3-1024k. > # EC block b consists of b0, b1, b2, b3, b4, b5, b6, b7, b8, b0 is from > datanode dn0, b1 is from datanode dn1, ...etc > # At the beginning dn0 is in decommission progress, b0 is replicated > successfully, and dn0 is staill in decommission progress. > # Later b1, b2, b3 in decommission progress, and dn4 containing b4 is out of > service, so need to reconstruct, and create ErasureCodingWork to do it, in > the ErasureCodingWork, additionalReplRequired is 4 > # Because hasAllInternalBlocks is false, Will call > ErasureCodingWork#addTaskToDatanode -> > DatanodeDescriptor#addBlockToBeErasureCoded, and send > BlockECReconstructionInfo task to Datanode > # DataNode can not reconstruction the block because targets is 4, greater > than 3( parity number). > There is a problem as follow, from BlockManager.java#scheduleReconstruction > {code} > // should reconstruct all the internal blocks before scheduling > // replication task for decommissioning node(s). > if (additionalReplRequired - numReplicas.decommissioning() - > numReplicas.liveEnteringMaintenanceReplicas() > 0) { > additionalReplRequired = additionalReplRequired - > numReplicas.decommissioning() - > numReplicas.liveEnteringMaintenanceReplicas(); > } > {code} > Should reconstruction firstly and then replicate for decommissioning. Because > numReplicas.decommissioning() is 4, and additionalReplRequired is 4, that's > wrong, > numReplicas.decommissioning() should be 3, it should exclude live replica. > If so, additionalReplRequired will be 1, reconstruction will schedule as > expected. After that, decommission goes on. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org