[ https://issues.apache.org/jira/browse/HDFS-17722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Benoit Sigoure updated HDFS-17722:
----------------------------------
Description:

When decommissioning a DataNode in our cluster, we observed a situation where the active NameNode had marked the DataNode as decommissioned, but the standby had it stuck in the decommissioning state indefinitely (we waited 8h) because of a block that was allegedly under-replicated (note: the target replication factor for this path is 2).

The standby NameNode kept logging this in a loop:

{{2025-01-31 12:02:35,963 INFO BlockStateChange: Block: blk_1486338012_426727507, Expected Replicas: 2, live replicas: 1, corrupt replicas: 0, decommissioned replicas: 0, decommissioning replicas: 1, maintenance replicas: 0, live entering maintenance replicas: 0, replicas on stale nodes: 0, readonly replicas: 0, excess replicas: 1, Is Open File: false, Datanodes having this block: 10.128.89.32:9866 10.128.118.216:9866 10.128.49.6:9866 , Current Datanode: 10.128.118.216:9866, Is current datanode decommissioning: true, Is current datanode entering maintenance: false}}

In the fsck report for this block, the active NameNode reported the following:

{code:java}
Block Id: blk_1486338012
Block belongs to: /path/to/file
No. of Expected Replica: 2
No. of live Replica: 2
No. of excess Replica: 0
No. of stale Replica: 0
No. of decommissioned Replica: 1
No. of decommissioning Replica: 0
No. of corrupted Replica: 0
Block replica on datanode/rack: datanode-v3-25-hadoop.hadoop/default-rack is HEALTHY
Block replica on datanode/rack: datanode-v3-39-hadoop.hadoop/default-rack is DECOMMISSIONED
Block replica on datanode/rack: datanode-v3-26-hadoop.hadoop/default-rack is HEALTHY
{code}

Whereas the standby reported:

{code:java}
Block Id: blk_1486338012
Block belongs to: /path/to/file
No. of Expected Replica: 2
No. of live Replica: 1
No. of excess Replica: 1
No. of stale Replica: 0
No. of decommissioned Replica: 0
No. of decommissioning Replica: 1
No. of corrupted Replica: 0
Block replica on datanode/rack: datanode-v3-25-hadoop.hadoop/default-rack is HEALTHY
Block replica on datanode/rack: datanode-v3-39-hadoop.hadoop/default-rack is DECOMMISSIONING
Block replica on datanode/rack: datanode-v3-26-hadoop.hadoop/default-rack is HEALTHY
{code}

The file in question really does have a replication factor of 2 (second column of the listing):

{code:java}
hadoop@namenode-0:/$ hdfs dfs -ls /path/to/file
-rw-r--r--   2 hbase supergroup 32453388896 2025-01-02 16:15 /path/to/file
{code}

After restarting the standby NameNode the problem disappeared: the DataNode in question transitioned to the decommissioned state as expected.

Credits for the bug report go to Tomas Baltrunas at Arista.
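For context on why the standby's view keeps the node stuck: a DataNode only leaves the decommissioning state once every block it holds has enough replicas on live nodes. The sketch below is a deliberately simplified illustration of that per-block check, not the actual DatanodeAdminManager code; the class, enum, and method names are invented for this example, and since the fsck report does not show which of the two "HEALTHY" replicas the standby internally counts as excess, one is chosen arbitrarily here.

{code:java}
import java.util.List;

// Illustrative only -- NOT the real HDFS replica-state model, which also
// tracks corrupt, stale, maintenance, and read-only replicas.
enum ReplicaState { LIVE, EXCESS, DECOMMISSIONING, DECOMMISSIONED }

record Replica(String datanode, ReplicaState state) {}

class DecommissionCheck {

    // A block is "sufficiently replicated" for decommissioning purposes once
    // at least `expected` replicas are LIVE; excess, decommissioning, and
    // decommissioned replicas do not count toward the target.
    static boolean isSufficientlyReplicated(List<Replica> replicas, int expected) {
        long live = replicas.stream()
                .filter(r -> r.state() == ReplicaState.LIVE)
                .count();
        return live >= expected;
    }

    public static void main(String[] args) {
        int expected = 2; // replication factor of /path/to/file

        // Active NameNode's view of blk_1486338012: re-replication is done,
        // so two LIVE replicas remain alongside the DECOMMISSIONED one.
        List<Replica> activeView = List.of(
                new Replica("datanode-v3-25", ReplicaState.LIVE),
                new Replica("datanode-v3-39", ReplicaState.DECOMMISSIONED),
                new Replica("datanode-v3-26", ReplicaState.LIVE));

        // Standby's (stale) view: one replica still DECOMMISSIONING and one
        // counted as EXCESS, leaving a single LIVE replica.
        List<Replica> standbyView = List.of(
                new Replica("datanode-v3-25", ReplicaState.LIVE),
                new Replica("datanode-v3-39", ReplicaState.DECOMMISSIONING),
                new Replica("datanode-v3-26", ReplicaState.EXCESS));

        System.out.println(isSufficientlyReplicated(activeView, expected));  // true  -> decommission completes
        System.out.println(isSufficientlyReplicated(standbyView, expected)); // false -> node stuck, log loops
    }
}
{code}

(Per-block reports in the format quoted above can be obtained with {{hdfs fsck -blockId blk_1486338012}}.)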
> DataNode stuck in decommissioning on standby NameNode
> -----------------------------------------------------
>
>                 Key: HDFS-17722
>                 URL: https://issues.apache.org/jira/browse/HDFS-17722
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 3.3.6
>            Reporter: Benoit Sigoure
>            Priority: Minor
--
This message was sent by Atlassian Jira
(v8.20.10#820010)