[ https://issues.apache.org/jira/browse/HDFS-17722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benoit Sigoure updated HDFS-17722:
----------------------------------
    Description: 
When decommissioning a DataNode in our cluster, we observed a situation where 
the active NameNode had marked the DataNode as decommissioned, but the standby 
kept it stuck in the decommissioning state indefinitely (we waited 8 hours) 
because a block was allegedly under-replicated (note: the target replication 
factor for this path is 2).  The standby NameNode kept logging the following in 
a loop:

{{2025-01-31 12:02:35,963 INFO BlockStateChange: Block: 
blk_1486338012_426727507, Expected Replicas: 2, live replicas: 1, corrupt 
replicas: 0, decommissioned replicas: 0, decommissioning replicas: 1, 
maintenance replicas: 0, live entering maintenance replicas: 0, replicas on 
stale nodes: 0, readonly replicas: 0, excess replicas: 1, Is Open File: false, 
Datanodes having this block: 10.128.89.32:9866 10.128.118.216:9866 
10.128.49.6:9866 , Current Datanode: 10.128.118.216:9866, Is current datanode 
decommissioning: true, Is current datanode entering maintenance: false}}
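
For anyone trying to reproduce this, a rough way to compare each NameNode's view 
of the decommission is to query them directly. The service IDs, hostname and port 
below are placeholders, not our actual cluster layout, so this is only a sketch:
{code:bash}
# Which NameNode is active vs. standby (nn1/nn2 are hypothetical HA service IDs).
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2

# List DataNodes a given NameNode still considers decommissioning.
# A standby normally rejects client reads, so querying it directly may need
# stale reads to be allowed; shown here only as a sketch.
hdfs dfsadmin -fs hdfs://namenode-0.hadoop:8020 -report -decommissioning
{code}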

The fsck report for this block from the active NameNode showed the following:
{code:java}
Block Id: blk_1486338012
Block belongs to: /path/to/file
No. of Expected Replica: 2
No. of live Replica: 2
No. of excess Replica: 0
No. of stale Replica: 0
No. of decommissioned Replica: 1
No. of decommissioning Replica: 0
No. of corrupted Replica: 0
Block replica on datanode/rack: datanode-v3-25-hadoop.hadoop/default-rack is 
HEALTHY
Block replica on datanode/rack: datanode-v3-39-hadoop.hadoop/default-rack is 
DECOMMISSIONED
Block replica on datanode/rack: datanode-v3-26-hadoop.hadoop/default-rack is 
HEALTHY
{code}
The standby, however, reported:
{code:java}
Block Id: blk_1486338012
Block belongs to: /path/to/file
No. of Expected Replica: 2
No. of live Replica: 1
No. of excess Replica: 1
No. of stale Replica: 0
No. of decommissioned Replica: 0
No. of decommissioning Replica: 1
No. of corrupted Replica: 0
Block replica on datanode/rack: datanode-v3-25-hadoop.hadoop/default-rack is 
HEALTHY
Block replica on datanode/rack: datanode-v3-39-hadoop.hadoop/default-rack is 
DECOMMISSIONING
Block replica on datanode/rack: datanode-v3-26-hadoop.hadoop/default-rack is 
HEALTHY
{code}
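The per-block reports above are in the format printed by fsck's -blockId option, 
presumably obtained with something along these lines (the path argument here is 
a placeholder; the block ID is the one from the log above):
{code:bash}
# Per-block replica details for the block the standby keeps complaining about.
hdfs fsck / -blockId blk_1486338012
{code}
The affected file itself is listed with a replication factor of 2: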
{code:java}
hadoop@namenode-0:/$ hdfs dfs -ls /path/to/file
-rw-r--r-- 2 hbase supergroup 32453388896 2025-01-02 16:15 /path/to/file
{code}
After restarting the standby NameNode, the problem disappeared and the DataNode 
in question transitioned to the decommissioned state as expected.
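
For anyone hitting the same state, the workaround that worked for us was simply 
restarting the standby NameNode, e.g. with the stock Hadoop 3.x daemon commands 
on the standby host (deployments managed by systemd or an orchestrator will 
differ):
{code:bash}
# Run on the standby NameNode host; assumes the standard Hadoop 3.x launcher.
hdfs --daemon stop namenode
hdfs --daemon start namenode
{code}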

Credits for the bug report go to Tomas Baltrunas at Arista.



> DataNode stuck in decommissioning on standby NameNode
> -----------------------------------------------------
>
>                 Key: HDFS-17722
>                 URL: https://issues.apache.org/jira/browse/HDFS-17722
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 3.3.6
>            Reporter: Benoit Sigoure
>            Priority: Minor
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
