[jira] [Commented] (HDFS-15945) DataNodes with zero capacity and zero blocks should be decommissioned immediately

Takanobu Asanuma (Jira) Sun, 16 May 2021 18:11:07 -0700


    [ https://issues.apache.org/jira/browse/HDFS-15945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17345812#comment-17345812 ]


Takanobu Asanuma commented on HDFS-15945:
-----------------------------------------

We found the root cause of this problem. hadoop-3.3.0 has a bug where a 
DataNode doesn't shut down even when the number of failed volumes is greater 
than dfs.datanode.failed.volumes.tolerated. As a result, the capacity of a 
DataNode can become zero. The bug was recently fixed by HDFS-15963. With that 
fix, the capacity of a DataNode can't be 0 because the DataNode shuts down 
before its capacity reaches 0. So this jira is no longer a problem.
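
The shutdown condition described above can be sketched roughly as follows. 
This is an illustration only, not the DataNode code or the HDFS-15963 patch: 
the class and method names are hypothetical, and it handles only non-negative 
values of dfs.datanode.failed.volumes.tolerated.

```java
/** Illustrative sketch only; names are hypothetical, not Hadoop's DataNode code. */
final class VolumeFailureCheck {

    /** True when more volumes have failed than the configured tolerance allows. */
    static boolean shouldShutDown(int failedVolumes, int toleratedFailures) {
        return failedVolumes > toleratedFailures;
    }

    /** Sums the capacity of the volumes that have not failed. */
    static long remainingCapacity(long[] volumeCapacities, boolean[] failed) {
        long total = 0;
        for (int i = 0; i < volumeCapacities.length; i++) {
            if (!failed[i]) {
                total += volumeCapacities[i];
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long[] capacities = {100L, 200L};  // two volumes (arbitrary units)
        boolean[] failed = {true, true};   // both volumes have failed
        int tolerated = 1;                 // assumed value of the tolerated setting
        int failedCount = 2;

        // With every volume failed, the usable capacity is zero, and the
        // failure count exceeds the tolerance, so the node should stop.
        System.out.println("remaining capacity = " + remainingCapacity(capacities, failed));
        System.out.println("should shut down   = " + shouldShutDown(failedCount, tolerated));
    }
}
```

If the check fires as intended, a DataNode never keeps running long enough 
to report zero capacity, which is why the situation in this jira no longer 
arises.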

> DataNodes with zero capacity and zero blocks should be decommissioned 
> immediately
> ---------------------------------------------------------------------------------
>
>                 Key: HDFS-15945
>                 URL: https://issues.apache.org/jira/browse/HDFS-15945
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Takanobu Asanuma
>            Assignee: Takanobu Asanuma
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> When there is a storage problem, for example, the capacity and block count 
> of a DataNode sometimes become zero.
> When we tried to decommission those DataNodes, the decommission did not 
> complete because the NameNode had not received their first block report.
> {noformat}
> INFO  blockmanagement.DatanodeAdminManager (DatanodeAdminManager.java:startDecommission(183)) - Starting decommission of 127.0.0.1:58343 [DISK]DS-a29de094-2b19-4834-8318-76cda3bd86bf:NORMAL:127.0.0.1:58343 with 0 blocks
> INFO  blockmanagement.BlockManager (BlockManager.java:isNodeHealthyForDecommissionOrMaintenance(4587)) - Node 127.0.0.1:58343 hasn't sent its first block report.
> INFO  blockmanagement.DatanodeAdminDefaultMonitor (DatanodeAdminDefaultMonitor.java:check(258)) - Node 127.0.0.1:58343 isn't healthy. It needs to replicate 0 more blocks. Decommission In Progress is still in progress.
> {noformat}
> To make matters worse, even if we stopped these DataNodes afterward, they 
> remained in a dead & decommissioning state until the NameNode was restarted.
> I think those DataNodes should be decommissioned immediately even if the 
> NameNode hasn't received their first block report.
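
The behavior proposed in the description can be sketched as a small 
predicate. This is a hypothetical illustration, not the BlockManager code or 
the patch attached to this jira: the class name, method name, and parameters 
are all assumptions.

```java
/** Illustrative sketch only; names are hypothetical, not Hadoop's BlockManager code. */
final class DecommissionCheck {

    /**
     * Decides whether the first-block-report requirement should still block
     * decommissioning of a node.
     */
    static boolean isReadyForDecommission(long capacityBytes,
                                          long blockCount,
                                          boolean firstBlockReportReceived) {
        // Proposed shortcut: a node with no capacity and no blocks has
        // nothing to replicate, so there is no reason to wait for a report.
        if (capacityBytes == 0 && blockCount == 0) {
            return true;
        }
        // Otherwise keep the existing requirement shown in the log above.
        return firstBlockReportReceived;
    }

    public static void main(String[] args) {
        // Empty node that never sent a block report: ready under the proposal.
        System.out.println(isReadyForDecommission(0L, 0L, false));
        // Node with capacity that has not reported yet: must still wait.
        System.out.println(isReadyForDecommission(1024L, 0L, false));
    }
}
```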



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
