[ https://issues.apache.org/jira/browse/HDFS-15945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17345812#comment-17345812 ]
Takanobu Asanuma commented on HDFS-15945:
-----------------------------------------
We found the root cause of this problem. hadoop-3.3.0 has a bug where a
DataNode doesn't shut down even if the number of failed volumes is greater
than dfs.datanode.failed.volumes.tolerated. As a result, the capacity of a
DataNode can become zero. The bug was recently fixed by HDFS-15963. After
HDFS-15963, the capacity of a DataNode can't be 0 because the DataNode shuts
down before its capacity reaches 0. So this jira is not a problem.
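
To illustrate the reasoning above, here is a minimal sketch of the intended
rule. It is not the actual DataNode code or the HDFS-15963 patch; the class
VolumeFailureSketch, its Volume type, and both method names are hypothetical,
and the assumption that failed volumes contribute nothing to the reported
capacity is a simplification made for the example.

```java
import java.util.List;

/** Hypothetical sketch; not taken from the Hadoop code base. */
final class VolumeFailureSketch {

    /** Stand-in for one storage volume of a DataNode. */
    static final class Volume {
        final long capacityBytes;
        final boolean failed;

        Volume(long capacityBytes, boolean failed) {
            this.capacityBytes = capacityBytes;
            this.failed = failed;
        }
    }

    /** Assumption: only healthy volumes count toward the reported capacity. */
    static long reportedCapacity(List<Volume> volumes) {
        long total = 0;
        for (Volume v : volumes) {
            if (!v.failed) {
                total += v.capacityBytes;
            }
        }
        return total;
    }

    /**
     * Intended rule: shut down once the number of failed volumes is greater
     * than the value of dfs.datanode.failed.volumes.tolerated.
     */
    static boolean shouldShutDown(List<Volume> volumes, int failedVolumesTolerated) {
        int failed = 0;
        for (Volume v : volumes) {
            if (v.failed) {
                failed++;
            }
        }
        return failed > failedVolumesTolerated;
    }

    public static void main(String[] args) {
        List<Volume> allFailed = List.of(new Volume(100L, true), new Volume(100L, true));
        // Every volume has failed, so the reported capacity is zero ...
        System.out.println(reportedCapacity(allFailed));
        // ... and with a tolerance of 1 the node should already have shut down.
        System.out.println(shouldShutDown(allFailed, 1));
    }
}
```

In this sketch, a node whose volumes have all failed always exceeds any
tolerance smaller than its volume count, so if the shutdown rule is enforced
it stops before a zero capacity could be observed.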
> DataNodes with zero capacity and zero blocks should be decommissioned
> immediately
> ---------------------------------------------------------------------------------
>
> Key: HDFS-15945
> URL: https://issues.apache.org/jira/browse/HDFS-15945
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Takanobu Asanuma
> Assignee: Takanobu Asanuma
> Priority: Major
> Labels: pull-request-available
> Time Spent: 3h 20m
> Remaining Estimate: 0h
>
> When there is a storage problem, for example, a DataNode's capacity and block
> count sometimes become zero.
> When we tried to decommission those DataNodes, the decommission did not
> complete because the NameNode had not received their first block report.
> {noformat}
> INFO blockmanagement.DatanodeAdminManager (DatanodeAdminManager.java:startDecommission(183)) - Starting decommission of 127.0.0.1:58343 [DISK]DS-a29de094-2b19-4834-8318-76cda3bd86bf:NORMAL:127.0.0.1:58343 with 0 blocks
> INFO blockmanagement.BlockManager (BlockManager.java:isNodeHealthyForDecommissionOrMaintenance(4587)) - Node 127.0.0.1:58343 hasn't sent its first block report.
> INFO blockmanagement.DatanodeAdminDefaultMonitor (DatanodeAdminDefaultMonitor.java:check(258)) - Node 127.0.0.1:58343 isn't healthy. It needs to replicate 0 more blocks. Decommission In Progress is still in progress.
> {noformat}
> To make matters worse, even if we stopped these DataNodes afterward, they
> remained in a dead and decommissioning state until the NameNode restarted.
> I think those DataNodes should be decommissioned immediately even if the
> NameNode hasn't received the first block report.
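
The behavior proposed in the description can be sketched as follows. This is
only an illustration of the idea, not the real BlockManager logic or any patch
attached to this issue; the class DecommissionCheckSketch, the NodeView type,
and the method canProceedWithDecommission are hypothetical names, and the real
health check considers more conditions than the ones shown here.

```java
/** Hypothetical sketch of the proposed check; not taken from the Hadoop code base. */
final class DecommissionCheckSketch {

    /** Minimal stand-in for the NameNode's view of one DataNode. */
    static final class NodeView {
        final long capacityBytes;
        final long blockCount;
        final boolean firstBlockReportReceived;

        NodeView(long capacityBytes, long blockCount, boolean firstBlockReportReceived) {
            this.capacityBytes = capacityBytes;
            this.blockCount = blockCount;
            this.firstBlockReportReceived = firstBlockReportReceived;
        }
    }

    /**
     * Proposed rule: a node with zero capacity and zero blocks has nothing to
     * replicate, so it does not need to wait for its first block report.
     */
    static boolean canProceedWithDecommission(NodeView node) {
        boolean emptyNode = node.capacityBytes == 0 && node.blockCount == 0;
        return node.firstBlockReportReceived || emptyNode;
    }

    public static void main(String[] args) {
        NodeView emptyNoReport = new NodeView(0L, 0L, false);
        NodeView normalNoReport = new NodeView(1024L, 0L, false);
        // The empty node may proceed; the node with capacity still has to report.
        System.out.println(canProceedWithDecommission(emptyNoReport));
        System.out.println(canProceedWithDecommission(normalNoReport));
    }
}
```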