[
https://issues.apache.org/jira/browse/HDFS-9287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kuhu Shukla reassigned HDFS-9287:
---------------------------------
Assignee: Kuhu Shukla
> Block placement completely fails if too many nodes are decommissioning
> ----------------------------------------------------------------------
>
> Key: HDFS-9287
> URL: https://issues.apache.org/jira/browse/HDFS-9287
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 2.6.0
> Reporter: Daryn Sharp
> Assignee: Kuhu Shukla
> Priority: Critical
>
> The DatanodeManager coordinates with the HeartbeatManager to update
> HeartbeatManager.Stats to track capacity and load. This is crucial for
> block placement to consider space and load. It's completely broken for
> decomm nodes.
> The heartbeat manager subtracts a node's prior values before it adds the
> new values. During registration of a decomm node, it subtracts before the
> initial values have been seeded. This decrements nodesInService; the state
> then flips to decomm, so the subsequent add will not increment
> nodesInService (which on its own is correct). There are other math bugs
> (double adding) that accidentally work out only because the values
> involved are 0.
> The result is that every decomm node decrements the node count used for
> block placement. When enough nodes are decomm, the replication monitor
> will silently stop working, with no logging: it searches all nodes and
> just gives up. Eventually, all block allocation will also completely fail.
> No files can be created; no jobs can be submitted.
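The accounting error described above can be sketched in a few lines. This is a hypothetical, simplified model of the subtract-then-add bookkeeping, not the actual HeartbeatManager code: the Stats and Node classes here are stand-ins, and only the ordering of subtract, state flip, and add mirrors the report.

```java
// Simplified model of the subtract-before-add accounting bug.
// A node registering while decommissioning is subtracted before its
// initial values were ever added, so the in-service count drifts down.
class Stats {
    int nodesInService = 0;

    void add(Node n) {
        if (n.inService) nodesInService++;   // decomm nodes are not counted
    }

    void subtract(Node n) {
        if (n.inService) nodesInService--;   // assumes a prior matching add
    }
}

class Node {
    boolean inService = true;                // true until decomm starts
}

public class DecommAccountingBug {
    public static void main(String[] args) {
        Stats stats = new Stats();
        Node n = new Node();

        // Buggy registration order for a node being decommissioned:
        stats.subtract(n);   // subtract BEFORE initial values are seeded
        n.inService = false; // state flips to decommissioning
        stats.add(n);        // correctly skips a decomm node

        // Net effect: the in-service count drops by one per decomm
        // registration, even though this node contributed nothing yet.
        System.out.println(stats.nodesInService); // prints -1, not 0
    }
}
```

Repeating this per decomm node drives the count used by block placement toward zero, which matches the silent replication-monitor and allocation failures described above.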
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)