[ https://issues.apache.org/jira/browse/HDFS-9287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Kuhu Shukla resolved HDFS-9287.
-------------------------------
    Resolution: Duplicate
    Fix Version/s: 2.8.0

HDFS-7725 fixes this issue. Verified through a unit test.

> Block placement completely fails if too many nodes are decommissioning
> ----------------------------------------------------------------------
>
>                 Key: HDFS-9287
>                 URL: https://issues.apache.org/jira/browse/HDFS-9287
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.6.0
>            Reporter: Daryn Sharp
>            Assignee: Kuhu Shukla
>            Priority: Critical
>             Fix For: 2.8.0
>
>
> The DatanodeManager coordinates with the HeartbeatManager to update
> HeartbeatManager.Stats, which tracks capacity and load. This is crucial
> for block placement, which must consider both space and load. It is
> completely broken for decommissioning nodes.
> The heartbeat manager subtracts a node's prior values before it adds the
> new values. During registration of a decommissioning node, it subtracts
> before the initial values have ever been seeded. This decrements
> nodesInService; the state then flips to decommissioning, so the add does
> not increment nodesInService (which is correct for a decommissioning
> node). There are other math bugs (double adding) that only work by
> accident because the values are still 0. (A minimal sketch of this
> pattern follows below.)
> The result is that every decommissioning node decrements the node count
> used for block placement. When enough nodes are decommissioning, the
> replication monitor silently stops working, with no logging: it scans all
> nodes and simply gives up. Eventually, all block allocation fails as
> well. No files can be created. No jobs can be submitted.
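Below is a minimal, self-contained sketch of the broken accounting described above. It is not the actual Hadoop source: Stats, Node, subtract, add, and registerDecommissioning are simplified stand-ins named after the entities in the report, and the real HeartbeatManager.Stats tracks far more than a node count.

{code:java}
// Illustrative sketch only -- simplified stand-ins, not HeartbeatManager itself.
public class DecommStatsSketch {

    static class Node {
        boolean decommissioning;
    }

    // Stand-in for HeartbeatManager.Stats: nodesInService feeds block placement.
    static class Stats {
        int nodesInService;

        // Remove a node's prior contribution.
        void subtract(Node n) {
            if (!n.decommissioning) {
                nodesInService--;
            }
        }

        // Add a node's new contribution.
        void add(Node n) {
            if (!n.decommissioning) {
                nodesInService++;
            }
        }
    }

    // Registration path for a node that is being decommissioned.
    static void registerDecommissioning(Stats stats, Node n) {
        stats.subtract(n);        // BUG: subtracts before any values were seeded;
                                  // the node is not yet flagged, so this fires
        n.decommissioning = true; // state flips to decommissioning
        stats.add(n);             // correctly skips the increment for a decomm node
        // Net effect: nodesInService drops by one for a node that was never counted.
    }

    public static void main(String[] args) {
        Stats stats = new Stats();
        stats.nodesInService = 100;           // 100 nodes already in service

        for (int i = 0; i < 10; i++) {
            registerDecommissioning(stats, new Node());
        }
        // Prints 90, not 100: each decommissioning registration leaked a decrement,
        // shrinking the node count that block placement relies on.
        System.out.println("nodesInService = " + stats.nodesInService);
    }
}
{code}

With enough such registrations the counter drifts low enough that block placement believes no viable targets remain, matching the silent failure described above; per the resolution, HDFS-7725 fixes the underlying accounting.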