[ https://issues.apache.org/jira/browse/HDFS-9287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kuhu Shukla resolved HDFS-9287.
-------------------------------
       Resolution: Duplicate
    Fix Version/s: 2.8.0

HDFS-7725 fixes this issue. Verified through a unit test.

> Block placement completely fails if too many nodes are decommissioning
> ----------------------------------------------------------------------
>
>                 Key: HDFS-9287
>                 URL: https://issues.apache.org/jira/browse/HDFS-9287
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.6.0
>            Reporter: Daryn Sharp
>            Assignee: Kuhu Shukla
>            Priority: Critical
>             Fix For: 2.8.0
>
>
> The DatanodeManager coordinates with the HeartbeatManager to update 
> HeartbeatManager.Stats, which tracks capacity and load. This is crucial 
> for block placement to consider space and load. It's completely broken 
> for decommissioning nodes.
> The heartbeat manager subtracts a node's prior values before it adds the 
> new values. During registration of a decommissioning node, it subtracts 
> before the initial values have ever been seeded. This decrements 
> nodesInService; the state then flips to decommissioning, so add will not 
> increment nodesInService back (correct for a decommissioning node, but 
> the earlier decrement should never have happened). There are other math 
> bugs (double adding) that happen to be harmless only because the values 
> are still 0. (A minimal sketch of this sequence follows the quoted 
> description.)
> The result is that every decommissioning node decrements the node count 
> used for block placement. When enough nodes are decommissioning, the 
> replication monitor silently stops working, with no logging: it searches 
> all nodes and just gives up. Eventually, all block allocation also 
> completely fails. No files can be created; no jobs can be submitted.
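
To make the failure concrete, here is a minimal sketch of the subtract-before-add sequence described above. This is not the actual HDFS source: Node, Stats, subtract, add, and the decommissioning flag are simplified stand-ins modeled loosely on DatanodeDescriptor and HeartbeatManager.Stats.

    // Simplified stand-ins for the real HDFS classes; illustrative only.
    class Node {
        boolean decommissioning = false;
        boolean isInService() { return !decommissioning; }
    }

    class Stats {
        int nodesInService = 0;

        // Remove a node's previous contribution. Per the description above,
        // this also runs on registration, before any values were ever added.
        void subtract(Node node) {
            if (node.isInService()) {
                nodesInService--;
            }
        }

        // Add the node's current contribution. Correctly skipped once the
        // node is decommissioning.
        void add(Node node) {
            if (node.isInService()) {
                nodesInService++;
            }
        }
    }

    class Demo {
        public static void main(String[] args) {
            Stats stats = new Stats();
            Node node = new Node();

            // Registration of a node that is mid-decommission:
            stats.subtract(node);         // node still reads in-service: -1
            node.decommissioning = true;  // state flips to decommissioning
            stats.add(node);              // skipped: node is not in service

            // Net effect: the count is one below reality for this node.
            System.out.println(stats.nodesInService);  // prints -1
        }
    }

Each such registration repeats this sequence, so nodesInService drifts one below the true value per decommissioning node until block placement can no longer find enough usable targets.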



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
