[ 
https://issues.apache.org/jira/browse/HDFS-9287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated HDFS-9287:
-----------------------------
    Fix Version/s:     (was: 2.8.0)

> Block placement completely fails if too many nodes are decommissioning
> ----------------------------------------------------------------------
>
>                 Key: HDFS-9287
>                 URL: https://issues.apache.org/jira/browse/HDFS-9287
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.6.0
>            Reporter: Daryn Sharp
>            Assignee: Kuhu Shukla
>            Priority: Critical
>
> The DatanodeManager coordinates with the HeartbeatManager to update 
> HeartbeatManager.Stats, which tracks capacity and load.  This is crucial for 
> block placement to consider space and load.  It's completely broken for 
> decommissioning nodes.
> The heartbeat manager subtracts the prior values before it adds the new 
> values.  During registration of a decommissioning node, it subtracts before 
> seeding the initial values.  This decrements nodesInService, then the state 
> flips to decommissioning, and the subsequent add does not increment 
> nodesInService (which is correct).  There are other math bugs (double 
> adding) that accidentally work out due to 0 values.
> The result is that every decommissioning node decrements the node count used 
> for block placement.  When enough nodes are decommissioning, the replication 
> monitor silently stops working, with no logging: it searches all nodes and 
> just gives up.  Eventually, all block allocation also fails completely.  No 
> files can be created.  No jobs can be submitted.
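
The subtract-before-seed accounting described above can be sketched as follows. This is a hypothetical, heavily simplified model (the Stats and Node classes here are stand-ins, not the real HeartbeatManager.Stats or DatanodeDescriptor APIs), intended only to show how the nodesInService counter can go negative during registration of a decommissioning node:

```java
// Simplified illustration of the reported bug; names and logic are
// assumptions, not the actual HDFS implementation.
class Node {
    boolean decommissioning = false;
    boolean isInService() { return !decommissioning; }
}

class Stats {
    int nodesInService = 0;

    // The heartbeat manager subtracts a node's prior contribution...
    void subtract(Node node) {
        if (node.isInService()) {
            nodesInService--;
        }
    }

    // ...then adds its new contribution.
    void add(Node node) {
        if (node.isInService()) {
            nodesInService++;
        }
    }
}

public class DecommStatsDemo {
    public static void main(String[] args) {
        Stats stats = new Stats();
        Node node = new Node();

        // Registration of a node slated for decommission: subtract() runs
        // before the initial values are seeded, so the node still looks
        // in-service and nodesInService is decremented...
        stats.subtract(node);
        // ...then the state flips to decommissioning...
        node.decommissioning = true;
        // ...so add() correctly does not increment.
        stats.add(node);

        // Net effect: a node that never contributed +1 has subtracted 1.
        System.out.println(stats.nodesInService); // prints -1
    }
}
```

Under this model, each such registration pushes the count one below reality, so enough decommissioning nodes drive the effective node count used by block placement to zero.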



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
