Running the CDH4.2 version of HDFS, I may have found a bug in the dfs.namenode.replication.considerLoad feature. I'd like to ask here before filing a JIRA (a quick search turned up nothing).
I have an HBase cluster in which I recently replaced 70 smaller servers with 22 larger ones. I added the 22 new servers to the cluster, moved all of the HBase regions onto them, ran a major compaction to re-write the data locally, then used HDFS decommissioning to decommission the 70 smaller servers. This all worked well, and HBase was happy.

However, after the decommission finished, I tried to write a file to HDFS from a node that does NOT have a DataNode running on it (the HMaster). These writes failed because all 92 servers were being excluded. Logs: https://gist.github.com/bbeaudreault/49c8aa4bb231de54e9c1

Reading through the code, I found that BlockPlacementPolicyDefault calculates the average load of the cluster as TotalClusterLoad / numNodes. However, numNodes includes decommissioned nodes, which have 0 load, so the average load is artificially low. Example:

TotalLoad = 250
numNodes = 92
decommissionedNodes = 70
avgLoad = 250 / 92 = 2.71
trueAvgLoad = 250 / (92 - 70) = 11.36

Because of this math, all 22 of our remaining nodes were considered "overloaded", since each was carrying more than 2x 2.71. Combined with the decommissioned nodes already being excluded, that left every server excluded. (Looking at the logs on my regionservers later, I also saw that a number of writes could not reach their required replication factor, though they did not fail this spectacularly.)

Is this a bug or expected behavior?
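To make the arithmetic concrete, here is a minimal, self-contained sketch using the numbers above. This is illustrative Java I wrote for this post, not the actual Hadoop code; the 2x cutoff is the considerLoad threshold I observed in my logs.

    // Illustrative only: shows why including decommissioned nodes in the
    // denominator makes every live node look "overloaded".
    public class AvgLoadExample {
        public static void main(String[] args) {
            double totalLoad = 250;     // total xceiver load reported for the cluster
            int numNodes = 92;          // includes decommissioned nodes
            int decommissioned = 70;    // decommissioned nodes contribute 0 load

            // What the placement policy computes today (decommissioned nodes included):
            double avgLoad = totalLoad / numNodes;                        // ~2.71

            // What it arguably should compute (live nodes only):
            double trueAvgLoad = totalLoad / (numNodes - decommissioned); // ~11.36

            // With considerLoad on, a node is rejected when its load exceeds
            // 2 * avgLoad. Each of the 22 live nodes carries roughly 11 load,
            // which is well above 2 * 2.71 = 5.43, so all of them get excluded.
            System.out.printf("avgLoad=%.2f trueAvgLoad=%.2f threshold=%.2f%n",
                avgLoad, trueAvgLoad, 2 * avgLoad);
        }
    }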