Running the CDH4.2 version of HDFS, I may have found a bug in the dfs.namenode.replication.considerLoad feature. I'd like to ask here before filing a JIRA (a quick search turned up nothing).
I have an HBase cluster in which I recently replaced 70 smaller servers with 22 larger ones. I added the 22 new servers to the cluster, moved all of the HBase regions onto them, ran a major compaction to re-write the data locally, then used HDFS decommissioning to decommission the 70 smaller servers. This all worked well, and HBase was happy.

However, after the decommission finished, I tried to write a file to HDFS from a node that does NOT have a DataNode running on it (the HMaster). These writes failed because all 92 servers were being excluded. Logs: https://gist.github.com/bbeaudreault/49c8aa4bb231de54e9c1

Reading through the code, I found that BlockPlacementPolicyDefault calculates the average load of the cluster as TotalClusterLoad / numNodes. However, numNodes includes decommissioned nodes, which have 0 load, so the average load is artificially low. Example:

TotalLoad = 250
numNodes = 92
decommissionedNodes = 70
avgLoad = 250 / 92 = 2.71
trueAvgLoad = 250 / (92 - 70) = 11.36

Because of this math, all 22 of our remaining nodes were considered "overloaded", since each was carrying more than 2x 2.71. Combined with the decommissioned nodes already being excluded, that left every server excluded. (Looking at the logs on my regionservers later, I also saw that a number of writes could not reach their required replication factor, though they did not fail this spectacularly.)

Is this a bug or expected behavior?
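To make the arithmetic concrete, here is a minimal, self-contained sketch using the numbers above. This is illustrative Java I wrote for this post, not the actual Hadoop code; the 2x cutoff is the considerLoad threshold I observed in my logs.

    // Illustrative only: shows why including decommissioned nodes in the
    // denominator makes every live node look "overloaded".
    public class AvgLoadExample {
        public static void main(String[] args) {
            double totalLoad = 250;     // total xceiver load reported for the cluster
            int numNodes = 92;          // includes decommissioned nodes
            int decommissioned = 70;    // decommissioned nodes contribute 0 load

            // What the placement policy computes today (decommissioned nodes included):
            double avgLoad = totalLoad / numNodes;                        // ~2.71

            // What it arguably should compute (live nodes only):
            double trueAvgLoad = totalLoad / (numNodes - decommissioned); // ~11.36

            // With considerLoad on, a node is rejected when its load exceeds
            // 2 * avgLoad. Each of the 22 live nodes carries roughly 11 load,
            // which is well above 2 * 2.71 = 5.43, so all of them get excluded.
            System.out.printf("avgLoad=%.2f trueAvgLoad=%.2f threshold=%.2f%n",
                avgLoad, trueAvgLoad, 2 * avgLoad);
        }
    }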