I just added 10 datanodes to a small cluster and turned up the replication on many of the files to balance the storage out a bit.
I expected to see a uniform-ish distribution of blocks on the new nodes. This is what I got instead:

Node         Last Contact  State       Size (GB)  Used (%)  Blocks
hadoop1      0             In Service      42.68     72.36     585
hadoop10     1             In Service      42.68     50.30     354
hadoop11     2             In Service      42.68     48.02     340
hadoop2      2             In Service      42.68     73.01     597
hadoop3      2             In Service      42.68     72.68     614
hadoop6      0             In Service      42.68     72.87     578
hadoop7      0             In Service      42.68     72.38     600
hadoop8      2             In Service      42.68     72.30     593
hadoop9      2             In Service      42.68     72.70     637
metricsapp1  0             In Service     257.98     90.52    4134
metricsapp2  0             In Service     257.98     40.23    2338
metricsapp3  2             In Service     247.20     39.41    2889
metricsapp4  2             In Service     257.98     98.44    5096

The right-most column is what we are interested in here. Note how hadoop10 and hadoop11 have significantly fewer blocks than the other hadoop* nodes. Statistically, we should expect the counts on those nodes to vary by less than about 2 * sqrt(600), or roughly 50. Indeed, most of them do. But those two do not.

Is there some hidden significance in the names of nodes?
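For anyone who wants to check the arithmetic, here is a rough sketch of the test I have in mind. It assumes a nominal count of 600 blocks per small node and treats 2 * sqrt(600) as the tolerance; the outliers() helper and the variable names are just mine for illustration, and only the hadoop* nodes from the table are included.

```python
import math

# Block counts copied from the table above (the ~42 GB hadoop* nodes only).
blocks = {
    "hadoop1": 585, "hadoop10": 354, "hadoop11": 340,
    "hadoop2": 597, "hadoop3": 614, "hadoop6": 578,
    "hadoop7": 600, "hadoop8": 593, "hadoop9": 637,
}

# Assumption: about 600 blocks per node is the nominal value, and random
# placement gives a spread of roughly sqrt(600) around it.
expected = 600
tolerance = 2 * math.sqrt(expected)


def outliers(counts, expected, tolerance):
    """Return the names of nodes whose count is off by more than the tolerance."""
    return sorted(
        name for name, n in counts.items() if abs(n - expected) > tolerance
    )


print("tolerance: %.1f blocks" % tolerance)
print("outside tolerance:", outliers(blocks, expected, tolerance))
```

With these numbers, only hadoop10 and hadoop11 fall outside the band; every other hadoop* node is within about 40 blocks of 600.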
