Statistically bad distribution of blocks

Ted Dunning Wed, 19 Sep 2007 19:46:52 -0700

I just added 10 datanodes to a small cluster and turned up the replication
on many of the files to balance the storage out a bit.


I expected to see a uniform-ish distribution of blocks on the new nodes.
This is what I got instead:

Node         Last Contact  State       Size (GB)  Used (%)  Blocks
hadoop1                 0  In Service      42.68     72.36     585
hadoop10                1  In Service      42.68     50.30     354
hadoop11                2  In Service      42.68     48.02     340
hadoop2                 2  In Service      42.68     73.01     597
hadoop3                 2  In Service      42.68     72.68     614
hadoop6                 0  In Service      42.68     72.87     578
hadoop7                 0  In Service      42.68     72.38     600
hadoop8                 2  In Service      42.68     72.30     593
hadoop9                 2  In Service      42.68     72.70     637
metricsapp1             0  In Service     257.98     90.52    4134
metricsapp2             0  In Service     257.98     40.23    2338
metricsapp3             2  In Service     247.20     39.41    2889
metricsapp4             2  In Service     257.98     98.44    5096

The right-most column is what we are interested in here.  Note how hadoop10
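and hadoop11 have significantly fewer blocks than the others.  Statistically,
if blocks land on the same-sized nodes at random, we should expect the counts
to differ from the typical value of about 600 by less than roughly two
standard deviations:

  2 * sqrt(600) = 49 (approximately)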

Indeed, most of them do.  But those two do not.
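
For anyone who wants to check the arithmetic, here is a rough sketch in
Python. It assumes the block counts on the same-sized hadoop* nodes behave
roughly like Poisson draws with a mean near 600 (my assumption, not something
HDFS guarantees), and it uses the counts from the table above. The variable
names are just for illustration.

```python
import math

# Block counts for the 42.68 GB hadoop* nodes, copied from the table above.
blocks = {
    "hadoop1": 585,
    "hadoop10": 354,
    "hadoop11": 340,
    "hadoop2": 597,
    "hadoop3": 614,
    "hadoop6": 578,
    "hadoop7": 600,
    "hadoop8": 593,
    "hadoop9": 637,
}

# Assumption: counts are roughly Poisson with mean ~600, so the standard
# deviation is about sqrt(600) and two standard deviations is about 49.
expected = 600
threshold = 2 * math.sqrt(expected)

# Flag every node whose count is more than two standard deviations away.
outliers = sorted(
    name for name, count in blocks.items()
    if abs(count - expected) > threshold
)

print(f"threshold: {threshold:.1f} blocks")
print("outliers:", outliers)
```

Under those assumptions, only hadoop10 and hadoop11 fall outside the band.
The metricsapp nodes are left out because they are a different size, so the
same expected count would not apply to them.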

Is there some hidden significance in the names of nodes?
