I see that there was a thread on this in December, but I can't retrieve it to 
reply properly, oh well.

So, I have a 30 node cluster (plus separate namenode, jobtracker, etc).  Each 
is a 12 disk machine - two mirrored 250GB OS disks, ten 1TB data disks in JBOD. 
 Original system config was six 1TB data disks - we added the last four disks 
months later.  I'm sure you can all guess, we have some interesting internal 
usage balancing issues on most of the nodes.  To date, when individual disks 
get critically low on space (earlier this week I had a node with six disks 
around 97% full, four around 70%), we've been pulling them from the cluster, 
formatting the data disks, and sticking them back in (with a rebalance running 
to keep the cluster in some semblance of order).

Obviously if there was a better way to do this, I'd love to see it.  I see that 
there are recommendations of killing the DataNode process and manually moving 
files, but my concern is that the DataNode process will spend an enormous 
amount of time tracking down these moves (currently around 820,000 
blocks/node).  And it's not necessarily easy to automate, so there's the danger 
of nuking blocks, and making the problems worse.  Are there alternatives to 
manual moves (or more automated ways that exist)?  Or has my brute-force 
rebalance got the best chance of success, albeit slowly?

We are also building a new cluster - starting around 1.2PB raw, eventually 
growing to around 5PB, for near-line storage of data.  Our storage nodes will 
probably be 4U systems with 72 data disks each (yeah, good times).  The problem 
with this becomes obvious - with the way Hadoop works today, if a disk fails, 
the datanode process chokes and dies when it tries to write to it.  We've been 
told repeatedly that Hadoop doesn't perform well when it operates on RAID 
arrays, but, to scale efffectively, we're going to have to do just that - three 
24 disk controllers in RAID-6 mode.  How bad is this going to be?  JBOD just 
doesn't scale beyond a couple disks per machine, the failure rate will knock 
machines out of the cluster too often (and at 60TB per node, rebalancing will 
take forever, even if I let it saturate gigabit).

I appreciate opinions and suggestions.  Thanks!

-j

Reply via email to