The problem is, what do you define as a failure? If the disk is failing, writes to the filesystem will fail - so how does Hadoop differentiate between a permissions problem and a physical disk failure? Both just return errors.
And yeah, the idea of stopping the datanode, removing the affected mount from hdfs-site.xml, and restarting has been discussed. The problem is, when that disk gets replaced and re-added, I end up with horrible internal balance issues - which is exactly the problem I have now :(

-j

On Jan 3, 2011, at 9:07 AM, Eli Collins wrote:

> Hey Jonathan,
>
> There's an option (dfs.datanode.failed.volumes.tolerated, introduced
> in HDFS-1161) that allows you to specify the number of volumes that
> are allowed to fail before a datanode stops offering service.
>
> There's an operational issue that still needs to be addressed
> (HDFS-1158) that you should be aware of - the DN will still not start
> if any of the volumes have failed, so to restart the DN you'll need
> to either unconfigure the failed volumes or fix them. I'd like to
> make DN startup respect the config value so it tolerates failed
> volumes on startup as well.
>
> Thanks,
> Eli
>
> On Sun, Jan 2, 2011 at 7:20 PM, Jonathan Disher <jdis...@parad.net> wrote:
>> I see that there was a thread on this in December, but I can't retrieve it
>> to reply properly, oh well.
>>
>> So, I have a 30 node cluster (plus separate namenode, jobtracker, etc).
>> Each is a 12 disk machine - two mirrored 250GB OS disks, ten 1TB data disks
>> in JBOD. The original system config was six 1TB data disks - we added the
>> last four disks months later. I'm sure you can all guess: we have some
>> interesting internal usage balancing issues on most of the nodes. To date,
>> when individual disks get critically low on space (earlier this week I had
>> a node with six disks around 97% full and four around 70%), we've been
>> pulling them from the cluster, formatting the data disks, and sticking them
>> back in (with a rebalance running to keep the cluster in some semblance of
>> order).
>>
>> Obviously, if there is a better way to do this, I'd love to see it. I see
>> that there are recommendations of killing the DataNode process and manually
>> moving files, but my concern is that the DataNode process will spend an
>> enormous amount of time tracking down these moves (currently around 820,000
>> blocks/node). And it's not necessarily easy to automate, so there's the
>> danger of nuking blocks and making the problems worse. Are there
>> alternatives to manual moves (or more automated approaches)? Or does my
>> brute-force rebalance have the best chance of success, albeit slowly?
>>
>> We are also building a new cluster - starting around 1.2PB raw, eventually
>> growing to around 5PB, for near-line storage of data. Our storage nodes
>> will probably be 4U systems with 72 data disks each (yeah, good times). The
>> problem with this becomes obvious - with the way Hadoop works today, if a
>> disk fails, the datanode process chokes and dies when it tries to write to
>> it. We've been told repeatedly that Hadoop doesn't perform well when it
>> operates on RAID arrays, but, to scale effectively, we're going to have to
>> do just that - three 24 disk controllers in RAID-6 mode. How bad is this
>> going to be? JBOD just doesn't scale beyond a couple of disks per machine;
>> the failure rate will knock machines out of the cluster too often (and at
>> 60TB per node, rebalancing will take forever, even if I let it saturate
>> gigabit).
>>
>> I appreciate opinions and suggestions. Thanks!
>>
>> -j
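[Editor's note: for reference, Eli's suggestion maps onto a short hdfs-site.xml sketch. This is only an illustration under assumptions: the property name dfs.datanode.failed.volumes.tolerated comes from HDFS-1161 as quoted above, the mount points listed are placeholders, and dfs.data.dir is the property name used by Hadoop releases of that era (later renamed dfs.datanode.data.dir).]

    <!-- hdfs-site.xml (sketch, not a definitive config) -->
    <property>
      <!-- Data directories; these mount points are hypothetical. -->
      <name>dfs.data.dir</name>
      <value>/data/1/dfs/dn,/data/2/dfs/dn,/data/3/dfs/dn</value>
    </property>
    <property>
      <!-- Number of volumes allowed to fail before the DN stops offering
           service (HDFS-1161). The default is 0, so any single disk
           failure takes the datanode down. -->
      <name>dfs.datanode.failed.volumes.tolerated</name>
      <value>1</value>
    </property>

[With a value of 1 the DN keeps serving after a single volume failure, though, per HDFS-1158 as Eli notes, restarting with the bad volume still configured will fail until the entry is removed or the disk is repaired.]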