On Mon, Jan 3, 2011 at 11:55 AM, Jonathan Disher <jdis...@parad.net> wrote:
> The problem is, what do you define as a failure?  If the disk is failing,
> writes will fail to the filesystem - how does Hadoop differentiate between
> permissions and physical disk failure?  They both return error.
>
Anything that prevents the volume (mount) from being read or written. Any
failure to write to the volume is considered a failure to use the volume.
Since HDFS doesn't support read-only volumes (e.g. it can't handle a mount
that can be read but not written), these all count as failures and will
cause the volume to be taken offline.

> And yeah, the idea of stopping the datanode, removing the affected mount from
> hdfs-site.xml, and restarting has been discussed.  The problem is, when that
> disk gets replaced and re-added, I then have horrible internal balance
> issues.  Thus causing the problem I have now :(

What's the particular issue?  Having an unbalanced set of local disks
should at worst be a performance problem.  HDFS doesn't write blocks to
full volumes, it will just start using the other disks.

Thanks,
Eli

> -j
>
> On Jan 3, 2011, at 9:07 AM, Eli Collins wrote:
>
>> Hey Jonathan,
>>
>> There's an option (dfs.datanode.failed.volumes.tolerated, introduced
>> in HDFS-1161) that allows you to specify the number of volumes that
>> are allowed to fail before a datanode stops offering service.
>>
>> There's an operational issue that still needs to be addressed
>> (HDFS-1158) that you should be aware of - the DN will still not start
>> if any of the volumes have failed, so to restart the DN you'll need
>> to either unconfigure the failed volumes or fix them.  I'd like to
>> make DN startup respect the config value so it tolerates failed
>> volumes on startup as well.
>>
>> Thanks,
>> Eli
>>
>> On Sun, Jan 2, 2011 at 7:20 PM, Jonathan Disher <jdis...@parad.net> wrote:
>>> I see that there was a thread on this in December, but I can't retrieve it
>>> to reply properly, oh well.
>>>
>>> So, I have a 30 node cluster (plus separate namenode, jobtracker, etc).
>>> Each is a 12 disk machine - two mirrored 250GB OS disks, ten 1TB data disks
>>> in JBOD.  The original system config was six 1TB data disks - we added the
>>> last four disks months later.  As I'm sure you can all guess, we have some
>>> interesting internal usage balancing issues on most of the nodes.  To date,
>>> when individual disks get critically low on space (earlier this week I had
>>> a node with six disks around 97% full, four around 70%), we've been pulling
>>> them from the cluster, formatting the data disks, and sticking them back in
>>> (with a rebalance running to keep the cluster in some semblance of order).
>>>
>>> Obviously if there is a better way to do this, I'd love to see it.  I see
>>> that there are recommendations of killing the DataNode process and manually
>>> moving files, but my concern is that the DataNode process will spend an
>>> enormous amount of time tracking down these moves (currently around 820,000
>>> blocks/node).  And it's not necessarily easy to automate, so there's the
>>> danger of nuking blocks and making the problems worse.  Are there
>>> alternatives to manual moves (or more automated ways of doing this)?  Or
>>> does my brute-force rebalance have the best chance of success, albeit
>>> slowly?
>>>
>>> We are also building a new cluster - starting around 1.2PB raw, eventually
>>> growing to around 5PB, for near-line storage of data.  Our storage nodes
>>> will probably be 4U systems with 72 data disks each (yeah, good times).
>>> The problem with this becomes obvious - with the way Hadoop works today, if
>>> a disk fails, the datanode process chokes and dies when it tries to write
>>> to it.
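
To make the option mentioned above concrete, here's a rough sketch of the
relevant hdfs-site.xml entries - the property names (dfs.data.dir,
dfs.datanode.failed.volumes.tolerated) are the 0.20/0.21-era ones and the
mount paths are made up for illustration, so adjust both to your release
and layout:

  <!-- one entry per JBOD data mount -->
  <property>
    <name>dfs.data.dir</name>
    <value>/data/1/dfs/dn,/data/2/dfs/dn,/data/3/dfs/dn</value>
  </property>

  <!-- keep the DN serving as long as at most one volume has failed -->
  <property>
    <name>dfs.datanode.failed.volumes.tolerated</name>
    <value>1</value>
  </property>

With the restart workaround discussed earlier, pulling a dead disk means
dropping its path from dfs.data.dir before restarting the DN, then adding
it back once the disk has been replaced.
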
>>> We've been told repeatedly that Hadoop doesn't perform well when it
>>> operates on RAID arrays, but to scale effectively we're going to have to
>>> do just that - three 24-disk controllers in RAID-6 mode.  How bad is this
>>> going to be?  JBOD just doesn't scale beyond a couple of disks per machine;
>>> the failure rate will knock machines out of the cluster too often (and at
>>> 60TB per node, rebalancing will take forever, even if I let it saturate
>>> gigabit).
>>>
>>> I appreciate opinions and suggestions.  Thanks!
>>>
>>> -j
>
>
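
One footnote on the rebalancing side: the brute-force rebalance is just the
stock HDFS balancer, and it's capped by a per-datanode bandwidth setting
that defaults to roughly 1 MB/s - far too low to ever saturate gigabit.  A
rough sketch, using the 0.20-era property name (check it against your
release):

  <property>
    <name>dfs.balance.bandwidthPerSec</name>
    <value>104857600</value>  <!-- ~100 MB/s, close to gigabit line rate -->
  </property>

Then run the balancer until every datanode is within 5% of the cluster-wide
average utilization:

  hadoop balancer -threshold 5

Note the balancer only evens things out across datanodes; it won't fix the
skew across the ten disks inside a single node, which is why the
pull/format/re-add dance comes up in the first place.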