That's what we've been doing. Again, the problem is that we still have to pull the datanode out of rotation, change the config, replace the disk, and put it back... even if I have spares on hand and finish this in a few minutes, I still have one empty disk and many tens of non-empty disks. Monitoring and identifying the failure isn't the problem; we have that down pat. I'm hoping for a better way to re-balance the disks in the node after a failure. I suspect the sad answer is that what I'm doing now is the best thing for it.
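(For what it's worth, here is a rough sketch of the manual-move approach mentioned further down in the thread, against a 0.20-era data directory layout. It assumes the DataNode is stopped first; the mount points and block IDs are purely illustrative, and blocks may also live under current/subdirNN directories.)

    # stop the datanode so nothing is writing while blocks are shuffled
    $HADOOP_HOME/bin/hadoop-daemon.sh stop datanode

    # move block files from the fullest volume to the emptiest one;
    # each blk_NNN file must travel with its blk_NNN_*.meta companion
    cd /data/5/dfs/data/current
    for b in blk_1073741901 blk_1073741944; do
      mv "$b" "$b"_*.meta /data/9/dfs/data/current/
    done

    # on restart the datanode rescans its volumes and reports the blocks
    # from their new locations
    $HADOOP_HOME/bin/hadoop-daemon.sh start datanode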
-j

On Jan 3, 2011, at 10:21 PM, Esteban Gutierrez Moguel wrote:

> Jonathan,
>
> Hadoop will throw an exception according to the kind of error: AccessControlException if it's permission related, or IOException for any other disk related task.
>
> A safer approach to handling physical failures would be monitoring syslog messages (Syslog4j, nagios, ganglia, etc.), and if you are lucky enough and the node doesn't hang after the disk failure, you could shut it down gracefully.
>
> esteban.
>
> On Mon, Jan 3, 2011 at 13:55, Jonathan Disher <jdis...@parad.net> wrote:
> The problem is, what do you define as a failure? If the disk is failing, writes to the filesystem will fail - how does Hadoop differentiate between a permissions problem and a physical disk failure? Both return errors.
>
> And yeah, the idea of stopping the datanode, removing the affected mount from hdfs-site.xml, and restarting has been discussed. The problem is, when that disk gets replaced and re-added, I end up with horrible internal balance issues - thus causing the problem I have now :(
>
> -j
>
> On Jan 3, 2011, at 9:07 AM, Eli Collins wrote:
>
> > Hey Jonathan,
> >
> > There's an option (dfs.datanode.failed.volumes.tolerated, introduced in HDFS-1161) that allows you to specify the number of volumes that are allowed to fail before a datanode stops offering service.
> >
> > There's an operational issue that still needs to be addressed (HDFS-1158) that you should be aware of - the DN will still not start if any of the volumes have failed, so to restart the DN you'll need to either unconfigure the failed volumes or fix them. I'd like to make DN startup respect the config value so it tolerates failed volumes on startup as well.
> >
> > Thanks,
> > Eli
> >
> > On Sun, Jan 2, 2011 at 7:20 PM, Jonathan Disher <jdis...@parad.net> wrote:
> >> I see that there was a thread on this in December, but I can't retrieve it to reply properly, oh well.
> >>
> >> So, I have a 30 node cluster (plus separate namenode, jobtracker, etc). Each is a 12-disk machine - two mirrored 250GB OS disks, ten 1TB data disks in JBOD. The original system config was six 1TB data disks - we added the last four disks months later. As I'm sure you can all guess, we have some interesting internal usage balancing issues on most of the nodes. To date, when individual disks get critically low on space (earlier this week I had a node with six disks around 97% full and four around 70%), we've been pulling them from the cluster, formatting the data disks, and sticking them back in (with a rebalance running to keep the cluster in some semblance of order).
> >>
> >> Obviously, if there were a better way to do this, I'd love to see it. I see that there are recommendations of killing the DataNode process and manually moving files, but my concern is that the DataNode process will spend an enormous amount of time tracking down these moves (currently around 820,000 blocks/node). And it's not necessarily easy to automate, so there's the danger of nuking blocks and making the problems worse. Are there alternatives to manual moves (or more automated ways that exist)? Or does my brute-force rebalance have the best chance of success, albeit slowly?
> >>
> >> We are also building a new cluster - starting around 1.2PB raw, eventually growing to around 5PB, for near-line storage of data.
> >> Our storage nodes will probably be 4U systems with 72 data disks each (yeah, good times). The problem with this becomes obvious - with the way Hadoop works today, if a disk fails, the datanode process chokes and dies when it tries to write to it. We've been told repeatedly that Hadoop doesn't perform well when it operates on RAID arrays, but to scale effectively we're going to have to do just that - three 24-disk controllers in RAID-6 mode. How bad is this going to be? JBOD just doesn't scale beyond a couple of disks per machine; the failure rate will knock machines out of the cluster too often (and at 60TB per node, rebalancing will take forever, even if I let it saturate gigabit).
> >>
> >> I appreciate opinions and suggestions. Thanks!
> >>
> >> -j
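For reference, the option Eli describes goes in hdfs-site.xml on each datanode. A minimal sketch, assuming a build that includes HDFS-1161 (the value of 1 here is only an example):

    <property>
      <name>dfs.datanode.failed.volumes.tolerated</name>
      <value>1</value>
      <!-- number of data volumes that may fail before the datanode stops
           offering service; the default of 0 keeps the current behavior of
           shutting down on the first failed volume -->
    </property>

Per HDFS-1158, the datanode still refuses to start while a configured volume is failed, so the failed mount has to be fixed or removed from dfs.data.dir before a restart.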