On Mon, Jan 3, 2011 at 10:29 PM, Jonathan Disher <jdis...@parad.net> wrote:
> That's what we've been doing. Again, the problem is, we still have to pull
> the datanode out of rotation and change config, replace disk, put it
> back... even if I have spares on hand and finish this in a few minutes, I
> still have one empty disk and many tens of not-empty disks.

Aside from performance, is there another issue? Ideally, of course, the new
disks would automatically get re-balanced, and you could rate-limit the
transfers to limit the impact on the machine.

> Monitoring and identifying the failure isn't the problem, we have that
> down pat. I'm hoping for a better way to re-balance the disks in the node
> after a failure. I suspect the sad answer is that what I'm doing now is
> the best thing for it.

HDFS-1312 tracks re-balancing disks within a datanode. Currently people
re-balance the directories manually when the datanode is powered off
(datanodes don't care which blocks reside in which volumes, so you can
safely rebalance by hand).
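Roughly, that manual shuffle amounts to the sketch below. The paths are
illustrative, real volumes usually also spread blocks across subdir*/
directories, and the datanode has to stay down the whole time:

  #!/usr/bin/env bash
  # Rough sketch only: move a batch of blocks from a nearly-full volume to an
  # emptier one (already initialized as a data dir) while the datanode is
  # stopped. On restart the DN rescans every dfs.data.dir volume, so it
  # doesn't matter which volume a given block ends up in.
  shopt -s nullglob

  SRC=/data/1/dfs/data/current   # example: volume at ~97% usage
  DST=/data/9/dfs/data/current   # example: freshly replaced, mostly empty volume

  cd "$SRC" || exit 1
  moved=0
  for blk in blk_*; do
    case "$blk" in *.meta) continue ;; esac  # each .meta moves with its block
    mv -- "$blk" "$blk"_*.meta "$DST"/       # keep block and checksum file together
    moved=$((moved + 1))
    [ "$moved" -ge 1000 ] && break           # move a batch, then re-check disk usage
  done
  echo "moved $moved blocks from $SRC to $DST"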
Thanks,
Eli

> -j
>
> On Jan 3, 2011, at 10:21 PM, Esteban Gutierrez Moguel wrote:
>
> Jonathan,
>
> Hadoop will throw an exception according to the kind of error: an
> AccessControlException if it's permission-related, or an IOException for
> any other disk-related failure.
>
> A safer approach to handling physical failures would be monitoring syslog
> messages (Syslog4j, Nagios, Ganglia, etc.), and if you are lucky enough
> that the node doesn't hang after the disk failure, you could shut it down
> gracefully.
>
> esteban.
>
> On Mon, Jan 3, 2011 at 13:55, Jonathan Disher <jdis...@parad.net> wrote:
>>
>> The problem is, what do you define as a failure? If the disk is failing,
>> writes to the filesystem will fail - how does Hadoop differentiate
>> between a permissions problem and a physical disk failure? They both
>> return an error.
>>
>> And yeah, the idea of stopping the datanode, removing the affected mount
>> from hdfs-site.xml, and restarting has been discussed. The problem is,
>> when that disk gets replaced and re-added, I have horrible internal
>> balance issues, thus causing the problem I have now :(
>>
>> -j
>>
>> On Jan 3, 2011, at 9:07 AM, Eli Collins wrote:
>>
>> > Hey Jonathan,
>> >
>> > There's an option (dfs.datanode.failed.volumes.tolerated, introduced
>> > in HDFS-1161) that allows you to specify the number of volumes that
>> > are allowed to fail before a datanode stops offering service.
>> >
>> > There's an operational issue that still needs to be addressed
>> > (HDFS-1158) that you should be aware of - the DN will still not start
>> > if any of the volumes have failed, so to restart the DN you'll need
>> > to either unconfigure the failed volumes or fix them. I'd like to make
>> > DN startup respect the config value so it tolerates failed volumes on
>> > startup as well.
>> >
>> > Thanks,
>> > Eli
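For reference, the option above is a per-datanode setting in hdfs-site.xml;
a minimal snippet, with an illustrative value of 2:

  <!-- hdfs-site.xml on each datanode: tolerate up to two failed data
       volumes before the DN takes itself out of service. The value here is
       only an example; the default of 0 means any volume failure shuts the
       datanode down. -->
  <property>
    <name>dfs.datanode.failed.volumes.tolerated</name>
    <value>2</value>
  </property>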
I >> >> see that there are recommendations of killing the DataNode process and >> >> manually moving files, but my concern is that the DataNode process will >> >> spend an enormous amount of time tracking down these moves (currently >> >> around >> >> 820,000 blocks/node). And it's not necessarily easy to automate, so >> >> there's >> >> the danger of nuking blocks, and making the problems worse. Are there >> >> alternatives to manual moves (or more automated ways that exist)? Or has >> >> my >> >> brute-force rebalance got the best chance of success, albeit slowly? >> >> >> >> We are also building a new cluster - starting around 1.2PB raw, >> >> eventually growing to around 5PB, for near-line storage of data. Our >> >> storage nodes will probably be 4U systems with 72 data disks each (yeah, >> >> good times). The problem with this becomes obvious - with the way Hadoop >> >> works today, if a disk fails, the datanode process chokes and dies when it >> >> tries to write to it. We've been told repeatedly that Hadoop doesn't >> >> perform well when it operates on RAID arrays, but, to scale efffectively, >> >> we're going to have to do just that - three 24 disk controllers in RAID-6 >> >> mode. How bad is this going to be? JBOD just doesn't scale beyond a >> >> couple >> >> disks per machine, the failure rate will knock machines out of the cluster >> >> too often (and at 60TB per node, rebalancing will take forever, even if I >> >> let it saturate gigabit). >> >> >> >> I appreciate opinions and suggestions. Thanks! >> >> >> >> -j >> > > >