That's what we've been doing. Again, the problem is that we still have to pull the datanode out of rotation, change the config, replace the disk, and put it back... even if I have spares on hand and finish this in a few minutes, I still have one empty disk and many tens of non-empty disks. Monitoring and identifying the failure isn't the problem; we have that down pat. I'm hoping for a better way to re-balance the disks in the node after a failure. I suspect the sad answer is that what I'm doing now is the best thing for it.
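(For what it's worth, here is a rough sketch of the manual-move approach mentioned further down in the thread, against a 0.20-era data directory layout. It assumes the DataNode is stopped first; the mount points and block IDs are purely illustrative, and blocks may also live under current/subdirNN directories.)

    # stop the datanode so nothing is writing while blocks are shuffled
    $HADOOP_HOME/bin/hadoop-daemon.sh stop datanode

    # move block files from the fullest volume to the emptiest one;
    # each blk_NNN file must travel with its blk_NNN_*.meta companion
    cd /data/5/dfs/data/current
    for b in blk_1073741901 blk_1073741944; do
      mv "$b" "$b"_*.meta /data/9/dfs/data/current/
    done

    # on restart the datanode rescans its volumes and reports the blocks
    # from their new locations
    $HADOOP_HOME/bin/hadoop-daemon.sh start datanode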
-j

On Jan 3, 2011, at 10:21 PM, Esteban Gutierrez Moguel wrote:

> Jonathan,
>
> Hadoop will throw an exception according to the kind of error: AccessControlException if it's permission related, or IOException for any other disk related task.
>
> A safer approach to handling physical failures would be monitoring syslog messages (Syslog4j, nagios, ganglia, etc.), and if you are lucky enough and the node doesn't hang after the disk failure, you could shut it down gracefully.
>
> esteban.
>
> On Mon, Jan 3, 2011 at 13:55, Jonathan Disher <jdis...@parad.net> wrote:
> The problem is, what do you define as a failure? If the disk is failing, writes to the filesystem will fail - how does Hadoop differentiate between a permissions problem and a physical disk failure? Both return errors.
>
> And yeah, the idea of stopping the datanode, removing the affected mount from hdfs-site.xml, and restarting has been discussed. The problem is, when that disk gets replaced and re-added, I end up with horrible internal balance issues - thus causing the problem I have now :(
>
> -j
>
> On Jan 3, 2011, at 9:07 AM, Eli Collins wrote:
>
> > Hey Jonathan,
> >
> > There's an option (dfs.datanode.failed.volumes.tolerated, introduced in HDFS-1161) that allows you to specify the number of volumes that are allowed to fail before a datanode stops offering service.
> >
> > There's an operational issue that still needs to be addressed (HDFS-1158) that you should be aware of - the DN will still not start if any of the volumes have failed, so to restart the DN you'll need to either unconfigure the failed volumes or fix them. I'd like to make DN startup respect the config value so it tolerates failed volumes on startup as well.
> >
> > Thanks,
> > Eli
> >
> > On Sun, Jan 2, 2011 at 7:20 PM, Jonathan Disher <jdis...@parad.net> wrote:
> >> I see that there was a thread on this in December, but I can't retrieve it to reply properly, oh well.
> >>
> >> So, I have a 30 node cluster (plus separate namenode, jobtracker, etc). Each is a 12-disk machine - two mirrored 250GB OS disks, ten 1TB data disks in JBOD. The original system config was six 1TB data disks - we added the last four disks months later. As I'm sure you can all guess, we have some interesting internal usage balancing issues on most of the nodes. To date, when individual disks get critically low on space (earlier this week I had a node with six disks around 97% full and four around 70%), we've been pulling them from the cluster, formatting the data disks, and sticking them back in (with a rebalance running to keep the cluster in some semblance of order).
> >>
> >> Obviously, if there were a better way to do this, I'd love to see it. I see that there are recommendations of killing the DataNode process and manually moving files, but my concern is that the DataNode process will spend an enormous amount of time tracking down these moves (currently around 820,000 blocks/node). And it's not necessarily easy to automate, so there's the danger of nuking blocks and making the problems worse. Are there alternatives to manual moves (or more automated ways that exist)? Or does my brute-force rebalance have the best chance of success, albeit slowly?
> >>
> >> We are also building a new cluster - starting around 1.2PB raw, eventually growing to around 5PB, for near-line storage of data.
> >> Our storage nodes will probably be 4U systems with 72 data disks each (yeah, good times). The problem with this becomes obvious - with the way Hadoop works today, if a disk fails, the datanode process chokes and dies when it tries to write to it. We've been told repeatedly that Hadoop doesn't perform well when it operates on RAID arrays, but to scale effectively we're going to have to do just that - three 24-disk controllers in RAID-6 mode. How bad is this going to be? JBOD just doesn't scale beyond a couple of disks per machine; the failure rate will knock machines out of the cluster too often (and at 60TB per node, rebalancing will take forever, even if I let it saturate gigabit).
> >>
> >> I appreciate opinions and suggestions. Thanks!
> >>
> >> -j
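For reference, the option Eli describes goes in hdfs-site.xml on each datanode. A minimal sketch, assuming a build that includes HDFS-1161 (the value of 1 here is only an example):

    <property>
      <name>dfs.datanode.failed.volumes.tolerated</name>
      <value>1</value>
      <!-- number of data volumes that may fail before the datanode stops
           offering service; the default of 0 keeps the current behavior of
           shutting down on the first failed volume -->
    </property>

Per HDFS-1158, the datanode still refuses to start while a configured volume is failed, so the failed mount has to be fixed or removed from dfs.data.dir before a restart.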