On Mon, Jan 3, 2011 at 11:55 AM, Jonathan Disher <jdis...@parad.net> wrote:
> The problem is, what do you define as a failure?  If the disk is failing, 
> writes to the filesystem will fail - how does Hadoop differentiate between 
> a permissions problem and a physical disk failure?  Both just return an error.
>

Anything that prevents the volume (mount) from being read or written.
Any failure to write to the volume is considered a failure to use the
volume. Since HDFS doesn't support read-only volumes (e.g. it can't handle
a mount that can be read but not written), these all count as failures
and will cause the volume to be taken offline.
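
For reference, the dfs.datanode.failed.volumes.tolerated setting I
mention below lives in hdfs-site.xml. A minimal sketch (the value of 1
here is just an illustration; the default of 0 tolerates no failures):

  <property>
    <name>dfs.datanode.failed.volumes.tolerated</name>
    <!-- number of volumes allowed to fail before the datanode
         stops offering service; defaults to 0 -->
    <value>1</value>
  </property>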

> And yeah, the idea of stopping the datanode, removing the affected mount from 
> hdfs-site.xml, and restarting has been discussed.  The problem is, when that 
> disk gets replaced and re-added, I end up with horrible internal balance 
> issues, which is exactly the problem I have now :(

What's the particular issue?  Having an unbalanced set of local disks
should at worst be a performance problem. HDFS doesn't write blocks to
full volumes; it will just start using the other disks.
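
If you do want to even out the disks by hand, the manual-move approach
from your earlier mail is normally done with the DN stopped. This is only
a rough sketch with illustrative paths and block names - it assumes
/data/1/dfs/data and /data/2/dfs/data are both listed in dfs.data.dir,
and that a block file always moves together with its .meta file:

  # stop the datanode so nothing writes to the volumes mid-move
  bin/hadoop-daemon.sh stop datanode

  # move a block and its metadata file from a full volume to an
  # emptier one; the DN rebuilds its block map by scanning the data
  # dirs on startup, so it will find the blocks in their new home
  mv /data/1/dfs/data/current/blk_1234567890 \
     /data/1/dfs/data/current/blk_1234567890_1001.meta \
     /data/2/dfs/data/current/

  bin/hadoop-daemon.sh start datanode

With ~820k blocks per node you'd want to script that very carefully,
which is exactly the risk you raised.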

Thanks,
Eli

> -j
>
> On Jan 3, 2011, at 9:07 AM, Eli Collins wrote:
>
>> Hey Jonathan,
>>
>> There's an option (dfs.datanode.failed.volumes.tolerated, introduced
>> in HDFS-1161) that allows you to specify the number of volumes that
>> are allowed to fail before a datanode stops offering service.
>>
>> There's an operational issue that still needs to be addressed
>> (HDFS-1158) that you should be aware of - the DN will still not start
>> if any of the volumes have failed, so to restart the DN you'll need
>> to either unconfigure the failed volumes or fix them. I'd
>> like to make DN startup respect the config value so it tolerates
>> failed volumes on startup as well.
>>
>> Thanks,
>> Eli
>>
>> On Sun, Jan 2, 2011 at 7:20 PM, Jonathan Disher <jdis...@parad.net> wrote:
>>> I see that there was a thread on this in December, but I can't retrieve it 
>>> to reply properly, oh well.
>>>
>>> So, I have a 30 node cluster (plus separate namenode, jobtracker, etc).  
>>> Each is a 12 disk machine - two mirrored 250GB OS disks, ten 1TB data disks 
>>> in JBOD.  Original system config was six 1TB data disks - we added the last 
>>> four disks months later.  I'm sure you can all guess, we have some 
>>> interesting internal usage balancing issues on most of the nodes.  To date, 
>>> when individual disks get critically low on space (earlier this week I had 
>>> a node with six disks around 97% full and four around 70%), we've been pulling 
>>> those nodes from the cluster, formatting the data disks, and sticking them back in 
>>> (with a rebalance running to keep the cluster in some semblance of order).
>>>
>>> Obviously if there was a better way to do this, I'd love to see it.  I see 
>>> that there are recommendations of killing the DataNode process and manually 
>>> moving files, but my concern is that the DataNode process will spend an 
>>> enormous amount of time tracking down these moves (currently around 820,000 
>>> blocks/node).  And it's not easy to automate, so there's the danger of 
>>> nuking blocks and making the problem worse.  Are there alternatives to 
>>> manual moves, or more automated approaches?  Or does my brute-force 
>>> rebalance have the best chance of success, albeit slowly?
>>>
>>> We are also building a new cluster - starting around 1.2PB raw, eventually 
>>> growing to around 5PB, for near-line storage of data.  Our storage nodes 
>>> will probably be 4U systems with 72 data disks each (yeah, good times).  
>>> The problem with this becomes obvious - with the way Hadoop works today, if 
>>> a disk fails, the datanode process chokes and dies when it tries to write 
>>> to it.  We've been told repeatedly that Hadoop doesn't perform well when it 
>>> operates on RAID arrays, but to scale effectively, we're going to have to 
>>> do just that - three 24-disk controllers in RAID-6 mode.  How bad is this 
>>> going to be?  JBOD just doesn't scale beyond a couple of disks per machine; 
>>> the failure rate will knock machines out of the cluster too often (and at 
>>> 60TB per node, rebalancing will take forever, even if I let it saturate 
>>> gigabit).
>>>
>>> I appreciate opinions and suggestions.  Thanks!
>>>
>>> -j
>
>
