On 01/07/2011 01:16, Ted Dunning wrote:
You have to consider the long-term reliability as well.

Losing an entire set of 10 or 12 disks at once makes the overall reliability
of a large cluster very suspect.  This is because it becomes entirely too
likely that two additional drives will fail before the data on the off-line
node can be replicated.  For 100 nodes, that can decrease the average time
to data loss down to less than a year.
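
To put a rough number on that risk: below is a back-of-envelope Python sketch of the chance that two more disks die while a dead node's blocks are still being re-replicated, and what that does to mean time to data loss. Every constant in it (AFR, re-replication window, node outage rate) is an illustrative assumption, not a measurement, and the answer swings wildly with the re-replication time and with how correlated the failures are, so treat it as a calculator to plug your own numbers into rather than a result.

import math

# Illustrative assumptions only -- substitute your own fleet's numbers.
NODES = 100
DISKS_PER_NODE = 12
DISK_AFR = 0.04                   # assumed annual failure rate per disk
REREPLICATION_HOURS = 6.0         # assumed time to re-replicate a dead node
NODE_OUTAGES_PER_NODE_YEAR = 1.0  # assumed whole-node outages per node per year
HOURS_PER_YEAR = 24 * 365.0

def p_two_more_failures():
    # With fully random placement, almost any pair of extra disk failures
    # overlaps some block whose third replica was on the dead node, so
    # P(>= 2 more failures within the window) is a rough proxy for
    # P(data loss | node outage).  Failures modelled as Poisson arrivals.
    other_disks = (NODES - 1) * DISKS_PER_NODE
    lam = other_disks * DISK_AFR * REREPLICATION_HOURS / HOURS_PER_YEAR
    return 1.0 - math.exp(-lam) * (1.0 + lam)

def mean_years_to_data_loss():
    outages_per_year = NODES * NODE_OUTAGES_PER_NODE_YEAR
    return 1.0 / (outages_per_year * p_two_more_failures())

print("P(loss | node outage) ~ %.5f" % p_two_more_failures())
print("Mean time to data loss ~ %.1f years" % mean_years_to_data_loss())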

There's also Rodrigo's work on alternate block placement, which doesn't scatter blocks quite so randomly across a cluster, so that the loss of a node or rack doesn't have adverse effects on so many files:

https://issues.apache.org/jira/browse/HDFS-1094
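
To get an intuition for why constraining placement helps, here's a toy combinatorial comparison. It is not the HDFS-1094 algorithm, just a sketch assuming 3x replication and replicas confined to disjoint node groups of an assumed size:

from math import comb

def fatal_triples_random(nodes):
    # With fully random placement and enough blocks, essentially every
    # 3-node combination holds all replicas of *some* block.
    return comb(nodes, 3)

def fatal_triples_grouped(nodes, group_size):
    # With replicas confined to disjoint groups, only triples drawn from
    # within a single group can destroy all copies of a block.
    return (nodes // group_size) * comb(group_size, 3)

n = 100
print("random placement:", fatal_triples_random(n), "fatal 3-node sets")       # 161700
print("grouped, g=10:   ", fatal_triples_grouped(n, 10), "fatal 3-node sets")  #   1200

The trade-off, of course, is that when one of those rarer fatal combinations does occur, it takes out more blocks at once.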

Given that most HDD failures happen on cluster reboot, it is quite possible for 10-12 disks not to come back up at the same time if the cluster has been up for a while. But, like Todd says: worry. At least a bit.


I've heard hints of one FS that actually includes HDD batch data in block placement, to try to scatter data across batches and to bias towards using new HDDs for temp storage during burn-in. Some research work on doing that to HDFS could keep a postgraduate busy for a while: "disk batch-aware block placement".
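
Purely as a strawman of what such a policy might look like (none of this is an HDFS API; the DiskInfo record, the burn-in threshold and the selection rule are all invented for illustration):

import random
from dataclasses import dataclass

@dataclass
class DiskInfo:
    node: str
    batch_id: str      # manufacturing batch / lot code
    age_days: int

BURN_IN_DAYS = 30      # assumed burn-in period for new drives

def choose_replica_disks(disks, replication=3):
    # Pick replica targets so that no two replicas share a node or a disk
    # batch, preferring drives that have already survived burn-in.
    seasoned = [d for d in disks if d.age_days >= BURN_IN_DAYS]
    candidates = list(seasoned if len(seasoned) >= replication else disks)
    random.shuffle(candidates)
    chosen, used_batches, used_nodes = [], set(), set()
    for d in candidates:
        if d.batch_id in used_batches or d.node in used_nodes:
            continue
        chosen.append(d)
        used_batches.add(d.batch_id)
        used_nodes.add(d.node)
        if len(chosen) == replication:
            break
    return chosen

def choose_temp_storage(disks):
    # Bias scratch/temp data toward the newest drives during burn-in.
    fresh = [d for d in disks if d.age_days < BURN_IN_DAYS]
    return random.choice(fresh or disks)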

This can only be mitigated in stock Hadoop by keeping the number of drives relatively low.

Now I'm confused. Do you mean # of HDDs/server, or HDDs/filesystem? Because it seems to me that "stock" HDFS's use in production makes it one of the filesystems on the planet with the largest number of non-RAIDed HDDs out there; things like Lustre and IBM GPFS go for RAID, as does HP IBRIX (the last two of which have some form of Hadoop support too, if you ask nicely).

HDD/server numbers matter in that, in a small cluster, it's better to have fewer disks per machine and more servers to spread the data over; you don't really want your 100 TB in three 1U servers. As your cluster grows -and you care more about storage capacity than raw compute- the appeal of 24+ TB/server starts to look good, and that's when you care about the improvements to datanodes handling the loss of a worker disk better. Even without that, rebooting the DN may fix things, but the impact on ongoing work is the big issue -you don't just lose a replicated block, you lose data.
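
A rough sketch of the arithmetic behind "don't put your 100 TB in three 1U servers": the data on a dead node has to be re-replicated from and to the survivors, so recovery time scales roughly with TB-per-node over the survivors' usable bandwidth. The 1 Gb/s per node and the cluster shapes below are assumptions for illustration; it ignores disk throughput, and with only three nodes and 3x replication you can't restore full replication at all.

# Recovery-time back-of-envelope: same 100 TB spread over different node counts.
def rereplication_hours(total_tb, nodes, gbps_per_node=1.0):
    tb_per_node = total_tb / nodes
    survivors = nodes - 1
    tb_per_hour = survivors * gbps_per_node * 3600 / 8 / 1000  # Gb/s -> TB/h
    return tb_per_node / tb_per_hour

for n in (3, 10, 50):
    print("%3d nodes: ~%.1f h to re-replicate a dead node" % (n, rereplication_hours(100.0, n)))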

Cascade failures leading to cluster outages are a separate issue, normally triggered by switch failure or misconfiguration more than anything else. It doesn't matter how reliable the hardware is if it gets the wrong configuration.
