On 01/07/2011 01:16, Ted Dunning wrote:
> You have to consider the long-term reliability as well.
> Losing an entire set of 10 or 12 disks at once makes the overall
> reliability of a large cluster very suspect. This is because it
> becomes entirely too likely that two additional drives will fail
> before the data on the off-line node can be replicated. For 100
> nodes, that can decrease the average time to data loss down to less
> than a year.
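To put rough numbers on that argument, here's a back-of-envelope
sketch; the failure rate, re-replication window, block count and
outage rate are all assumptions I've made up, and the answer swings by
orders of magnitude as you change them:

  # Back-of-envelope risk of losing data while a whole node's worth of
  # blocks sits at replication 2.  Every constant is an assumption.
  from math import exp

  nodes          = 100
  disks_per_node = 12
  disk_afr       = 0.05     # annual failure rate per disk (assumed)
  window_h       = 12.0     # hours to re-replicate the dead node (assumed)
  blocks_on_node = 200000   # blocks left with only 2 replicas (assumed)

  other_disks = (nodes - 1) * disks_per_node
  p_disk = disk_afr * window_h / (365.0 * 24)   # P(given disk dies in window)

  # Poisson approximation: chance of >= 2 further disk failures in the window.
  lam = other_disks * p_disk
  p_two_more = 1 - exp(-lam) * (1 + lam)

  # With random placement, each at-risk block's two surviving replicas sit
  # on one specific pair of the other disks; given two extra failures, data
  # is lost if any block maps onto exactly that pair.
  pairs = other_disks * (other_disks - 1) // 2
  p_hit = 1 - (1 - 1.0 / pairs) ** blocks_on_node

  p_per_incident = p_two_more * p_hit
  incidents_per_year = nodes * 0.5              # whole-node outages/year, assumed
  p_per_year = 1 - (1 - p_per_incident) ** incidents_per_year

  print("P(data loss per node outage): %.1e" % p_per_incident)
  print("P(data loss per year):        %.1e" % p_per_year)

The point isn't the exact figure, it's how quickly it degrades as the
re-replication window stretches out and the cluster grows.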
There's also Rodrigo's work on alternate block placement that doesn't
scatter blocks quite so randomly across a cluster, so the loss of a
node or rack doesn't have adverse effects on so many files:
https://issues.apache.org/jira/browse/HDFS-1094
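The gist, as I read it (this is my own toy sketch of the idea, not the
actual patch; node count, group size and block count are all made up):

  # Toy Monte-Carlo of random vs group-constrained placement, replication = 3.
  import random

  NODES, GROUP_SIZE, BLOCKS, TRIALS = 60, 6, 100000, 5000
  nodes = list(range(NODES))
  groups = [nodes[i:i + GROUP_SIZE] for i in range(0, NODES, GROUP_SIZE)]

  def place_random():
      # three replicas anywhere in the cluster
      return frozenset(random.sample(nodes, 3))

  def place_grouped():
      # three replicas confined to one small node group
      return frozenset(random.sample(random.choice(groups), 3))

  def loss_rate(place):
      used = {place() for _ in range(BLOCKS)}          # replica triples in use
      lost = sum(frozenset(random.sample(nodes, 3)) in used
                 for _ in range(TRIALS))               # random 3-node failures
      return lost / float(TRIALS)

  print("P(3-node failure loses a block), random :", loss_rate(place_random))
  print("P(3-node failure loses a block), grouped:", loss_rate(place_grouped))

With numbers like these the random layout loses something on nearly
every 3-node failure, while the grouped layout only loses when all
three dead nodes land in the same group; the flip side is that when it
does lose, it loses a much bigger chunk in one go.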
Given that most HDD failures happen on cluster reboot, it is possible
for all 10-12 disks in a node not to come up at the same time if the
cluster has been up for a while. But like Todd says: worry. At least
a bit.
I've heard hints of one FS that actually includes HDD batch data in
block placement, to try to scatter replicas across batches and to bias
towards using new HDDs for temp storage during burn-in. Some research
work on doing that to HDFS could be something to keep a postgraduate
busy for a while: "Disk batch-aware block placement".
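Nothing like this exists today, so purely as a sketch of what such a
policy might look like -the Disk class, BURN_IN_DAYS and both
functions here are invented for illustration, not any real HDFS API:

  # Hypothetical "disk batch-aware" placement sketch.  Everything here
  # is made up to show the idea.
  from dataclasses import dataclass
  import random

  @dataclass
  class Disk:
      node: str
      batch: str        # manufacturer batch / date code
      age_days: int

  BURN_IN_DAYS = 30

  def pick_replica_disks(disks, existing, n=3):
      """Pick n disks for a block, biased away from batches already used."""
      chosen = list(existing)
      candidates = [d for d in disks if d not in chosen]
      random.shuffle(candidates)
      while len(chosen) < n and candidates:
          used_batches = {d.batch for d in chosen}
          # prefer a disk from an unused batch that is past burn-in;
          # fall back to anything if no such disk exists
          best = next((d for d in candidates
                       if d.batch not in used_batches
                       and d.age_days >= BURN_IN_DAYS),
                      candidates[0])
          chosen.append(best)
          candidates.remove(best)
      return chosen

  def pick_temp_disk(disks):
      """Bias temp/spill storage toward disks still in burn-in."""
      young = [d for d in disks if d.age_days < BURN_IN_DAYS]
      return random.choice(young or disks)

  # tiny demo with 36 invented disks across 3 nodes and 4 batches
  disks = [Disk("node%d" % (i // 12), "batch-%d" % (i % 4),
                random.randint(1, 400)) for i in range(36)]
  print([(d.node, d.batch) for d in pick_replica_disks(disks, [])])
  print(pick_temp_disk(disks))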
> This can only be mitigated in stock hadoop by keeping the number of
> drives relatively low.
Now I'm confused. Do you mean # of HDDs/server, or HDDs/filesystem?
Because it seems to me that "stock" HDFS's use in production makes it
one of the filesystems on the planet with the largest number of
non-RAIDed HDDs out there -things like Lustre and IBM GPFS go for
RAID, as does HP IBRIX (the last two of which have some form of Hadoop
support too, if you ask nicely). HDD/server numbers matter in that in
a small cluster it's better to have fewer disks per machine, so you
get more servers to spread the data over; you don't really want your
100 TB in three 1U servers (rough numbers on why below). As your
cluster grows -and you care more about storage capacity than raw
compute- the appeal of 24+ TB/server starts to look good, and that's
when you care about the improvements to datanodes handling the loss of
a single worker disk better. Even without that, rebooting the DN may
fix things, but the impact on ongoing work is the big issue -you don't
just lose a replicated block, you lose the unreplicated intermediate
data of whatever work was running on that node.
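Rough arithmetic on why fewer, fatter nodes hurt; the NIC speed and
usable fraction are assumptions, swap in your own:

  # Disks-per-server trade-off: how long a dead node's data takes to
  # re-replicate.  All numbers are assumptions.
  total_tb = 100.0
  nic_gbit = 1.0     # per-node network, Gbit/s (assumed, 2011-ish)
  usable   = 0.5     # fraction of that usable for re-replication (assumed)

  def node_loss(servers):
      tb_per_node = total_tb / servers
      # surviving nodes re-replicate the dead node's blocks in parallel
      agg_gbit = (servers - 1) * nic_gbit * usable
      hours = tb_per_node * 8 * 1024 / agg_gbit / 3600
      return tb_per_node, hours

  for servers in (3, 10, 50):
      tb, hours = node_loss(servers)
      print("%3d servers: %5.1f TB/node, ~%5.1f h to re-replicate a dead node"
            % (servers, tb, hours))

The longer that re-replication window, the longer the dead node's
blocks sit at replication 2, which is exactly the exposure Ted
describes above.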
Cascade failures leading to cluster outages are a separate issue, and
are more often triggered by switch failure or misconfiguration than
anything else. It doesn't matter how reliable the hardware is if it
gets the wrong configuration.