MTTF is a difficult number to pin down. Two of the most-cited papers on the topic are: http://db.usenix.org/events/fast07/tech/schroeder/schroeder_html/index.html and http://labs.google.com/papers/disk_failures.pdf
Ted is assuming an MTTF of 25k hours; I think that's overly pessimistic, although both papers indicate that MTTF is a crappy way to model disk lifetime. I think a lot has to do with the quality of the batch of hard drives you get and the operating conditions.

Brian

On Aug 10, 2011, at 2:19 PM, Luke Lu wrote:

> On Wed, Aug 10, 2011 at 10:40 AM, Ted Dunning <[email protected]> wrote:
>> To be specific, taking a 100 node x 10 disk x 2 TB configuration with drive
>> MTBF of 1000 days, we should be seeing drive failures on average once per
>> day....
>> For a 10,000 node cluster, however, we should expect the average rate of
>> disk failure rate of one failure every 2.5 hours.
>
> Do you have real data to back the analysis? You assume a uniform disk
> failure distribution, which is absolutely not true. I can only say
> that our ops data across 40000+ nodes shows that the above analysis is
> not even close. (This is assuming that the ops know what they are
> doing though :)
>
> __Luke
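
[Editor's note: for reference, the sketch below reproduces the back-of-envelope arithmetic behind Ted's figures, under the same simplification Luke is objecting to: independent drives with a constant (exponential) failure rate, so a fleet of N drives sees a failure roughly every MTTF/N hours. The function name and the reading of the second figure as roughly 10,000 drives at ~25k hours MTTF are assumptions, not data from the thread.]

    # Back-of-envelope disk failure rate, assuming independent drives with a
    # constant failure rate. Illustrative only -- not measured ops data.
    HOURS_PER_DAY = 24

    def mean_hours_between_failures(drive_mttf_hours: float, num_drives: int) -> float:
        """With N independent drives of the given MTTF, the fleet sees
        a failure on average every MTTF / N hours."""
        return drive_mttf_hours / num_drives

    # Ted's first example: 100 nodes x 10 disks, drive MTBF ~1000 days (~24k hours).
    drive_mttf = 1000 * HOURS_PER_DAY                                # 24,000 hours
    print(mean_hours_between_failures(drive_mttf, 100 * 10))         # ~24 h, i.e. about one failure per day

    # The "every 2.5 hours" figure matches ~10,000 drives at ~25k hours MTTF (an assumption).
    print(mean_hours_between_failures(25_000, 10_000))               # 2.5 h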
