On Wed, Sep 18, 2013 at 10:11:11AM -0700, Tracy Reed wrote:
> On Wed, Sep 18, 2013 at 08:11:24AM PDT, Charles Polisher spake thusly:
> > - Monte-carlo simulations of RAID systems confirmed
> >   a batch of disk drives was vastly exceeding the claimed AFR,
> >   the vendor eventually copped to a quality problem.
>
> I'd like to know more about how this was done.

My shop had two RAID failures with data loss in four years, which was
supposed to be impossible, and that is what got me interested in how
reliable RAID actually is. Fun fact: getting struck by lightning is not
a rare event if your name is Roy "Dooms" Sullivan*.

You can find a spreadsheet on montecarlito.com with a pre-built general
Monte Carlo model. Elerath & Pecht give details of RAID array reliability
in "Enhanced Reliability Modeling of RAID Storage Systems", as do other
authors. Add drive reliability specs and you're all set.

Some results surprised me. For example, a RAID6 single-disk failure can
trigger whole-disk transfers from every remaining drive all at once. One
uncorrectable error anywhere in that whole bitstream can cause total data
loss. With large drives, an uncorrected error can be expected as often as
1 in every 16 whole-disk transfers. Not as reliable as you might hope.
Implementation details such as scrubbing, or correlated failures due to
heat or vibration, can make a huge difference.

Hope that's what you wanted,
-- Charles

* http://en.wikipedia.org/wiki/Roy_Sullivan
_______________________________________________
Discuss mailing list
[email protected]
https://lists.lopsa.org/cgi-bin/mailman/listinfo/discuss
This list provided by the League of Professional System Administrators
http://lopsa.org/
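
For anyone who wants to play with the numbers, here is a rough Python
sketch of the kind of arithmetic behind a "1 in every 16 whole-disk
transfers" figure, plus a tiny Monte Carlo check. It is not the
montecarlito.com spreadsheet or the Elerath & Pecht model. The drive size
(750 GB), the error rate (1 per 1e14 bits read), the 7 surviving drives,
and every function name are assumptions made up for illustration. It also
only counts rebuilds that hit at least one read error; whether that error
costs you data depends on the RAID level and the implementation.

```python
import math
import random

# Assumed illustrative figures, NOT taken from the post above.
DRIVE_BYTES = 750e9   # hypothetical drive capacity in bytes
URE_PER_BIT = 1e-14   # hypothetical unrecoverable-read-error rate per bit


def expected_errors_per_transfer(capacity_bytes, ure_per_bit):
    """Mean number of unrecoverable errors when reading one whole drive."""
    return capacity_bytes * 8 * ure_per_bit


def p_error_in_transfer(capacity_bytes, ure_per_bit):
    """P(at least one error in one whole-disk read), bits assumed independent."""
    bits = capacity_bytes * 8
    # 1 - (1 - p)**bits, computed via log1p/expm1 to limit float round-off.
    return -math.expm1(bits * math.log1p(-ure_per_bit))


def p_error_in_rebuild(n_surviving, capacity_bytes, ure_per_bit):
    """P(at least one of n_surviving whole-disk reads hits an error)."""
    p = p_error_in_transfer(capacity_bytes, ure_per_bit)
    return 1.0 - (1.0 - p) ** n_surviving


def simulate_rebuilds(trials, n_surviving, capacity_bytes, ure_per_bit, seed=0):
    """Monte Carlo estimate of the same probability, one draw per drive read."""
    rng = random.Random(seed)
    p = p_error_in_transfer(capacity_bytes, ure_per_bit)
    hits = 0
    for _ in range(trials):
        # A rebuild counts as a hit if any surviving drive returns an error.
        if any(rng.random() < p for _ in range(n_surviving)):
            hits += 1
    return hits / trials


if __name__ == "__main__":
    mean = expected_errors_per_transfer(DRIVE_BYTES, URE_PER_BIT)
    print("about 1 error per %.1f whole-disk transfers" % (1 / mean))
    print("analytic  :", p_error_in_rebuild(7, DRIVE_BYTES, URE_PER_BIT))
    print("simulated :", simulate_rebuilds(100_000, 7, DRIVE_BYTES, URE_PER_BIT))
```

A real model would also draw drive failure times, rebuild durations, and
scrub intervals from distributions, and would need some way to represent
correlated failures; this sketch leaves all of that out.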
