Good writeup, but I'll point out that you are actually talking about three
errors happening here: the first error that triggers the start of the rebuild,
and then two additional errors after that.
Also, while I am not disagreeing with your math, this still doesn't seem right.
Also, is the BER really a permanent error?
If it is, then doing an array scrub every month or so should give you confidence
that you don't have multiple errors lurking on the drives.
Also, if the BER represents a permanent error, that would mean that you should
expect to 'lose a drive' about every four full reads of the array. If you scrub
the array monthly, then this should mean that you lose 3 drives a year, just
due to the reads from the array check (in addition to any actual use of the
array). Given any reasonable amount of real use, this would seem to indicate
that you should be losing half of your array every year. Drive failure rates
are not that high.
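Roughly, in Python (the one-error-per-~4-full-reads figure is the one quoted
below; the rest is just the arithmetic):

    # Back-of-envelope: BER-induced 'drive losses' per year from monthly scrubs,
    # using the quoted figure of roughly 4 full array reads per expected bit error.
    scrubs_per_year = 12            # one full-array read per month
    reads_per_error = 1 / 0.24      # ~4.17 full array reads between expected errors
    errors_per_year = scrubs_per_year / reads_per_error
    print(f"expected BER hits per year from scrubbing alone: {errors_per_year:.1f}")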
I've got a 160x1T disk array (10x 16-disk RAID 6, no hot spares) that gets
completely re-written about every 8 weeks (it's essentially a circular buffer).
If a full read of 6 drives will fail 25% of the time, then these 160 drives
should almost never be able to complete a single pass without losing drives.
But they had <10 failures among the 160 drives in the first four years of
operation.
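For the sake of argument, a sketch of what the quoted 25% figure would
predict for this array, assuming (as the quoted calculation does) that the
errors are independent and that the per-read figure carries over to each
16-disk group:

    # If each RAID 6 group has a 25% chance of a BER hit per full read, how
    # often would a full pass over all 10 independent groups complete cleanly?
    p_group_hit = 0.25      # assumed, taken from the quoted whole-read figure
    groups = 10
    p_clean_pass = (1 - p_group_hit) ** groups
    print(f"chance a full pass sees no BER hit anywhere: {p_clean_pass:.1%}")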
So something just doesn't seem to match reality here.
David Lang
On Sun, 22 Sep 2013, Charles Polisher wrote:
David Lang wrote:
RAID 6 or RAID 5?
I would not expect a single error in that transfer to kill the
entire RAID, just to kill a second disk (and only if you lost a
third disk would the array die)
I apologize for the length of my response.
TL;DR: I meant RAID6. Expect is an interesting word. As many as
1 in 40 RAID 6 array rebuilds may fail in a given case. Use
smallish drives (<= 1TB) of enterprise quality and your data
will be safe.
This is a pessimistic walk-through with admitted weaknesses. No
shop I'm familiar with transfers data at the maximum rate 24
hours a day, 7 days a week. But the solace we take in RAID 6
double redundancy is undermined by a key characteristic: to
rebuild a failed drive, every remaining active drive has to
perform a whole-disk transfer. This exposes the array to a
secondary hazard, and then a tertiary hazard, both of which seem
like remote possibilities but aren't as remote as you might
hope. You'll probably never lose a RAID level 6 array, but it
depends on what you mean by 'probably'. Here's a timeline of a
failure, with both happy and sad outcomes. Starting at time t0,
t0 7 active 3TB elements (disks) are up + one available hot spare.
Array integrity: OK. Array can sustain loss of two elements.
+-------------------------+
| 1 2 3 4 5 P Q Hs | Elements 1..5 are data, P,Q are parity,
+-------------------------+ Hs is Hot standby.
t1 Element 1 fails completely.
The RAID controller removes element 1 from the array.
The hot spare is promoted to active element 1a, the array starts
rebuilding by reconstructing the missing data on the new element.
Array integrity: OK. Array can sustain loss of one element.
+-------------------------+
| F 2 3 4 5 P Q 1a | F is faulted element.
+-------------------------+ 1a is marked for reconstruction.
t2 For 7200RPM drives transferring 145 Mbytes/sec, the rebuild needs
a minimum of 5.75 hours to finish (a sketch of this arithmetic
follows the diagram below).
Array integrity: OK. Array can sustain loss of one element.
+-------------------------+
| F 2 3 4 5 P Q 1a | F is faulted element.
+-------------------------+ 1a is being reconstructed.
| | | | | | ^
| | | | | R->^
| | | | R-->--^ Elements 2,3,4,5,P,Q are read
| | | R-->-----^ to reconstruct the failed element
| | R-->--------^ from 144 terabits of data.
| R-->-----------^
R-->--------------+
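The 5.75-hour figure is just capacity over throughput; a minimal sketch,
assuming the drive can actually sustain 145 Mbytes/sec for the whole rebuild
(live I/O will stretch this out):

    # Minimum time to rebuild one 3 TB element at a sustained 145 Mbytes/sec.
    capacity_bytes = 3e12            # 3 TB (drive sizes use decimal units)
    rate_bytes_per_sec = 145e6       # assumed sustained transfer rate
    hours = capacity_bytes / rate_bytes_per_sec / 3600
    print(f"minimum rebuild time: {hours:.2f} hours")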
t3 The sysadmin replaces failed drive (was element 1) with a new drive.
The RAID controller designates it as a new Hot spare.
+-------------------------+
| Hs 2 3 4 5 P Q 1a | 1a is being reconstructed
+-------------------------+ from 2,3,4,5,P,Q.
t4 During the reconstruction of element 1a, an unrecoverable
single bit error is detected in element 3's bit stream.
How likely is this? The probability of a single bit error in
a whole-disk transfer is about
(# bits in bitstream)
p = -----------------------
1 / BER
Here the # of bits in the bitstream, since 6 elements are read
during rebuilding, is
3 terabytes * 8 bits * 6 drives = 144 terabits
BER is the Bit Error Rate of the disk. Manufacturers typically
express this as 1 error in so many bits, after error correction
has been attempted.
For a typical consumer drive, the BER is stated to be no more
than 1 in 10^14 bits (e.g., Western Digital Green). With 3TB
drives (drive sizes are specified in powers of ten), the
likelihood of a bit error in a whole-array transfer is
144 terabits
p1 = --------------- = 0.24 (24%)
10^14 bits
which means 1 / 0.24 = 4.166 whole-array transfers between
expected bit errors. (A Monte Carlo simulation with 500 trials
gave a figure of 16.3%, a bit more optimistic. A sketch of this
kind of estimate follows the diagram below.)
+-------------------------+
| Hs 2 3 4 5 P Q 1a | 3 is a partially faulted element
+-------------------------+ 1a is being reconstructed
| . | | | | ^ from 2,3,4,5,P,Q.
| . | | | R->^
| . | | R-->--^
| . | R-->-----^
| . R-->--------^
| . . . . . . . .
R-->--------------+
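A minimal sketch of the kind of Monte Carlo estimate mentioned above,
using the stated BER and element size but with the error model (independence
between elements, uniform placement) as assumptions; those modelling choices,
and whether you count per element or per array, move the number around quite
a bit, so treat this as illustrative rather than a reproduction of the 16.3%
figure:

    import random

    # Monte Carlo: what fraction of rebuild reads see at least one
    # unrecoverable bit error, given a per-bit error rate and the amount read?
    BER = 1e-14                   # assumed: 1 error per 10^14 bits read
    BITS_PER_ELEMENT = 3e12 * 8   # one 3 TB element
    ELEMENTS_READ = 6             # surviving elements read during the rebuild
    TRIALS = 500

    p_element = 1 - (1 - BER) ** BITS_PER_ELEMENT   # >=1 error on one element

    hits = sum(
        any(random.random() < p_element for _ in range(ELEMENTS_READ))
        for _ in range(TRIALS)
    )
    print(f"rebuild reads seeing >=1 bit error: {hits / TRIALS:.1%}")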
t5 The RAID controller removes element 3 from the array,
continues reconstructing the missing element on 1a,
and schedules Hs to be reconstructed once 1a is restored.
Array integrity: Partial, RAID has lost one element and is
rebuilding one element. Array can't sustain loss of any element.
+-------------------------+
| Hs 2 F 4 5 P Q 1a | F is faulted.
+-------------------------+ 1a is being reconstructed
| | | | | ^ from 2,4,5,P,Q.
| | | | R->^ Hs will be reconstructed next.
| | | R-->--^
| | R-->-----^
| R-->--------^
| ^
R-->--------------+
............................................................
. Here's the scenario with a Happy outcome. Everything .
. works as planned and the sysadmin is home for dinner. .
............................................................
t6 (Happy outcome)
The sysadmin replaces failed drive (element 3) with a new drive.
The RAID controller marks it ready for reconstruction but won't
start rebuilding it until the rebuild underway is complete.
Array integrity: Partial, RAID has lost one element and is
rebuilding one element. Array can't sustain loss of any element.
+-------------------------+
| Hs 2 3a 4 5 P Q 1a | 1a is being reconstructed.
+-------------------------+ 3a will be reconstructed next.
| | | | | ^ Hs remains the hot standby.
| | | | R->^
| | | R-->--^
| | R-->-----^
| R-->--------^
| ^
R-->--------------+
**************************************************************
* The sysadmin saddles the unicorn and rides off. Well done! *
**************************************************************
t7 (Happy outcome)
The rebuild of element 1a finishes at t1 + 5.75 hours.
The controller starts rebuilding element 3a from the
active elements 2,4,5,P,Q,1a.
144 terabits are transferred with no detected errors.
Array integrity: OK. Array can sustain loss of one element.
+-------------------------+
| Hs 2 3a 4 5 P Q 1a | 3a is being reconstructed.
+-------------------------+ 2,4,5,P,Q,1a are active.
| ^ | | | | | Hs will be reconstructed next.
| ^<-R | | | |
| ^--<--R | | |
| ^-----<--R | |
| ^--------<--R |
| ^ |
R->+-----------<--R
t8 (Happy outcome)
At t7 + 5.75 hours the rebuild of element 3a is done.
7 active elements are up, plus one available hot spare.
Array integrity: OK. Array can sustain loss of two elements.
+-------------------------+
| 1 2 3 4 5 P Q Hs | Elements 1..5 are data, P,Q are parity,
+-------------------------+ Hs is Hot standby.
.............................................................
. Now we'll work through a Sad scenario. The rebuilding .
. hits a slight (1 bit) snag. Cold pudding for dinner. .
.............................................................
t6 (Sad outcome)
The operator replaces failed drive (element 3) with a new drive.
The RAID controller marks it ready for reconstruction but won't
start rebuilding it until the current rebuild is finished.
Array integrity: Partial, RAID has lost one element and is
rebuilding one element, and will rebuild a 2nd element soon.
Array can't sustain loss of an element.
+-------------------------+
| Hs 2 3a 4 5 P Q 1a | 1a is being reconstructed.
+-------------------------+ 3a is marked to be reconstructed.
| | | | | ^
| | | | R->^
| | | R-->--^
| | R-->-----^
| R-->--------^
| ^
R-->--------------+
t7 (Sad outcome)
During the rebuild of 1a, a *second* drive (element 4) reports
an uncorrectable single-bit error to the RAID controller.
The RAID controller removes element 4 from the array.
The array is failed and goes offline. Probably, with some
effort, all data except one stripe could be reconstructed
using the faulted element formerly known as 3, but it's
complicated, a bit risky and unusual, and takes time.
How likely is this? Using a Monte Carlo simulation (assuming bit
errors on multiple drives are uncorrelated and normally
distributed in the bitstream) with 500 trials, the results at
the 95% confidence level are 36.7 to 66.2 (mean 47.2) rebuilds
between hitting 2 bit errors from 2 different drives. (A sketch
of this kind of simulation follows the diagram below.)
+-------------------------+
| Hs 2 3a F 5 P Q 1a | F is faulted.
+-------------------------+ Hs is marked for rebuilding.
| | | | ^ 3a is new and marked for rebuilding.
| | | R->^
| | R-->--^
| R-->-----^
| ^
| ^
R-->--------------+
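A sketch of how such a simulation can be set up. The per-element error
probability is an assumption (the simulation inputs aren't given above), and
it dominates the answer, so the constant below is purely illustrative:

    import random

    # Monte Carlo: rebuild reads between seeing unrecoverable errors on two
    # *different* surviving elements during the same rebuild.
    P_ELEMENT = 0.04   # assumed per-element error probability per full read;
                       # the result scales roughly as 1/P_ELEMENT**2
    ELEMENTS_READ = 6  # surviving elements read during a rebuild
    TRIALS = 500

    def rebuilds_until_double_error():
        n = 0
        while True:
            n += 1
            errored = sum(random.random() < P_ELEMENT for _ in range(ELEMENTS_READ))
            if errored >= 2:
                return n

    samples = [rebuilds_until_double_error() for _ in range(TRIALS)]
    print(f"mean rebuilds between double errors: {sum(samples) / len(samples):.1f}")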
t8 (Sad outcome)
Restore the array from backups. Up to 28.75 hours is
needed to write 3TB x 5 data elements to a RAID6, so maybe half that
is actually needed because the array isn't full, right?
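For reference, the 28.75-hour figure is the same capacity-over-throughput
arithmetic, again assuming the restore can sustain 145 Mbytes/sec end to end
(it usually can't):

    # Best-case time to write 5 x 3 TB of data back, at the same assumed
    # 145 Mbytes/sec sustained rate used for the rebuild estimate.
    data_bytes = 5 * 3e12
    rate_bytes_per_sec = 145e6
    hours = data_bytes / rate_bytes_per_sec / 3600
    print(f"full restore, best case: {hours:.2f} hours")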