> > drive mfgrs don't report write error rates. i would consider any
> > drive with write errors to be dead as fried chicken. a more
> > interesting question is what is the chance you can read the
> > written data back correctly. in that case with desktop drives,
> > you have a
> > 8 bits/byte * 1e12 bytes / 1e14 bits/ure = 8%
>
> Isn't that the probability of getting a bad sector when you
> read a terabyte? In other words, this is not related to the
> disk size but how much you read from the given disk. Granted
> that when you "resilver" you have no choice but to read the
> entire disk and that is why just one redundant disk is not
> good enough for TB size disks (if you lose a disk there is 8%
> chance you copied a bad block in resilvering a mirror).
see below. i think you're confusing the single-disk 8% chance of a
read failure with the 3 disk tb array, which has about a 1e-7% chance of failure.
i would think this is acceptable. at these low levels, something
else is going to get you, like drives failing non-independently,
say because of power problems.
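the single-disk figure above is easy to sanity-check in a few lines of
python, plugging in the same assumed numbers (1e12-byte desktop drive,
quoted unrecoverable-read-error rate of 1 per 1e14 bits read):

```python
# rough expected ure count for one full read of a desktop drive,
# using the numbers assumed in this thread
BITS_PER_BYTE = 8
DISK_BYTES = 1e12      # 1 tb drive
URE_PER_BIT = 1e-14    # 1 ure per 1e14 bits read

p_disk = BITS_PER_BYTE * DISK_BYTES * URE_PER_BIT
print(p_disk)          # ~ 0.08, the 8% figure above
```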
> > i'm a little too lazy to calculate what the probability is that
> > another sector in the row is also bad. (this depends on
> > stripe size, the number of disks in the raid, etc.) but it's
> > safe to say that it's pretty small. for a 3 disk raid 5 with
> > 64k stripes it would be something like
> > 8 bits/byte * 64k *3 / 1e14 = 1e-8
>
> The read error prob. for a 64K byte stripe is 3*2^19/10^14 ~=
> 3*0.5E-8, since three 64k byte blocks have to be read. The
> unrecoverable case is two of them being bad at the same time.
> The prob. of this is 3*0.25E-16 (not sure I did this right --
thanks for noticing that. i think i didn't explain myself well:
i was calculating the rough probability of a ure in reading the
*whole array*, not just one stripe.
to do this more methodically using your method, we need
to count up all the possible ways of getting a double fail
with 3 disks and multiply by the probability of getting that
sort of failure and then add 'em up. if 0 is ok and 1 is fail,
then i think there are these cases:
0 0 0
1 0 0
0 1 0
0 0 1
1 1 0
1 0 1
0 1 1
1 1 1
so there are 4 ways to fail. the 3 double fails have a combined probability of
3*(2^19 bits * 1e-14 1/bit)^2
and the triple fail has a probability of
(2^19 bits * 1e-14 1/bit)^3
so we have
3*(2^19 bits * 1e-14 1/bit)^2 + (2^19 bits * 1e-14 1/bit)^3 ~=
3*(2^19 bits * 1e-14 1/bit)^2
= 8.24633720832e-17
that's per stripe. if we multiply by 1e12/(64*1024) stripes/array,
we have
= 1.2582912e-09
which is remarkably close to my lousy first guess. so we went
from 8e-2 to 1e-9 for an improvement of 7 orders of magnitude.
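the arithmetic above can be checked with a short python sketch using
the same assumed numbers (64k stripes, 3 disks, 1e-14 ure/bit, 1e12-byte
array); this is just the sums from this thread, nothing
implementation-specific:

```python
# per-stripe and per-array unrecoverable-failure probability for a
# 3 disk raid 5, using the numbers assumed in this thread
URE_PER_BIT = 1e-14
CHUNK_BITS = 2 ** 19            # 64 KiB = 2^16 bytes = 2^19 bits

p = CHUNK_BITS * URE_PER_BIT    # chance one disk's chunk has a ure

# a stripe is lost when at least 2 of the 3 chunks are bad:
# the 3 double-fail cases plus the 1 triple-fail case
p_stripe = 3 * p ** 2 + p ** 3

stripes = 1e12 / (64 * 1024)    # stripes in a 1e12-byte array
p_array = p_stripe * stripes

print(p_stripe)                 # ~ 8.25e-17
print(p_array)                  # ~ 1.26e-09
```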
> we have to consider the exact same sector # going bad in two
> of the three disks and there are three such pairs).
the exact sector doesn't matter. i don't know of any
implementations that try to do partial-stripe recovery.
- erik