> while it is true that some disk vintages are better than others, when
> one drive fails, the probability of the other drives failing has not
> changed.  this is the same as if you flip a coin ten times and get ten
> heads: the probability of flipping the same coin and getting heads is
> still 1/2.

They are not independent events since they see similar wear
and tear and experience the same environmental stress.  In
fact, by the 2nd/3rd year they are starting to approach the
wear-out end of the bathtub curve, where failure rates rise
again.

The disk industry needs to hire real actuaries.  Or maybe
Google will!

> 
> >> i think this correlation gives people the false impression that they do
> >> fail en masse, but that's really wrong.  the latent errors probably
> >> happened months ago.
> > 
> > Yes but if there are many latent errors and/or the error rate
> > is going up it is time to replace it.
> 
> maybe.  the google paper you cited didn't find a strong correlation
> between smart errors (including block relocation) and failure.

Actually they did find a strong correlation for some of the
SMART parameters.  I think what they said was that over 36%
of failed disks had no SMART errors of certain kinds
(possibly because the drive firmware didn't actually count
those errors, but it seems they didn't investigate this).
If you do see more soft errors, there is definitely
something to worry about.

> > This is a good idea.  We did this in 1983, back when disks
> > were simpler beasts.  No RAID then of course.
> 
> an even better idea back then.  disks didn't have 1/4 million
> lines of firmware relocating blocks and doing other things to^w
> i mean for you.

I preferred doing bad block forwarding etc. in 100 or so
lines of C code in the OS but it was clear even then that
disk vendors were going to make their disks "smarter".
