> while it is true that some disk vintages are better than others, when
> one drive fails, the probability of the other drives failing has not
> changed. this is the same as if you flip a coin ten times and get ten
> heads: the probability of flipping the same coin and getting heads is
> still 1/2.

They are not independent events, since the drives see similar wear and
tear and experience the same environmental stress. In fact, by the
2nd/3rd year they are starting to approach the other end of the bathtub
curve of failures. The disk industry needs to hire real actuaries. Or
maybe Google will!
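To make the non-independence point concrete, here is a toy simulation.
The numbers are invented for illustration, not taken from any real
drive data: two drives share a rack, half the racks run hot, and hot
racks eat drives. Once you condition on one failure, the other drive's
odds go up, because a failure is evidence that you are in a hot rack.

    /* toy model: two drives in the same rack; half the racks run hot.
     * the 10% and 1% failure rates are assumed, purely for illustration. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const int trials = 1000000;
        int a_fail = 0, b_fail = 0, both = 0;

        srand(1);
        for (int i = 0; i < trials; i++) {
            int hot = rand() % 2;               /* shared environment */
            double p = hot ? 0.10 : 0.01;       /* assumed failure rates */
            int a = (double)rand() / RAND_MAX < p;
            int b = (double)rand() / RAND_MAX < p;
            a_fail += a;
            b_fail += b;
            both   += a && b;
        }
        printf("P(B fails)            = %.3f\n", (double)b_fail / trials);
        printf("P(B fails | A failed) = %.3f\n", (double)both / a_fail);
        return 0;
    }

With these made-up rates the conditional probability comes out nearly
double the unconditional one. The coin analogy fails because the drives
share a hidden "coin": the environment they sit in.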
> >> i think this correlation gives people the false impression that they
> >> do fail en masse, but that's really wrong. the latent errors probably
> >> happened months ago.
> >
> > Yes, but if there are many latent errors and/or the error rate
> > is going up, it is time to replace it.
>
> maybe. the google paper you cited didn't find a strong correlation
> between smart errors (including block relocation) and failure.

Actually, they did find a strong correlation for some of the SMART
parameters. I think what they said was that over 36% of failed disks
had no SMART errors of certain kinds (that could be because the drive
firmware didn't actually count those errors, but it seems they didn't
investigate this). If you do see more soft errors, there is definitely
something to worry about.

> > This is a good idea. We did this in 1983, back when disks
> > were simpler beasts. No RAID then of course.
>
> an even better idea back then. disks didn't have 1/4 million
> lines of firmware relocating blocks and doing other things to^w
> i mean for you.

I preferred doing bad block forwarding etc. in 100 or so lines of C
code in the OS, but it was clear even then that disk vendors were going
to make their disks "smarter".
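For the curious, the whole trick really does fit on a page. This is a
from-memory sketch of the idea, not the actual 1983 code; the table
size, spare-sector layout, and names are all mine:

    /* sketch of OS-level bad block forwarding: a small table maps bad
     * sectors to spares reserved at the end of the disk, and every I/O
     * translates its sector number through the table first. */
    #include <stdio.h>

    #define MAX_BAD     64          /* remap table size (assumed) */
    #define SPARE_BASE  1000000UL   /* first spare sector (assumed layout) */

    struct remap { unsigned long bad, spare; };

    static struct remap table[MAX_BAD];
    static int nbad;

    /* translate a logical sector to its physical location */
    unsigned long bb_translate(unsigned long sector)
    {
        for (int i = 0; i < nbad; i++)
            if (table[i].bad == sector)
                return table[i].spare;
        return sector;
    }

    /* called when a sector goes bad: forward it to the next spare */
    int bb_forward(unsigned long sector)
    {
        if (nbad >= MAX_BAD)
            return -1;              /* out of spares; replace the disk */
        table[nbad].bad = sector;
        table[nbad].spare = SPARE_BASE + nbad;
        nbad++;
        return 0;
    }

    int main(void)
    {
        bb_forward(12345);
        printf("sector 12345 -> %lu\n", bb_translate(12345));
        printf("sector 12346 -> %lu\n", bb_translate(12346));
        return 0;
    }

One nice side effect of doing it in the OS: the remap count was visible
to the administrator, which is exactly the "error rate is going up,
replace it" signal discussed above. Firmware that relocates blocks
silently takes that signal away from you.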
