On Tue, 4 Oct 2005, Douglas Alan wrote:
> Eric Smith <[EMAIL PROTECTED]> wrote: > >... > I've had two drives die at the same time on a RAID. Good thing the > RAID was only used as our backup server. I'd never trust RAID again to > be any kind of security against disk failure. > > |>oug > >From http://www.nber.org/sys-admin/linux-nas-raid.html : Why do drive failures come in pairs? Most of the drives in our NAS boxes and drive arrays claim a MTBF of 500,000 hours. That's about 2% per year. With three drives the chance of at least one failing is a little less than 6%. (1-(1-.98)^3). Our experience is that such numbers are at least a reasonable approximation of reality. (We especially like the 5400 RPM Maxtor 5A300J0 300GB drives for long life). Suppose you have three drives in a RAID 5. If it takes 24 hours to replace and reconstruct a failed drive, one is tempted to calculate that the chance of a second drive failing before full redundancy is established is about .02/365, or about one in a hundred thousand. The total probability of a double failure seems like it should be about 6 in a million per year. Our double failure rate is about 5 orders of magnitude worse than that - the majority of single drive failures are followed by a second drive failure before redundancy is established. This prevents rebuilding the array with a new drive replacing the original failed drive, however you can probably recover most files if you stay in degraded mode. It isn't that failures are correlated because drives are from the same batch, or the controller is at fault, or the environment is bad (common electrical spike or heat problem). The fault lies with the Linux md driver, which stops rebuilding parity after a drive failure at the first point it encounters a uncorrectable read error on the remaining "good" drives. Of course with two drives unavailable, there isn't an unambiguous reconstruction of the bad sector, so it might be best to go to the backups instead of continuing. At least that is the apparently the reason for the decision. Alternatively, if the first drive failed was readable on that sector, (even if not reading some other sectors) it should be possible to fully recover all the data with a high degree of confidence even if a second drive is failed later. Since that is far from an unusual situation (a drive will be failed for a single uncorrectable error even if further reads are possible on other sectors) it isn't clear to us why that isn't done. Even if that sector isn't readable, logging the bad block, writing zeroes to the targets, and going on might be better than simply giving up. A single unreadable sector isn't unusual among the tens of millions of sectors on a modern drive. If the sector has never been written to, there is no occaison for the drive electronics or the OS to even know it is bad. If the OS tried to write to it, the drive would automatically remap the sector and no damage would be done - not even a log entry. But that one bad sector will render the entire array unrecoverable no matter where on the disk it is if one other drive has already been failed. Let's repeat the reliability calculation with our new knowledge of the situation. In our experience perhaps half of drives have at least one unreadable sector in the first year. Again assume a 6 percent chance of a single failure. The chance of at least one of the remaining two drives having a bad sector is 75% (1-(1-.5)^2). So the RAID 5 failure rate is about 4.5%/year, which is .5% MORE than the 4% failure rate one would expect from a two drive RAID 0 with the same capacity. Alternatively, if you just had two drives with a partition on each and no RAID of any kind, the chance of a failure would still be 4%/year but only half the data loss per incident, which is considerably better than the RAID 5 can even hope for under the current reconstruction policy even with the most expensive hardware. We don't know what the reconstruction policy is for other raid controllers, drivers or NAS devices. None of the boxes we bought acknowledged this "gotcha" but none promised to avoid it either. We assume Netapp and ECCS have this under control, since we have had several single drive failures on those devices with no difficulty resyncing. We have not had a single drive failure yet in the MVD based boxes, so we really don't know what they will do. Some mitigation of the danger is possible. You could read and write the entire drive surface periodically, and replace any drives with even a single uncorrectable block visible. A daemon <a href="http://sourceforge.net/projects/smartmontools/"> Smartd </a> is available for Linux that will scan the disk in background for errors and report them. We had been running that, but ignored errors on unwritten sectors, because we were used to such errors disappearing when the sector was written (and the bad sector remapped). Our current inclination is to shift to a recent 3ware controller, which we understand has a "continue on error" rebuild policy available as an option in the array setup. But we would really like to know more about just what that means. What do the apparently similar RAID controllers from Mylex, LSI Logic and Adaptec do about this? A look at their web sites reveals no information. Daniel Feenberg feenberg isat nber dotte org _______________________________________________ bblisa mailing list [email protected] http://www.bblisa.org/mailman/listinfo/bblisa
