> Hi,
>
> On 2/05/2013 12:06 AM, James Harper wrote:
> > I have just had two drives fail in a server today. One is mostly part of
> > a RAID0 set (which is in turn part of a DRBD, so we're still good) and a
> > small partition that is part of a RAID1, which hasn't been failed (errors
> > are about 1.3TB along a 2TB disk). The other is one I was testing; it
> > wasn't particularly new and doesn't really matter.
> >
> > Both drives have logged read errors under the Linux kernel, both report a
> > healthy status (SMART overall-health self-assessment test result: PASSED),
> > and both say "Completed: read failure" almost immediately when I do a
> > SMART self test (short or long).
> >
> > I don't really have any trouble with the fact that two drives have failed,
> > but I'm really surprised that SMART still reports that the drive is good
> > when it is clearly not... what's with that?
>
> This from Google:
>
> "Our analysis identifies several parameters from the drive's self
> monitoring facility (SMART) that correlate highly with failures. Despite
> this high correlation, we conclude that models based on SMART parameters
> alone are unlikely to be useful for predicting individual drive failures.
> Surprisingly, we found that temperature and activity levels were much less
> correlated with drive failures than previously reported."
>
> In a nutshell, SMART is not a good indicator of pending failure ... use it
> as an indication only, but certainly don't count on it. But really, SMART
> is next to useless overall, so it isn't even much of a "real" indicator....
> YMMV.
It's frustrating because a simple rule like "if hard read errors > 0 ||
failed self tests > 0 then drive = not okay" would have meant I could just
read the SMART health indicator and eject the drive from the array (or
whatever it belonged to); something along the lines of the sketch below.

James
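A rough sketch of that rule, shelling out to smartctl from smartmontools.
The attribute IDs it watches, the field layout of the attribute table, and
the "read failure" wording in the self-test log are assumptions that vary
between drives and firmware, so treat it as illustrative only, not as how
SMART's own health verdict is computed.

#!/usr/bin/env python3
"""Sketch of the "hard read errors > 0 || failed self tests > 0" rule,
built on smartctl.  Attribute IDs and output wording vary by drive."""

import subprocess
import sys

# SMART attributes whose raw value should stay at 0 on a drive you trust.
# Which IDs matter is an assumption; pick the ones your drives expose.
SUSPECT_ATTRIBUTES = {
    "5": "Reallocated_Sector_Ct",
    "187": "Reported_Uncorrect",
    "197": "Current_Pending_Sector",
    "198": "Offline_Uncorrectable",
}

def smartctl(args, device):
    # smartctl encodes status bits in its exit code, so don't treat a
    # non-zero exit as fatal; just capture and parse stdout.
    out = subprocess.run(["smartctl", *args, device],
                         capture_output=True, text=True, check=False)
    return out.stdout

def problems(device):
    reasons = []

    # Failed self tests: the log prints e.g. "Completed: read failure".
    for line in smartctl(["-l", "selftest"], device).splitlines():
        if "read failure" in line:
            reasons.append("self-test log: " + line.strip())

    # "Hard read errors": any non-zero raw value on the suspect attributes.
    for line in smartctl(["-A"], device).splitlines():
        fields = line.split()
        if fields and fields[0] in SUSPECT_ATTRIBUTES and len(fields) > 9:
            raw = fields[9]
            if raw.isdigit() and int(raw) > 0:
                reasons.append(f"{SUSPECT_ATTRIBUTES[fields[0]]} = {raw}")

    return reasons

if __name__ == "__main__":
    dev = sys.argv[1] if len(sys.argv) > 1 else "/dev/sda"
    bad = problems(dev)
    if bad:
        print(f"{dev}: NOT okay")
        for reason in bad:
            print("  " + reason)
        sys.exit(1)
    print(f"{dev}: nothing flagged (which, as above, proves little)")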
