On Fri, Nov 21, 2014 at 11:06:19AM -0700, Chris Murphy wrote:
> On Fri, Nov 21, 2014 at 10:42 AM, Zygo Blaxell <zblax...@furryterror.org>
> wrote:
>
> > I run 'smartctl -t long' from cron overnight (or whenever the drives
> > are most idle).  You can also set up smartd.conf to launch the self
> > tests; however, the syntax for test scheduling is byzantine compared to
> > cron (and that's saying something!).  On multi-drive systems I schedule
> > a different drive for each night.
> >
> > If you are also doing btrfs scrub, then stagger the scheduling so
> > e.g. smart runs in even weeks and btrfs scrub runs in odd weeks.
> >
> > smartd is OK for monitoring test logs and email alerts.  I've had no
> > problems there.
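To make the staggering concrete, here is a sketch of what the cron side
can look like (device names, mount point, and times are placeholders,
and the ISO-week parity test is just one way to get the even/odd split):

  # /etc/cron.d/disk-tests -- illustrative sketch only.
  # %-V is the unpadded ISO week number (unpadded so sh arithmetic
  # doesn't read "08"/"09" as octal); % is escaped because cron
  # would otherwise treat it as a newline.

  # Long SMART self-test, one drive per night, even weeks only:
  30 3 * * 1  root  [ $(( $(date +\%-V) \% 2 )) -eq 0 ] && smartctl -t long /dev/sda
  30 3 * * 2  root  [ $(( $(date +\%-V) \% 2 )) -eq 0 ] && smartctl -t long /dev/sdb
  30 3 * * 3  root  [ $(( $(date +\%-V) \% 2 )) -eq 0 ] && smartctl -t long /dev/sdc

  # btrfs scrub in odd weeks, Saturday night:
  30 3 * * 6  root  [ $(( $(date +\%-V) \% 2 )) -eq 1 ] && btrfs scrub start -Bd /mnt/pool

smartctl -t long just kicks the test off and returns; the drive runs it
internally, and the results show up later in 'smartctl -l selftest'.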
> Most attributes are always updated without issuing a smart test of any
> kind. A drive I have here only has four offline updateable attributes.

One of those four is Offline_Uncorrectable, which is a really important
attribute to monitor!

> When it comes to bad sectors, the drive won't use a sector that
> persistently fails writes. So you don't really have to worry about
> latent bad sectors that don't have data on them already. The sectors
> you care about are the ones with data. A scrub reads all of those
> sectors.

A scrub reads all the _allocated_ sectors.  A long selftest reads
_everything_, and it also exercises the electronics and mechanics of the
drive in ways that normal operation doesn't.  I have several disks that
are less than 25% occupied, which means scrubs ignore more than 75% of
the disk surface at any given time.

A sharp increase in the number of bad sectors (no matter how they are
detected) usually means a total drive failure is coming.  Many drives
have been nice enough to give me enough warning that the RMA replacement
arrived just a few hours before the drive failed completely.

> First the drive could report a read error in which case Btrfs
> raid1/10, and any (md, lvm, hardware) raid can use mirrored data, or
> rebuild it from parity, and write to the affected sector; and also
> this same mechanism happens in normal reads so it's a kind of passive
> scrub. But it happens to miss checking inactively read data, which a
> scrub will check.
>
> Second, the drive could report no problem, and Btrfs raid1/10 could
> still fix the problem in case of a csum mismatch. And it looks like
> soonish we'll see this apply to raid5/6.
>
> So I think a nightly long smart test is a bit overkill. I think you
> could do nightly -t short tests which will report problems scrub won't
> notice, such as higher seek times or lower throughput performance. And
> then scrub once a week.

Drives quite often drop a sector or two over the years, and it can be
harmless.  What you want to watch out for is hundreds of bad sectors
showing up over a period of a few days--that means something is rattling
around on the disk platters, damaging the hardware as it goes.  To get
that data, you have to test the disks every few days.

> > The drive itself could be failing in some way that prevents recording
> > SMART errors (e.g. because of host timeouts triggering a bus reset,
> > which also prevents the SMART counter update for what was going wrong
> > at the time).  This is unfortunately quite common, especially with
> > drives configured for non-RAID workloads.
>
> Libata resetting the link should be recorded in kernel messages.

This is true, but the original question was about SMART data coverage,
which is why it's important to monitor both.
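On the SMART side, a single smartd.conf line roughly like the one below
covers the attribute monitoring and mail alerts mentioned above (device,
schedule, and address are placeholders; the -s regex is the byzantine
self-test schedule syntax I was complaining about):

  # /etc/smartd.conf -- illustrative sketch only.
  # -a      track health, attributes, and the error/self-test logs
  #         (this is what catches Current_Pending_Sector and
  #         Offline_Uncorrectable creeping up between tests)
  # -o on   enable the drive's automatic offline data collection
  # -s ...  self-test schedule: short test daily at 02:00,
  #         long test every Saturday at 03:00
  # -m root mail a report when something changes for the worse
  /dev/sda -a -o on -s (S/../.././02|L/../../6/03) -m root

A jump of hundreds in the pending/uncorrectable counts over a few days
is exactly the kind of mail you don't want to ignore.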
> --
> Chris Murphy