On Fri, Nov 21, 2014 at 11:06:19AM -0700, Chris Murphy wrote:
> On Fri, Nov 21, 2014 at 10:42 AM, Zygo Blaxell <zblax...@furryterror.org> 
> wrote:
> 
> > I run 'smartctl -t long' from cron overnight (or whenever the drives
> > are most idle).  You can also set up smartd.conf to launch the self
> > tests; however, the syntax for test scheduling is byzantine compared to
> > cron (and that's saying something!).  On multi-drive systems I schedule
> > a different drive for each night.
> >
> > If you are also doing btrfs scrub, then stagger the scheduling so
> > e.g. smart runs in even weeks and btrfs scrub runs in odd weeks.
> >
> > smartd is OK for monitoring test logs and email alerts.  I've had no
> > problems there.
> 
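
For reference, the "byzantine" smartd.conf scheduling syntax mentioned
above looks something like this (the device and schedule are only
examples):

    /dev/sda -a -m root -s L/../../7/03

That schedules a long self-test every Sunday at 03:00 and mails
warnings to root; the T/MM/DD/d/HH pattern is documented in
smartd.conf(5).
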
> Most attributes are always updated without issuing a smart test of any
> kind. A drive I have here only has four offline updateable attributes.

One of those four is Offline_Uncorrectable, which is a really important
attribute to monitor!
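
Something like this is enough to keep an eye on it (the device name is
just an example):

    # raw counts for the attributes that matter most here
    smartctl -A /dev/sda | \
        grep -E 'Offline_Uncorrectable|Current_Pending_Sector'

A nonzero raw value that keeps growing is bad news.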

> When it comes to bad sectors, the drive won't use a sector that
> persistently fails writes. So you don't really have to worry about
> latent bad sectors that don't have data on them already. The sectors
> you care about are the ones with data. A scrub reads all of those
> sectors.

A scrub reads all the _allocated_ sectors.  A long selftest reads
_everything_, and also exercises the electronics and mechanics of the
drive in ways that normal operation doesn't.  I have several disks that
are less than 25% occupied, which means scrubs ignore over 75% of the
disk surface at any given time.
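
The crontab for the one-drive-per-night rotation quoted above is
straightforward (device names and times are illustrative):

    # /etc/cron.d/smart-longtest: one long self-test per night,
    # a different drive each night, while the machines are idle
    0 3 * * 1 root smartctl -t long /dev/sda
    0 3 * * 2 root smartctl -t long /dev/sdb
    0 3 * * 3 root smartctl -t long /dev/sdc

smartctl -t long only queues the test; it returns immediately and the
drive runs the test itself in the background.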

A sharp increase in the number of bad sectors (no matter how they are
detected) usually means total drive failure is coming.  Many drives
have given me enough warning that the RMA replacement arrived just a
few hours before the drive failed completely.
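
A minimal sketch of that kind of check; the state file path and the
mail alert are my assumptions, adjust to taste:

    #!/bin/sh
    # warn if the reallocated-sector raw count grew since the last run
    DEV=/dev/sda                               # example device
    STATE=/var/tmp/realloc.$(basename "$DEV")  # previous count lives here
    NEW=$(smartctl -A "$DEV" | awk '/Reallocated_Sector_Ct/ {print $10}')
    OLD=$(cat "$STATE" 2>/dev/null || echo 0)
    if [ "${NEW:-0}" -gt "$OLD" ]; then
        echo "$DEV: reallocated sectors went from $OLD to $NEW" \
            | mail -s "SMART warning: $DEV" root
    fi
    echo "${NEW:-0}" > "$STATE"

Run it from cron every night or two and a sudden jump shows up as mail
long before the drive gives up.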

> First, the drive could report a read error, in which case Btrfs
> raid1/10, and any (md, lvm, hardware) RAID, can use mirrored data or
> rebuild it from parity, and rewrite the affected sector; this same
> mechanism kicks in on normal reads, so it's a kind of passive
> scrub.  But it misses data that is never read in normal operation,
> which a scrub will check.
> 
> Second, the drive could report no error, and Btrfs raid1/10 could
> still fix the data on a csum mismatch.  And it looks like soonish
> we'll see this apply to raid5/6.
> 
> So I think a nightly long smart test is a bit overkill.  You could
> instead do nightly -t short tests, which will report problems scrub
> won't notice, such as higher seek times or lower throughput.  And
> then scrub once a week.

Drives quite often drop a sector or two over the years, and it can
be harmless.  What you want to watch out for is hundreds of bad
sectors showing up over a period of a few days--that means something is
rattling around on the disk platters, damaging the hardware as it goes.
To catch that pattern, you have to test the disks every few days.
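
Since cron has no notion of week numbers, the even/odd-week split from
the top of this thread needs a small wrapper script (paths, device and
mountpoint are examples):

    #!/bin/sh
    # /usr/local/sbin/disk-check-weekly: run from cron every Sunday.
    # Even ISO weeks: SMART long test; odd weeks: btrfs scrub.
    week=$(date +%V)
    if [ $(( ${week#0} % 2 )) -eq 0 ]; then
        smartctl -t long /dev/sda
    else
        btrfs scrub start /mnt/data
    fi

The ${week#0} strips the leading zero so weeks 08 and 09 aren't
misread as octal by the shell's arithmetic.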

> > The drive itself could be failing in some way that prevents recording
> > SMART errors (e.g. because of host timeouts triggering a bus reset,
> > which also prevents the SMART counter update for what was going wrong at
> > the time).  This is unfortunately quite common, especially with drives
> > configured for non-RAID workloads.
> 
> Libata resetting the link should be recorded in kernel messages.

This is true, but the original question was about SMART data coverage:
a bus reset can prevent the drive from ever logging the error in SMART,
so the kernel log may be the only record of it.  That's why it's
important to monitor both.
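
In practice that means checking both places, e.g. (device name is an
example):

    # the drive's own view: error log and self-test log
    smartctl -l error -l selftest /dev/sda
    # the kernel's view: resets and timeouts the drive may never log
    dmesg | grep -iE 'ata[0-9]+.*(reset|timeout)'

A reset in dmesg next to a clean SMART error log is exactly the
failure-the-drive-couldn't-record case described above.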

> -- 
> Chris Murphy
