Package: smartmontools
Version: 5.41+svn3365-1~bpo60+1
Severity: important

On a server under high I/O load, a periodic SMART short self-test sometimes 
fails to complete even after 6+ hours, although it normally completes in under 
5 minutes.
The server in question has a complex disk layout: several RAID levels on the 
same set of HDDs, with LVM over the RAID arrays and LVM over bare partitions.
The server has 3 identical drives: Western Digital RE3 Serial ATA 
(WD1002FBYS-02A6B0) with firmware 03.00C06.

Two (sda, sdb) of those 3 drives have significantly higher load than the third 
(sdc).

Short self-tests are configured to run once a week.

Here is some history:

# zgrep -in self-test /var/log/daemon.log*
/var/log/daemon.log.1:33902:Nov 24 02:23:20 axf smartd[3749]: Device: /dev/sda 
[SAT], starting scheduled Short Self-Test.
/var/log/daemon.log.1:33904:Nov 24 02:23:20 axf smartd[3749]: Device: /dev/sdb 
[SAT], starting scheduled Short Self-Test.
/var/log/daemon.log.1:33905:Nov 24 02:23:20 axf smartd[3749]: Device: /dev/sdc 
[SAT], starting scheduled Short Self-Test.
/var/log/daemon.log.1:34011:Nov 24 02:53:33 axf smartd[3749]: Device: /dev/sda 
[SAT], self-test in progress, 10% remaining
/var/log/daemon.log.1:34014:Nov 24 02:53:44 axf smartd[3749]: Device: /dev/sdb 
[SAT], self-test in progress, 10% remaining
/var/log/daemon.log.1:34015:Nov 24 02:53:49 axf smartd[3749]: Device: /dev/sdc 
[SAT], previous self-test completed without error
/var/log/daemon.log.1:34958:Nov 24 06:23:20 axf smartd[3749]: Device: /dev/sda 
[SAT], previous self-test was aborted by the host
/var/log/daemon.log.1:34960:Nov 24 06:23:20 axf smartd[3749]: Device: /dev/sdb 
[SAT], previous self-test was aborted by the host

/var/log/daemon.log.2.gz:37189:Nov 17 02:23:20 axf smartd[3749]: Device: 
/dev/sda [SAT], starting scheduled Short Self-Test.
/var/log/daemon.log.2.gz:37190:Nov 17 02:23:21 axf smartd[3749]: Device: 
/dev/sdb [SAT], starting scheduled Short Self-Test.
/var/log/daemon.log.2.gz:37191:Nov 17 02:23:21 axf smartd[3749]: Device: 
/dev/sdc [SAT], starting scheduled Short Self-Test.
/var/log/daemon.log.2.gz:37248:Nov 17 02:53:20 axf smartd[3749]: Device: 
/dev/sda [SAT], previous self-test completed without error
/var/log/daemon.log.2.gz:37250:Nov 17 02:53:20 axf smartd[3749]: Device: 
/dev/sdb [SAT], previous self-test completed without error
/var/log/daemon.log.2.gz:37251:Nov 17 02:53:20 axf smartd[3749]: Device: 
/dev/sdc [SAT], previous self-test completed without error

/var/log/daemon.log.3.gz:33914:Nov 10 02:29:18 axf smartd[11775]: Device: 
/dev/sda [SAT], starting scheduled Short Self-Test.
/var/log/daemon.log.3.gz:33915:Nov 10 02:29:18 axf smartd[11775]: Device: 
/dev/sdb [SAT], starting scheduled Short Self-Test.
/var/log/daemon.log.3.gz:33916:Nov 10 02:29:18 axf smartd[11775]: Device: 
/dev/sdc [SAT], starting scheduled Short Self-Test.
/var/log/daemon.log.3.gz:34068:Nov 10 02:59:27 axf smartd[11775]: Device: 
/dev/sda [SAT], self-test in progress, 10% remaining
/var/log/daemon.log.3.gz:34071:Nov 10 02:59:59 axf smartd[11775]: Device: 
/dev/sdb [SAT], self-test in progress, 10% remaining
/var/log/daemon.log.3.gz:34072:Nov 10 03:00:01 axf smartd[11775]: Device: 
/dev/sdc [SAT], previous self-test completed without error
/var/log/daemon.log.3.gz:35483:Nov 10 08:29:18 axf smartd[11775]: Device: 
/dev/sda [SAT], previous self-test was aborted by the host
/var/log/daemon.log.3.gz:35484:Nov 10 08:29:18 axf smartd[11775]: Device: 
/dev/sdb [SAT], previous self-test was aborted by the host

/var/log/daemon.log.4.gz:32738:Nov  3 02:29:18 axf smartd[11775]: Device: 
/dev/sda [SAT], starting scheduled Short Self-Test.
/var/log/daemon.log.4.gz:32739:Nov  3 02:29:18 axf smartd[11775]: Device: 
/dev/sdb [SAT], starting scheduled Short Self-Test.
/var/log/daemon.log.4.gz:32740:Nov  3 02:29:18 axf smartd[11775]: Device: 
/dev/sdc [SAT], starting scheduled Short Self-Test.
/var/log/daemon.log.4.gz:32801:Nov  3 02:59:18 axf smartd[11775]: Device: 
/dev/sda [SAT], previous self-test completed without error
/var/log/daemon.log.4.gz:32802:Nov  3 02:59:18 axf smartd[11775]: Device: 
/dev/sdb [SAT], previous self-test completed without error
/var/log/daemon.log.4.gz:32804:Nov  3 02:59:18 axf smartd[11775]: Device: 
/dev/sdc [SAT], previous self-test completed without error

/var/log/daemon.log.5.gz:33087:Oct 27 01:12:33 axf smartd[23962]: Device: 
/dev/sda [SAT], starting scheduled Short Self-Test.
/var/log/daemon.log.5.gz:33088:Oct 27 01:12:33 axf smartd[23962]: Device: 
/dev/sdb [SAT], starting scheduled Short Self-Test.
/var/log/daemon.log.5.gz:33089:Oct 27 01:12:33 axf smartd[23962]: Device: 
/dev/sdc [SAT], starting scheduled Short Self-Test.
/var/log/daemon.log.5.gz:33271:Oct 27 01:42:37 axf smartd[23962]: Device: 
/dev/sda [SAT], self-test in progress, 10% remaining
/var/log/daemon.log.5.gz:33273:Oct 27 01:42:52 axf smartd[23962]: Device: 
/dev/sdb [SAT], self-test in progress, 10% remaining
/var/log/daemon.log.5.gz:33288:Oct 27 01:42:52 axf smartd[23962]: Device: 
/dev/sdc [SAT], previous self-test completed without error
/var/log/daemon.log.5.gz:33411:Oct 27 02:12:33 axf smartd[23962]: Device: 
/dev/sda [SAT], previous self-test completed without error
/var/log/daemon.log.5.gz:33412:Oct 27 02:12:33 axf smartd[23962]: Device: 
/dev/sdb [SAT], previous self-test completed without error

/var/log/daemon.log.6.gz:29186:Oct 20 01:12:33 axf smartd[23962]: Device: 
/dev/sda [SAT], starting scheduled Short Self-Test.
/var/log/daemon.log.6.gz:29187:Oct 20 01:12:33 axf smartd[23962]: Device: 
/dev/sdb [SAT], starting scheduled Short Self-Test.
/var/log/daemon.log.6.gz:29188:Oct 20 01:12:33 axf smartd[23962]: Device: 
/dev/sdc [SAT], starting scheduled Short Self-Test.
/var/log/daemon.log.6.gz:29310:Oct 20 01:42:33 axf smartd[23962]: Device: 
/dev/sda [SAT], previous self-test completed without error
/var/log/daemon.log.6.gz:29311:Oct 20 01:42:33 axf smartd[23962]: Device: 
/dev/sdb [SAT], previous self-test completed without error
/var/log/daemon.log.6.gz:29312:Oct 20 01:42:33 axf smartd[23962]: Device: 
/dev/sdc [SAT], previous self-test completed without error

/var/log/daemon.log.7.gz:23353:Oct 13 01:12:33 axf smartd[23962]: Device: 
/dev/sda [SAT], starting scheduled Short Self-Test.
/var/log/daemon.log.7.gz:23354:Oct 13 01:12:34 axf smartd[23962]: Device: 
/dev/sdb [SAT], starting scheduled Short Self-Test.
/var/log/daemon.log.7.gz:23355:Oct 13 01:12:34 axf smartd[23962]: Device: 
/dev/sdc [SAT], starting scheduled Short Self-Test.
/var/log/daemon.log.7.gz:23416:Oct 13 01:42:33 axf smartd[23962]: Device: 
/dev/sda [SAT], previous self-test completed without error
/var/log/daemon.log.7.gz:23417:Oct 13 01:42:33 axf smartd[23962]: Device: 
/dev/sdb [SAT], previous self-test completed without error
/var/log/daemon.log.7.gz:23418:Oct 13 01:42:33 axf smartd[23962]: Device: 
/dev/sdc [SAT], previous self-test completed without error


The drive with the lighter load (sdc) had no problems completing the short 
test, unlike the other two drives (sda, sdb).
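
To make the pattern in the history above easier to see, here is a minimal 
Python sketch that pairs each "starting scheduled" message with the next 
"previous self-test ..." message for the same device. It is only an 
illustration: the names summarize, LINE_RE and SAMPLE are hypothetical, the 
sample lines are the Nov 24 entries with the mail wrapping undone, and the 
year (which syslog does not record) is an assumption. Since smartd here polls 
every 1800 seconds, the minutes reported are when smartd noticed the result, 
not when the drive actually finished.

```python
import re
from datetime import datetime

# Matches one unwrapped smartd syslog line, with or without a zgrep "file:line:" prefix.
LINE_RE = re.compile(
    r"(?P<ts>[A-Z][a-z]{2}\s+\d+\s+\d\d:\d\d:\d\d)\s+\S+\s+smartd\[\d+\]:\s+"
    r"Device:\s+(?P<dev>\S+)\s+\[SAT\],\s+(?P<msg>.*)"
)

def summarize(lines, year=2012):
    """Return {(device, start_time): (outcome, minutes_until_smartd_reported_it)}."""
    pending = {}   # device -> start time of the test that has no result yet
    results = {}
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        # syslog timestamps carry no year; 'year' is an assumption of this sketch
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        dev, msg = m["dev"], m["msg"]
        if msg.startswith("starting scheduled"):
            pending[dev] = ts
        elif msg.startswith("previous self-test") and dev in pending:
            start = pending.pop(dev)
            outcome = "aborted" if "aborted" in msg else "completed"
            results[(dev, start)] = (outcome, (ts - start).total_seconds() / 60)
    return results

SAMPLE = [
    "Nov 24 02:23:20 axf smartd[3749]: Device: /dev/sda [SAT], starting scheduled Short Self-Test.",
    "Nov 24 02:23:20 axf smartd[3749]: Device: /dev/sdc [SAT], starting scheduled Short Self-Test.",
    "Nov 24 02:53:33 axf smartd[3749]: Device: /dev/sda [SAT], self-test in progress, 10% remaining",
    "Nov 24 02:53:49 axf smartd[3749]: Device: /dev/sdc [SAT], previous self-test completed without error",
    "Nov 24 06:23:20 axf smartd[3749]: Device: /dev/sda [SAT], previous self-test was aborted by the host",
]

results = summarize(SAMPLE)
for (dev, start), (outcome, minutes) in sorted(results.items()):
    print(f"{dev} {start:%b %d %H:%M}: {outcome} after {minutes:.0f} min")
```

Feeding it the full (unwrapped) zgrep output should show sdc completing at the 
first poll every week, while sda and sdb alternate between completed and 
aborted runs. Month names are parsed with %b, so this assumes an English or C 
locale for the Python process.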

As far as I can see, when I/O is very high (~80% disk utilization) the test 
becomes extremely slow, or even restarts or hangs. This makes the disks 
extremely slow and renders the whole system unusable: simple disk operations 
take >30 seconds to complete and the load average jumps above 80.

This also causes Samba clients to reconnect after timing out, which starts new 
smbd processes; the growing number of smbd processes runs the system out of 
memory. Swapping in this situation makes matters worse.

According to atop, the normal average I/O time is <5 ms; during a stuck 
self-test it jumps to ~100 ms, which is ~10 writes/second of ~10k each!
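
For scale, the atop figures can be turned into a rough throughput number. This 
is a back-of-the-envelope sketch only: it assumes "10k" means 10 KiB and that 
requests are serviced one at a time, and the variable names are made up for 
the example.

```python
# Figures quoted from the atop observation; "10k" is assumed to mean 10 KiB.
avg_io_ms_normal = 5     # upper bound reported for normal operation
avg_io_ms_stuck = 100    # approximate value during the stuck self-test
write_size_kib = 10

# With one request serviced at a time, 100 ms per request caps the rate at 10 requests/s.
writes_per_second = 1000 / avg_io_ms_stuck
throughput_kib_s = writes_per_second * write_size_kib
slowdown = avg_io_ms_stuck / avg_io_ms_normal

print(f"{writes_per_second:.0f} writes/s, ~{throughput_kib_s:.0f} KiB/s, "
      f"at least {slowdown:.0f}x slower per request")
```

Under those assumptions each affected disk moves only about 100 KiB/s while 
the self-test is stuck.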

Tested on: linux-image-3.2.0-0.bpo.2-amd64 (3.2.20-1~bpo60+1) with 
smartmontools 5.39.1+svn3124-2 and 5.41+svn3365-1~bpo60+1.
The results are the same with both versions.

There are existing bug reports describing this situation:
https://bugzilla.redhat.com/show_bug.cgi?id=503344
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=503439
http://www.mail-archive.com/freebsd-hackers@freebsd.org/msg67741.html
http://pl.digipedia.org/usenet/thread/19509/1268/#post1206

-- Package-specific info:
Output of /usr/share/bug/smartmontools:

-- System Information:
Debian Release: 6.0.6
  APT prefers stable
  APT policy: (500, 'stable')
Architecture: amd64 (x86_64)

Kernel: Linux 3.2.0-0.bpo.2-amd64 (SMP w/8 CPU cores)
Locale: LANG=ru_UA.UTF-8, LC_CTYPE=ru_UA.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/bash

Versions of packages smartmontools depends on:
ii  debianutils             3.4              Miscellaneous utilities specific t
ii  libc6                   2.11.3-4         Embedded GNU C Library: Shared lib
ii  libcap-ng0              0.6.4-1          An alternate posix capabilities li
ii  libgcc1                 1:4.4.5-8        GCC support library
ii  libselinux1             2.0.96-1         SELinux runtime shared libraries
ii  libstdc++6              4.4.5-8          The GNU Standard C++ Library v3
ii  lsb-base                3.2-23.2squeeze1 Linux Standard Base 3.2 init scrip

Versions of packages smartmontools recommends:
ii  bsd-mailx [mailx]  8.1.2-0.20100314cvs-1 simple mail user agent
ii  heirloom-mailx [ma 12.4-2                feature-rich BSD mail(1)

Versions of packages smartmontools suggests:
pn  gsmartcontrol                 <none>     (no description available)
pn  smart-notifier                <none>     (no description available)

-- Configuration Files:
/etc/default/smartmontools changed:
start_smartd=yes
smartd_opts="--interval=1800"

/etc/smartd.conf changed:
DEVICESCAN -s (S/../../6/02) -m root -M exec /usr/share/smartmontools/smartd-runner
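
For readers unfamiliar with the -s directive: smartd.conf(5) describes its 
argument as a regular expression matched against a string of the form 
T/MM/DD/d/HH, where T is the test type (S for short) and d is the day of the 
week (1 = Monday ... 7 = Sunday). So (S/../../6/02) requests a short test on 
Saturdays in the 02:00 hour. The Python sketch below only illustrates that 
matching; the helper name due is hypothetical, Python's re stands in for the 
POSIX extended regex smartd uses, and requiring a full match is an assumption 
of the example.

```python
import re
from datetime import datetime

SCHEDULE = r"(S/../../6/02)"   # the -s argument from /etc/smartd.conf above

def due(test_type, when, schedule=SCHEDULE):
    """Build the T/MM/DD/d/HH string for 'when' and test it against the schedule."""
    # isoweekday(): Monday=1 ... Sunday=7, the numbering smartd.conf(5) documents
    stamp = f"{test_type}/{when:%m}/{when:%d}/{when.isoweekday()}/{when:%H}"
    return re.fullmatch(schedule, stamp) is not None

saturday_0223 = datetime(2012, 11, 24, 2, 23)   # year assumed; Nov 24 was a Saturday
sunday_0223 = datetime(2012, 11, 25, 2, 23)

print(due("S", saturday_0223))   # short test, Saturday, hour 02
print(due("S", sunday_0223))     # same hour on Sunday
print(due("L", saturday_0223))   # long test is not scheduled by this expression
```

Only the first call matches the schedule.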


-- no debconf information

