detecting bad disks

Derick Siddoway Thu, 08 Nov 2007 07:47:06 -0800

Trying to copy a file from one filesystem to another, I kept getting
input/output errors.  I noticed these messages in the logs:


wd1a:  uncorrectable data error reading fsbn 768416 of 768384-0 (wd1 bn 768416; cn 762 tn 5 sn 5), retrying
wd1a:  uncorrectable data error reading fsbn 768416 of 768384-0 (wd1 bn 768416; cn 762 tn 5 sn 5), retrying
wd1a:  uncorrectable data error reading fsbn 768416 of 768384-0 (wd1 bn 768416; cn 762 tn 5 sn 5), retrying
wd1a:  uncorrectable data error reading fsbn 768416 of 768384-0 (wd1 bn 768416; cn 762 tn 5 sn 5), retrying
wd1a:  uncorrectable data error reading fsbn 768417 of 768384-0 (wd1 bn 768417; cn 762 tn 5 sn 6), retrying
wd1a:  uncorrectable data error reading fsbn 768417 of 768384-0 (wd1 bn 768417; cn 762 tn 5 sn 6)

Okay, so clearly wd1 has some issues (wd1a is the only filesystem on that
disk).  I've already started moving the data to a different disk.

Now, I thought I was going to be alerted to this sort of thing automatically
because of an entry like this one in the crontab:

0 * * * *       /sbin/atactl /dev/wd0c smartstatus >/dev/null

However, when I run this by hand, I get

[EMAIL PROTECTED]:$ sudo /sbin/atactl /dev/wd1 smartstatus
No SMART threshold exceeded

So clearly, the SMART stuff wasn't going to tell me about this.

However:
[EMAIL PROTECTED]:$ sudo /sbin/atactl /dev/wd1 readattr
Attributes table revision: 16
ID      Attribute name                  Threshold       Value   Raw
  1     Raw Read Error Rate               51            199     0x000000000081
  3     Spin Up Time                      21            123     0x000000001127
  4     Start/Stop Count                  40             99     0x00000000056f
  5     Reallocated Sector Count         140            200     0x000000000000
  7     Seek Error Rate                   51            200     0x000000000000
  9     Power-on Hours Count               0             73     0x000000004da4
 10     Spin Retry Count                  51            100     0x000000000000
 11     Unknown                           51            100     0x000000000000
 12     Device Power Cycle Count           0             99     0x00000000056e
194     Temperature                        0            101     0x000000000031
196     Reallocation Event Count           0            200     0x000000000000
197     Current Pending Sector Count       0            197     0x000000000068
198     Off-line Scan Uncorrectable Sect   0            199     0x000000000032
199     Ultra DMA CRC Error Count          0            200     0x000000000000

I see a number of values that exceed the preset thresholds.
But I see the same kinds of values on the other three drives:

[EMAIL PROTECTED]:$ sudo /sbin/atactl /dev/wd0 readattr
Attributes table revision: 16
ID      Attribute name                  Threshold       Value   Raw
  1     Raw Read Error Rate               51            200     0x000000000000
  3     Spin Up Time                      21             96     0x00000000175f
  4     Start/Stop Count                  40             96     0x00000000110f
  5     Reallocated Sector Count         140            196     0x00000000003a
  7     Seek Error Rate                   51            200     0x000000000000
  9     Power-on Hours Count               0             80     0x000000003a71
 10     Spin Retry Count                  51            100     0x000000000000
 11     Unknown                           51            100     0x000000000000
 12     Device Power Cycle Count           0             99     0x000000000585
196     Reallocation Event Count           0            181     0x000000000013
197     Current Pending Sector Count       0            200     0x000000000000
198     Off-line Scan Uncorrectable Sect   0            200     0x000000000000
199     Ultra DMA CRC Error Count          0            200     0x000000000001
200     Unknown                           51            200     0x000000000000
[EMAIL PROTECTED]:$ sudo /sbin/atactl /dev/wd2 readattr
Attributes table revision: 16
ID      Attribute name                  Threshold       Value   Raw
  3     Spin Up Time                      63            200     0x000000001b3c
  4     Start/Stop Count                   0            253     0x000000000020
  5     Reallocated Sector Count          63            253     0x000000000000
  6     Unknown                          100            253     0x000000000000
  7     Seek Error Rate                    0            253     0x000000000000
  8     Seek Time Performance            187            253     0x00000000aa64
  9     Power-on Hours Count               0            217     0x00000000b2b8
 10     Spin Retry Count                 157            253     0x000000000000
 11     Unknown                          223            253     0x000000000000
 12     Device Power Cycle Count           0            253     0x00000000003b
192     Power-off Retract Count            0            253     0x000000000000
193     Load Cycle Count                   0            253     0x000000000000
194     Temperature                        0            253     0x00000000001f
195     Unknown                            0            253     0x000000009dba
196     Reallocation Event Count           0            253     0x000000000000
197     Current Pending Sector Count       0            253     0x000000000000
198     Off-line Scan Uncorrectable Sect   0            253     0x000000000000
199     Ultra DMA CRC Error Count          0            199     0x000000000000
200     Unknown                            0            253     0x000000000000
201     Unknown                            0            253     0x00000000014e
202     Unknown                            0            253     0x000000000000
203     Unknown                          180            253     0x000000000008
204     Unknown                            0            253     0x000000000000
205     Unknown                            0            253     0x000000000000
207     Unknown                            0            253     0x000000000000
208     Unknown                            0            253     0x000000000000
209     Unknown                            0            253     0x000000000000
 99     Unknown                            0            253     0x000000000000
100     Unknown                            0            253     0x000000000000
101     Unknown                            0            253     0x000000000000
[EMAIL PROTECTED]:$ sudo /sbin/atactl /dev/wd3 readattr
Attributes table revision: 16
ID      Attribute name                  Threshold       Value   Raw
  3     Spin Up Time                      63            204     0x00000000330f
  4     Start/Stop Count                   0            253     0x000000000041
  5     Reallocated Sector Count          63            253     0x000000000000
  6     Unknown                          100            253     0x000000000000
  7     Seek Error Rate                    0            253     0x000000000000
  8     Seek Time Performance            187            253     0x00000000c738
  9     Power-on Hours Count               0            211     0x000000006ace
 10     Spin Retry Count                 157            253     0x000000000000
 11     Unknown                          223            253     0x000000000000
 12     Device Power Cycle Count           0            253     0x000000000063
192     Power-off Retract Count            0            253     0x000000000000
193     Load Cycle Count                   0            253     0x000000000000
194     Temperature                        0            253     0x000000000024
195     Unknown                            0            253     0x000000000ca3
196     Reallocation Event Count           0            253     0x000000000000
197     Current Pending Sector Count       0            253     0x000000000000
198     Off-line Scan Uncorrectable Sect   0            253     0x000000000000
199     Ultra DMA CRC Error Count          0            199     0x000000000000
200     Unknown                            0            253     0x000000000000
201     Unknown                            0            253     0x000000000000
202     Unknown                            0            253     0x000000000000
203     Unknown                          180            253     0x000000000000
204     Unknown                            0            253     0x000000000000
205     Unknown                            0            253     0x000000000000
207     Unknown                            0            253     0x000000000000
208     Unknown                            0            253     0x000000000000
209     Unknown                            0            193     0x000000000000
 99     Unknown                            0            253     0x000000000000
100     Unknown                            0            253     0x000000000000
101     Unknown                            0            253     0x000000000000
[EMAIL PROTECTED]:$ 

I'm not sure what to believe in all of this.  The only thing I can state
clearly is that wd1 appears to be going bad, but I can't find a good way to
be alerted to this before I actually get input/output errors in the
filesystem.  What's the best way to do this short of monitoring?
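
For illustration, here is a rough sketch of the kind of check I have in
mind.  It reads the readattr output on stdin and reports any watched
attribute whose raw counter is non-zero.  The choice of IDs (5, 196, 197
and 198), the function name check_readattr, and the idea that a non-zero
raw value is worth an alert are all my own assumptions, not something
atactl provides; the output layout is assumed to match the tables above.

```shell
#!/bin/sh
# Sketch only: flag watched SMART attributes with a non-zero raw counter.
# Expects `atactl <disk> readattr` output on stdin.
# Hypothetical usage: atactl /dev/wd1 readattr | check_readattr
check_readattr() {
    awk '
        # Data rows start with a numeric ID; header lines are skipped.
        # Attribute names contain spaces, so the raw value is read from
        # the last field rather than from a fixed column number.
        $1 ~ /^[0-9]+$/ && ($1 == 5 || $1 == 196 || $1 == 197 || $1 == 198) {
            if ($NF !~ /^0x0+$/) {
                printf "attribute %s raw %s\n", $1, $NF
                bad = 1
            }
        }
        # Exit status 1 if anything was flagged, 0 otherwise.
        END { exit bad + 0 }
    '
}
```

Run from cron, this prints nothing for a drive whose watched counters are
all zero, so mail would only arrive when something is flagged.  Against the
tables above it would flag attributes 197 and 198 on wd1, and also 5 and
196 on wd0.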


-- 
Derick Siddoway      And so, the children of the revolution were faced with the
[EMAIL PROTECTED]  age-old problem: it wasn't that you had the wrong kind of 
                     government, which was obvious, but that you had the wrong
                     kind of people.  ( Terry Pratchett, "Night Watch" )
