Trying to copy a file from one filesystem to another, I kept getting
input/output errors. I noticed these messages in the logs:
wd1a: uncorrectable data error reading fsbn 768416 of 768384-0 (wd1 bn 768416;
cn 762 tn 5 sn 5), retrying
wd1a: uncorrectable data error reading fsbn 768416 of 768384-0 (wd1 bn 768416;
cn 762 tn 5 sn 5), retrying
wd1a: uncorrectable data error reading fsbn 768416 of 768384-0 (wd1 bn 768416;
cn 762 tn 5 sn 5), retrying
wd1a: uncorrectable data error reading fsbn 768416 of 768384-0 (wd1 bn 768416;
cn 762 tn 5 sn 5), retrying
wd1a: uncorrectable data error reading fsbn 768417 of 768384-0 (wd1 bn 768417;
cn 762 tn 5 sn 6), retrying
wd1a: uncorrectable data error reading fsbn 768417 of 768384-0 (wd1 bn 768417;
cn 762 tn 5 sn 6)
Okay, so clearly wd1 has some issues (wd1a is the only filesystem on that
disk). I've already started moving the data to a different disk.
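
To get a quick count of how many distinct blocks are failing, something like the following Python sketch could be run over the log text. It assumes the messages look exactly like the ones quoted above. The regular expression and the bad_blocks helper are my own invention, not part of any system tool. Whitespace is collapsed first so that messages wrapped across lines still match.

```python
import re

# Pattern derived only from the kernel messages quoted above; the exact
# wording may differ on other releases or drivers (assumption).
ERR_RE = re.compile(
    r"wd\d+[a-p]: uncorrectable data error reading fsbn \d+"
    r" of \d+-\d+ \((?P<disk>wd\d+) bn (?P<bn>\d+);"
    r" cn \d+ tn \d+ sn \d+\)"
)


def bad_blocks(log_text):
    """Return {disk: sorted list of distinct failing block numbers}."""
    # Collapse all whitespace so a message split across lines still matches.
    flat = " ".join(log_text.split())
    found = {}
    for m in ERR_RE.finditer(flat):
        found.setdefault(m.group("disk"), set()).add(int(m.group("bn")))
    return {disk: sorted(blocks) for disk, blocks in found.items()}


# Two of the messages from the log above, wrapped the same way.
SAMPLE = """\
wd1a: uncorrectable data error reading fsbn 768416 of 768384-0 (wd1 bn 768416;
cn 762 tn 5 sn 5), retrying
wd1a: uncorrectable data error reading fsbn 768417 of 768384-0 (wd1 bn 768417;
cn 762 tn 5 sn 6)
"""

print(bad_blocks(SAMPLE))  # prints {'wd1': [768416, 768417]}
```

On the messages shown here, that gives two distinct bad blocks on wd1, both in the same cylinder and track.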
Now, I thought I was going to be alerted to this sort of thing automatically
because of an entry like this one in the crontab:
0 * * * * /sbin/atactl /dev/wd0c smartstatus >/dev/null
However, when I run this by hand, I get
[EMAIL PROTECTED]:$ sudo /sbin/atactl /dev/wd1 smartstatus
No SMART threshold exceeded
So clearly, the SMART stuff wasn't going to tell me about this.
However:
[EMAIL PROTECTED]:$ sudo /sbin/atactl /dev/wd1 readattr
Attributes table revision: 16
ID Attribute name Threshold Value Raw
1 Raw Read Error Rate 51 199 0x000000000081
3 Spin Up Time 21 123 0x000000001127
4 Start/Stop Count 40 99 0x00000000056f
5 Reallocated Sector Count 140 200 0x000000000000
7 Seek Error Rate 51 200 0x000000000000
9 Power-on Hours Count 0 73 0x000000004da4
10 Spin Retry Count 51 100 0x000000000000
11 Unknown 51 100 0x000000000000
12 Device Power Cycle Count 0 99 0x00000000056e
194 Temperature 0 101 0x000000000031
196 Reallocation Event Count 0 200 0x000000000000
197 Current Pending Sector Count 0 197 0x000000000068
198 Off-line Scan Uncorrectable Sect 0 199 0x000000000032
199 Ultra DMA CRC Error Count 0 200 0x000000000000
I see a number of values that exceed the preset thresholds.
But I see the same kinds of values on the other three drives:
[EMAIL PROTECTED]:$ sudo /sbin/atactl /dev/wd0 readattr
Attributes table revision: 16
ID Attribute name Threshold Value Raw
1 Raw Read Error Rate 51 200 0x000000000000
3 Spin Up Time 21 96 0x00000000175f
4 Start/Stop Count 40 96 0x00000000110f
5 Reallocated Sector Count 140 196 0x00000000003a
7 Seek Error Rate 51 200 0x000000000000
9 Power-on Hours Count 0 80 0x000000003a71
10 Spin Retry Count 51 100 0x000000000000
11 Unknown 51 100 0x000000000000
12 Device Power Cycle Count 0 99 0x000000000585
196 Reallocation Event Count 0 181 0x000000000013
197 Current Pending Sector Count 0 200 0x000000000000
198 Off-line Scan Uncorrectable Sect 0 200 0x000000000000
199 Ultra DMA CRC Error Count 0 200 0x000000000001
200 Unknown 51 200 0x000000000000
[EMAIL PROTECTED]:$ sudo /sbin/atactl /dev/wd2 readattr
Attributes table revision: 16
ID Attribute name Threshold Value Raw
3 Spin Up Time 63 200 0x000000001b3c
4 Start/Stop Count 0 253 0x000000000020
5 Reallocated Sector Count 63 253 0x000000000000
6 Unknown 100 253 0x000000000000
7 Seek Error Rate 0 253 0x000000000000
8 Seek Time Performance 187 253 0x00000000aa64
9 Power-on Hours Count 0 217 0x00000000b2b8
10 Spin Retry Count 157 253 0x000000000000
11 Unknown 223 253 0x000000000000
12 Device Power Cycle Count 0 253 0x00000000003b
192 Power-off Retract Count 0 253 0x000000000000
193 Load Cycle Count 0 253 0x000000000000
194 Temperature 0 253 0x00000000001f
195 Unknown 0 253 0x000000009dba
196 Reallocation Event Count 0 253 0x000000000000
197 Current Pending Sector Count 0 253 0x000000000000
198 Off-line Scan Uncorrectable Sect 0 253 0x000000000000
199 Ultra DMA CRC Error Count 0 199 0x000000000000
200 Unknown 0 253 0x000000000000
201 Unknown 0 253 0x00000000014e
202 Unknown 0 253 0x000000000000
203 Unknown 180 253 0x000000000008
204 Unknown 0 253 0x000000000000
205 Unknown 0 253 0x000000000000
207 Unknown 0 253 0x000000000000
208 Unknown 0 253 0x000000000000
209 Unknown 0 253 0x000000000000
99 Unknown 0 253 0x000000000000
100 Unknown 0 253 0x000000000000
101 Unknown 0 253 0x000000000000
[EMAIL PROTECTED]:$ sudo /sbin/atactl /dev/wd3 readattr
Attributes table revision: 16
ID Attribute name Threshold Value Raw
3 Spin Up Time 63 204 0x00000000330f
4 Start/Stop Count 0 253 0x000000000041
5 Reallocated Sector Count 63 253 0x000000000000
6 Unknown 100 253 0x000000000000
7 Seek Error Rate 0 253 0x000000000000
8 Seek Time Performance 187 253 0x00000000c738
9 Power-on Hours Count 0 211 0x000000006ace
10 Spin Retry Count 157 253 0x000000000000
11 Unknown 223 253 0x000000000000
12 Device Power Cycle Count 0 253 0x000000000063
192 Power-off Retract Count 0 253 0x000000000000
193 Load Cycle Count 0 253 0x000000000000
194 Temperature 0 253 0x000000000024
195 Unknown 0 253 0x000000000ca3
196 Reallocation Event Count 0 253 0x000000000000
197 Current Pending Sector Count 0 253 0x000000000000
198 Off-line Scan Uncorrectable Sect 0 253 0x000000000000
199 Ultra DMA CRC Error Count 0 199 0x000000000000
200 Unknown 0 253 0x000000000000
201 Unknown 0 253 0x000000000000
202 Unknown 0 253 0x000000000000
203 Unknown 180 253 0x000000000000
204 Unknown 0 253 0x000000000000
205 Unknown 0 253 0x000000000000
207 Unknown 0 253 0x000000000000
208 Unknown 0 253 0x000000000000
209 Unknown 0 193 0x000000000000
99 Unknown 0 253 0x000000000000
100 Unknown 0 253 0x000000000000
101 Unknown 0 253 0x000000000000
[EMAIL PROTECTED]:$
I'm not sure what to believe in all of this. The only thing I can clearly
state is that wd1 appears to be going bad, but I can't see a good way to
be alerted to this before actually getting input/output errors in
the filesystem. What's the best way to do this short of monitoring?
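
One thing I could imagine doing is a small script run from cron that looks at the raw counters instead of relying on smartstatus. As far as I understand SMART, an attribute only counts as failing once its normalized value drops to or below the threshold, which would explain why smartstatus stays quiet while the raw pending-sector count on wd1 is already non-zero. The Python sketch below assumes the readattr output has exactly the column layout shown above. The WATCHED list of attribute IDs and the function names are my own choices, not anything atactl defines.

```python
import re
import subprocess

# Attribute IDs whose raw counters seem worth watching (my assumption):
# reallocated sectors, reallocation events, pending sectors, and
# off-line uncorrectable sectors.
WATCHED = {5, 196, 197, 198}

# One table row: ID, name, threshold, value, raw (hex), as in the output above.
ROW_RE = re.compile(r"^\s*(\d+)\s+(.+?)\s+(\d+)\s+(\d+)\s+(0x[0-9a-fA-F]+)\s*$")


def parse_readattr(text):
    """Parse `atactl <dev> readattr` output into a list of dicts."""
    rows = []
    for line in text.splitlines():
        m = ROW_RE.match(line)
        if m:
            rows.append({
                "id": int(m.group(1)),
                "name": m.group(2),
                "threshold": int(m.group(3)),
                "value": int(m.group(4)),
                "raw": int(m.group(5), 16),
            })
    return rows


def warnings(rows):
    """Return a list of human-readable warnings for suspicious attributes."""
    msgs = []
    for r in rows:
        # Normalized value at or below a non-zero threshold: SMART failure.
        if r["threshold"] > 0 and r["value"] <= r["threshold"]:
            msgs.append(f"{r['name']}: value {r['value']} at or below "
                        f"threshold {r['threshold']}")
        # Non-zero raw counter on a watched attribute: early warning.
        if r["id"] in WATCHED and r["raw"] > 0:
            msgs.append(f"{r['name']}: raw count {r['raw']}")
    return msgs


def check(device):
    """Run atactl on a device (needs root) and return its warnings."""
    out = subprocess.run(["/sbin/atactl", device, "readattr"],
                         capture_output=True, text=True, check=True).stdout
    return warnings(parse_readattr(out))


# A few rows from the wd1 output above.
SAMPLE = """\
Attributes table revision: 16
ID Attribute name Threshold Value Raw
5 Reallocated Sector Count 140 200 0x000000000000
197 Current Pending Sector Count 0 197 0x000000000068
198 Off-line Scan Uncorrectable Sect 0 199 0x000000000032
"""

for msg in warnings(parse_readattr(SAMPLE)):
    print(msg)
```

On the wd1 rows this reports 104 pending sectors and 50 off-line uncorrectable sectors. Since cron mails whatever a job prints, a wrapper that calls check() for each disk and prints only when the list is non-empty would send mail only when something looks wrong. Fed the wd0 output above, it would also flag the 58 reallocated sectors there.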
--
Derick Siddoway
[EMAIL PROTECTED]

And so, the children of the revolution were faced with the age-old
problem: it wasn't that you had the wrong kind of government, which was
obvious, but that you had the wrong kind of people.
( Terry Pratchett, "Night Watch" )