Server: Intel S3000AH with 8GB RAM, 4 x 160GB WDC [1] HDDs (RAID10)
(/dev/sdb1, /dev/sdc1, /dev/sdd1, /dev/sde1)
OS: Debian Testing AMD64 (Aug/10 build), 2.6.32-5, smartmontools v5.40
Three days ago, mdadm started reporting a RAID member failure (/dev/sdb1).
A quick look through /proc/mdstat showed the first device was offline.
Yet the disk showed up fine in the BIOS, as well as with fdisk and
smartctl, without any timeouts [3]. I was able to add /dev/sdb1 to the
array manually with the following command: "mdadm --add /dev/sdb1 /dev/md0".
mdadm rebuilt the array and things went fine for 2 days.
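
For anyone who wants to script the /proc/mdstat check instead of eyeballing it, here is a minimal Python sketch. It assumes the usual mdstat layout, where each array's status line ends in a member map such as [_UUU] and "_" marks a missing member. The SAMPLE text is made up for illustration and is not the actual output from this server; the function name degraded_arrays is likewise hypothetical.

```python
import re

def degraded_arrays(mdstat_text):
    """Return {array_name: member_map} for arrays that have a missing member."""
    result = {}
    current = None
    for line in mdstat_text.splitlines():
        # An array section starts with a line such as "md0 : active raid10 ..."
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # The member map looks like [UUUU]; "_" marks a missing member.
        status = re.search(r"\[([U_]+)\]", line)
        if current and status and "_" in status.group(1):
            result[current] = status.group(1)
    return result

# Illustrative sample only (invented numbers, not taken from this server).
SAMPLE = """\
Personalities : [raid10]
md0 : active raid10 sde1[3] sdd1[2] sdc1[1]
      312576000 blocks 64K chunks 2 near-copies [4/3] [_UUU]

unused devices: <none>
"""

print(degraded_arrays(SAMPLE))
```

On a live system the same function could be fed the contents of /proc/mdstat, for example degraded_arrays(open("/proc/mdstat").read()).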
Last night the device /dev/sdb1 went offline again; mdadm complained
that it was unavailable, but it still showed up fine in the BIOS, and
fdisk showed /dev/sdb1 with the "fd" ID.
However, I was unable to add /dev/sdb1 to the /dev/md0 device this
time. mdadm complained /dev/sdb1 was not an md device!
# fdisk -l /dev/sdb (output shown below)
Disk /dev/sdb: 160.0 GB, 160041885696 bytes
255 heads, 63 sectors/track, 19457 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000bc454
   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1       19457   156288321   fd  Linux raid autodetect
# mdadm --add /dev/sdb1 /dev/md0 (output shown below)
mdadm: /dev/sdb1 does not appear to be an md device
NB: There is nothing in /var/log/messages related to this HDD issue.
However, "smartctl -a /dev/sdb" does show error counts (excerpts of the
output below):
BEGIN =========================================================================
SMART Error Log Version: 1
ATA Error Count: 3
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 3 occurred at disk power-on lifetime: 325 hours (13 days + 13 hours)
When the command that caused the error occurred, the device was
active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
04 51 01 00 00 00 40 Error: ABRT at LBA = 0x00000000 = 0
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
24 03 01 00 00 00 00 00 00:00:32.090 READ SECTOR(S) EXT
24 03 01 00 00 00 00 00 00:00:31.186 READ SECTOR(S) EXT
24 03 01 00 00 00 00 00 00:00:30.287 READ SECTOR(S) EXT
c6 03 10 00 00 00 00 00 00:00:26.229 SET MULTIPLE MODE
91 03 3f 00 00 00 0f 00 00:00:26.229 INITIALIZE DEVICE PARAMETERS [OBS-6]
Error 2 occurred at disk power-on lifetime: 325 hours (13 days + 13 hours)
When the command that caused the error occurred, the device was
active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
04 51 01 00 00 00 40 Error: ABRT at LBA = 0x00000000 = 0
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
24 03 01 00 00 00 00 00 00:00:31.186 READ SECTOR(S) EXT
24 03 01 00 00 00 00 00 00:00:30.287 READ SECTOR(S) EXT
c6 03 10 00 00 00 00 00 00:00:26.229 SET MULTIPLE MODE
91 03 3f 00 00 00 0f 00 00:00:26.229 INITIALIZE DEVICE PARAMETERS [OBS-6]
ef 03 0c 00 00 00 00 00 00:00:26.229 SET FEATURES [Set transfer mode]
Error 1 occurred at disk power-on lifetime: 325 hours (13 days + 13 hours)
When the command that caused the error occurred, the device was
active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
04 51 01 00 00 00 40 Error: ABRT at LBA = 0x00000000 = 0
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
24 03 01 00 00 00 00 00 00:00:30.287 READ SECTOR(S) EXT
c6 03 10 00 00 00 00 00 00:00:26.229 SET MULTIPLE MODE
91 03 3f 00 00 00 0f 00 00:00:26.229 INITIALIZE DEVICE PARAMETERS [OBS-6]
ef 03 0c 00 00 00 00 00 00:00:26.229 SET FEATURES [Set transfer mode]
ef 03 45 00 00 00 00 00 00:00:26.229 SET FEATURES [Set transfer mode]
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
END =========================================================================
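
As a reading aid for the register dump above, here is a minimal Python sketch that turns the ER and ST hex values into bit names and rebuilds the 28-bit LBA from SN/CL/CH/DH. The bit names are an assumption based on the commonly documented ATA register layout; several bits were renamed or made obsolete across ATA revisions, so treat the tables as approximate. The helper names decode and lba28 are made up for this example.

```python
# Bit names for the ATA Error (ER) and Status (ST) registers.
# Assumed from the common ATA layout; names vary between ATA revisions.
ERROR_BITS = {
    0x80: "ICRC",
    0x40: "UNC",
    0x20: "MC",
    0x10: "IDNF",
    0x08: "MCR",
    0x04: "ABRT",
    0x02: "NM",
    0x01: "AMNF",
}
STATUS_BITS = {
    0x80: "BSY",
    0x40: "DRDY",
    0x20: "DF",
    0x10: "DSC",
    0x08: "DRQ",
    0x04: "CORR",
    0x02: "IDX",
    0x01: "ERR",
}

def decode(value_hex, bit_names):
    """Return the names of all bits set in a two-digit hex register value."""
    value = int(value_hex, 16)
    return [name for mask, name in bit_names.items() if value & mask]

def lba28(sn, cl, ch, dh):
    """Rebuild a 28-bit LBA; only the low nibble of DH carries address bits."""
    return ((int(dh, 16) & 0x0F) << 24) | (int(ch, 16) << 16) \
        | (int(cl, 16) << 8) | int(sn, 16)

# Register values taken from the "04 51 01 00 00 00 40" lines above.
print(decode("04", ERROR_BITS))        # ['ABRT']
print(decode("51", STATUS_BITS))       # ['DRDY', 'DSC', 'ERR']
print(lba28("00", "00", "00", "40"))   # 0
```

Read this way, all three logged errors are the same event: the drive was ready, flagged an error, and aborted a single-sector read at LBA 0. That matches smartctl's own "Error: ABRT at LBA = 0x00000000 = 0" summary.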
I have seen error reports on other HDDs but ABRT is what concerns me.
I came across quite a few discussions (with the search string "Error: ABRT
at LBA =") but none were conclusive about the disk's condition.
Any ideas / pointers before I RMA this disk to WDC?
[1] Western Digital Caviar Blue Serial ATA family,
    Device Model: WDC WD1600AAJS-00WAA0
Thanks,
-- Arun Khan
_______________________________________________
ILUGC Mailing List:
http://www.ae.iitm.ac.in/mailman/listinfo/ilugc