Server: Intel S3000AH with 8GB RAM, 4 x 160GB WDC [1] HDDs (RAID10)
(/dev/sdb1, /dev/sdc1, /dev/sdd1, /dev/sde1)
OS: Debian Testing AMD64 (Aug/10 build), 2.6.32-5, smartmontools v5.40

Three days ago, mdadm started reporting a RAID member failure (/dev/sdb1).
A quick look at /proc/mdstat showed that the first device was offline.
Yet the disk showed up fine in the BIOS, as well as with fdisk and
smartctl, without any timeouts [3].  I was able to add /dev/sdb1 back to
the array manually with "mdadm --add /dev/sdb1 /dev/md0".
mdadm rebuilt the array and things went fine for 2 days.

Last night /dev/sdb1 went offline again.  mdadm complained that it was
unavailable, but the disk still showed up fine in the BIOS, and fdisk
showed /dev/sdb1 with partition ID "fd".

However, I was unable to add /dev/sdb1 to the /dev/md0 device this
time.  mdadm complained /dev/sdb1 was not an md device!

# fdisk -l /dev/sdb  (output shown below)

Disk /dev/sdb: 160.0 GB, 160041885696 bytes
255 heads, 63 sectors/track, 19457 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000bc454

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1       19457   156288321   fd  Linux raid autodetect

# mdadm --add /dev/sdb1 /dev/md0  (output shown below)

mdadm: /dev/sdb1 does not appear to be an md device
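One thing worth checking here, offered as a sketch rather than a diagnosis: mdadm's manage mode normally takes the md device first and the component device after the option, so with the argument order used above mdadm would treat /dev/sdb1 as the array, which matches the wording of this message.  The helper below is hypothetical (md_readd_plan is not part of mdadm) and only prints the commands instead of running them; the device names are the ones from this post.

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints the mdadm commands, does not run them.
# Assumption: the array is /dev/md0 and the dropped member is /dev/sdb1.

md_readd_plan() {
    md_dev="$1"    # the array, e.g. /dev/md0
    member="$2"    # the component partition, e.g. /dev/sdb1

    # Inspect the array state and the member's md superblock first.
    echo "cat /proc/mdstat"
    echo "mdadm --detail $md_dev"
    echo "mdadm --examine $member"

    # Manage mode: md device first, then the option, then the component.
    echo "mdadm $md_dev --add $member"
}

md_readd_plan /dev/md0 /dev/sdb1
```

Reviewing the printed commands before running them by hand (as root) keeps the md device and the member from being swapped by accident.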

NB:  There is nothing in /var/log/messages related to this HDD issue.

However, smartctl does show error counts (excerpts of output below):

# smartctl -a /dev/sdb

BEGIN  =========================================================================

SMART Error Log Version: 1
ATA Error Count: 3
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 3 occurred at disk power-on lifetime: 325 hours (13 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 51 01 00 00 00 40  Error: ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  24 03 01 00 00 00 00 00      00:00:32.090  READ SECTOR(S) EXT
  24 03 01 00 00 00 00 00      00:00:31.186  READ SECTOR(S) EXT
  24 03 01 00 00 00 00 00      00:00:30.287  READ SECTOR(S) EXT
  c6 03 10 00 00 00 00 00      00:00:26.229  SET MULTIPLE MODE
  91 03 3f 00 00 00 0f 00      00:00:26.229  INITIALIZE DEVICE PARAMETERS [OBS-6]

Error 2 occurred at disk power-on lifetime: 325 hours (13 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 51 01 00 00 00 40  Error: ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  24 03 01 00 00 00 00 00      00:00:31.186  READ SECTOR(S) EXT
  24 03 01 00 00 00 00 00      00:00:30.287  READ SECTOR(S) EXT
  c6 03 10 00 00 00 00 00      00:00:26.229  SET MULTIPLE MODE
  91 03 3f 00 00 00 0f 00      00:00:26.229  INITIALIZE DEVICE PARAMETERS [OBS-6]
  ef 03 0c 00 00 00 00 00      00:00:26.229  SET FEATURES [Set transfer mode]

Error 1 occurred at disk power-on lifetime: 325 hours (13 days + 13 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 51 01 00 00 00 40  Error: ABRT at LBA = 0x00000000 = 0

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  24 03 01 00 00 00 00 00      00:00:30.287  READ SECTOR(S) EXT
  c6 03 10 00 00 00 00 00      00:00:26.229  SET MULTIPLE MODE
  91 03 3f 00 00 00 0f 00      00:00:26.229  INITIALIZE DEVICE PARAMETERS [OBS-6]
  ef 03 0c 00 00 00 00 00      00:00:26.229  SET FEATURES [Set transfer mode]
  ef 03 45 00 00 00 00 00      00:00:26.229  SET FEATURES [Set transfer mode]

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]


SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

END =========================================================================
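Since the log above shows that no self-tests have been run on this disk, a short self-test would be one way to gather more data.  The following is a minimal sketch, assuming the disk is still /dev/sdb; smart_selftest_plan is a hypothetical dry-run helper that only prints the smartctl commands rather than running them.

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints the smartctl commands, does not run them.
# Assumption: the suspect disk is /dev/sdb, as in this post.

smart_selftest_plan() {
    disk="$1"

    # Start a short self-test (runs inside the drive, in the background).
    echo "smartctl -t short $disk"

    # Once the test has finished, read the self-test log.
    echo "smartctl -l selftest $disk"

    # Re-check the ATA error log to see whether the error count has grown.
    echo "smartctl -l error $disk"
}

smart_selftest_plan /dev/sdb
```

Replacing "short" with "long" would request the extended self-test, which reads the whole surface and takes considerably longer.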

I have seen error reports on other HDDs but ABRT is what concerns me.
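For anyone reading the register dump, here is a small sketch that decodes the ER and ST bytes (04 and 51) from the log above into flag names.  The bit names follow the classic ATA register layout as I understand it, and decode_bits, decode_status and decode_error are names made up for this example, so treat it as a reading aid rather than an authoritative decoder.

```shell
#!/bin/sh
# Decode ATA Error (ER) and Status (ST) register bytes into flag names.
# Assumption: classic ATA bit layout; helper names are hypothetical.

decode_bits() {
    value=$(( $1 ))    # accepts hex input such as 0x51
    shift
    out=""
    # Remaining arguments are mask:name pairs, highest bit first.
    for pair in "$@"; do
        mask=${pair%%:*}
        name=${pair#*:}
        if [ $(( value & $mask )) -ne 0 ]; then
            out="${out:+$out }$name"
        fi
    done
    echo "$out"
}

decode_status() {
    decode_bits "$1" 0x80:BSY 0x40:DRDY 0x20:DF 0x10:DSC \
                     0x08:DRQ 0x04:CORR 0x02:IDX 0x01:ERR
}

decode_error() {
    decode_bits "$1" 0x80:ICRC 0x40:UNC 0x20:MC 0x10:IDNF \
                     0x08:MCR 0x04:ABRT 0x02:NM 0x01:AMNF
}

echo "ST=0x51: $(decode_status 0x51)"
echo "ER=0x04: $(decode_error 0x04)"
```

For the values in this log, the status byte decodes to DRDY, DSC and ERR, and the error byte to ABRT alone, which agrees with the "Error: ABRT" text smartctl prints.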

I came across quite a few discussions (with the search string "Error: ABRT
at LBA =") but none were conclusive about the disk's condition.

Any ideas / pointers before I RMA this disk to WDC?

[1] Western Digital Caviar Blue Serial ATA family;
    Device Model: WDC WD1600AAJS-00WAA0

Thanks,
-- Arun Khan
_______________________________________________
ILUGC Mailing List:
http://www.ae.iitm.ac.in/mailman/listinfo/ilugc