Disk failure and smartctl + badblocks

Serge Petruchok Mon, 21 Mar 2011 12:17:28 -0700

Hello everyone.

I would appreciate some advice.


I have a freshly installed Debian Squeeze with new WD1002FAEX-00Z3A0 drives
(WD Black).
Yesterday one of the disks dropped out of the software RAID-1.

The smartctl -t long|short /dev/sdb tests ended with errors almost immediately;
Current_Pending_Sector rose first to 4, then to 17, while
Reallocated_Event_Count shows 0.

I rebooted the box and added "acpi=off noapic" to GRUB.
I ran "badblocks -w ..." over the areas listed in the SMART log;
it reported no errors.
I ran smartctl -t once more; as a result Current_Pending_Sector now
shows "0" and the drive test completed normally.
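
For reference, a minimal sketch of how the LBAs from the SMART log could be
mapped to the block range that badblocks expects. It assumes the drive reports
LBAs in 512-byte sectors; the 4096-byte block size and the helper names are
illustrative assumptions, not the exact command that was run. Note that
badblocks takes the last block before the first block, and that -w overwrites
the data in the tested range.

```python
SECTOR_SIZE = 512  # assumption: LBAs in the SMART log count 512-byte sectors


def lba_to_block_range(lba, sectors=8, block_size=4096, margin=0):
    """Return (first_block, last_block) covering `sectors` sectors from `lba`."""
    first_byte = lba * SECTOR_SIZE
    last_byte = (lba + sectors) * SECTOR_SIZE - 1
    first = max(first_byte // block_size - margin, 0)
    last = last_byte // block_size + margin
    return first, last


def badblocks_command(device, lba, sectors=8, block_size=4096, margin=0):
    """Build a destructive write-mode badblocks command line for the range."""
    first, last = lba_to_block_range(lba, sectors, block_size, margin)
    # badblocks syntax: badblocks [options] device [last-block] [first-block]
    return f"badblocks -b {block_size} -w -s {device} {last} {first}"


if __name__ == "__main__":
    # LBA of the repeated UNC error reported in the SMART error log
    print(badblocks_command("/dev/sdb", 158098392))
```

A nonzero margin widens the range by that many blocks on each side, which
may help when neighbouring sectors are also suspect.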

I ran "dd if=/dev/sdb of=/dev/null" and it did not complain once.

So I am wondering: should I replace the disk right away? Or is this an ACPI
bug (the motherboard is a Gigabyte, socket 775), or a power/cabling problem?
If the disk really is bad, why does the SMART test now report that everything
is fine and there are no errors?

PS. Sorry for the long message.

Current SMART state:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   228   173   021    Pre-fail  Always       -       1575
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       146
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       148
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       144
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       123
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       22
194 Temperature_Celsius     0x0022   114   108   000    Old_age   Always       -       33
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
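
To read a table like this at a glance, here is a rough sketch that pulls the
raw values out of attribute rows and reports the sector-health counters that
are nonzero. It assumes each row sits on a single line, as smartctl prints it,
with RAW_VALUE as a plain integer in the tenth column. The function names and
the choice of watched attributes are illustrative assumptions, and the sample
row with 17 pending sectors reproduces the earlier state described in this
message, not the table as shown.

```python
# Sector-health attributes to watch (an illustrative selection).
WATCHED = {
    5: "Reallocated_Sector_Ct",
    196: "Reallocated_Event_Count",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
}


def parse_raw_values(text):
    """Map attribute ID -> raw value for rows of an attribute table."""
    raw = {}
    for line in text.splitlines():
        fields = line.split()
        # A data row starts with a numeric ID and has at least ten columns.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            raw[int(fields[0])] = int(fields[9])
    return raw


def nonzero_watched(text):
    """Return the watched attributes whose raw value is above zero."""
    raw = parse_raw_values(text)
    return {WATCHED[i]: raw[i] for i in WATCHED if raw.get(i, 0) > 0}


sample = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       17
"""

if __name__ == "__main__":
    print(nonzero_watched(sample))
```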

SMART Error Log Version: 1
ATA Error Count: 24 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 24 occurred at disk power-on lifetime: 131 hours (5 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 d8 63 6c e9  Error: UNC 8 sectors at LBA = 0x096c63d8 = 158098392

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 d8 63 6c e9 08      00:00:47.284  READ DMA
  ec 00 00 00 00 00 a0 08      00:00:47.274  IDENTIFY DEVICE
  ef 03 42 00 00 00 a0 08      00:00:47.274  SET FEATURES [Set transfer mode]

Error 23 occurred at disk power-on lifetime: 131 hours (5 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 d8 63 6c e9  Error: UNC 8 sectors at LBA = 0x096c63d8 = 158098392

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 d8 63 6c e9 08      00:00:45.469  READ DMA
  ec 00 00 00 00 00 a0 08      00:00:45.459  IDENTIFY DEVICE
  ef 03 42 00 00 00 a0 08      00:00:45.452  SET FEATURES [Set transfer mode]

Error 22 occurred at disk power-on lifetime: 131 hours (5 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 d8 63 6c e9  Error: UNC 8 sectors at LBA = 0x096c63d8 = 158098392

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 d8 63 6c e9 08      00:00:43.646  READ DMA
  ec 00 00 00 00 00 a0 08      00:00:43.636  IDENTIFY DEVICE
  ef 03 42 00 00 00 a0 08      00:00:43.630  SET FEATURES [Set transfer mode]

Error 21 occurred at disk power-on lifetime: 131 hours (5 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 d8 63 6c e9  Error: UNC 8 sectors at LBA = 0x096c63d8 = 158098392

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 d8 63 6c e9 08      00:00:41.823  READ DMA
  ec 00 00 00 00 00 a0 08      00:00:41.814  IDENTIFY DEVICE
  ef 03 42 00 00 00 a0 08      00:00:41.807  SET FEATURES [Set transfer mode]

Error 20 occurred at disk power-on lifetime: 131 hours (5 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 08 d8 63 6c e9  Error: UNC 8 sectors at LBA = 0x096c63d8 = 158098392

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 08 d8 63 6c e9 08      00:00:39.998  READ DMA
  ec 00 00 00 00 00 a0 08      00:00:39.989  IDENTIFY DEVICE
  ef 03 42 00 00 00 a0 08      00:00:39.981  SET FEATURES [Set transfer mode]
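
All five logged errors carry the same register dump. As a sketch of how the
LBA that smartctl prints follows from those registers, assuming 28-bit LBA
addressing (which the READ DMA command, c8, uses): the low four bits of DH are
LBA bits 27..24, followed by CH, CL and SN. Bit 6 of the error register is the
UNC (uncorrectable data) flag. The function names are illustrative.

```python
def lba28_from_registers(sn, cl, ch, dh):
    """Assemble a 28-bit LBA from the ATA task-file registers."""
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


def is_uncorrectable(er):
    """True if the UNC bit (bit 6) is set in the error register."""
    return bool(er & 0x40)


# Register row from the error log above, in the order: ER ST SC SN CL CH DH
er, st, sc, sn, cl, ch, dh = (int(x, 16) for x in "40 51 08 d8 63 6c e9".split())

lba = lba28_from_registers(sn, cl, ch, dh)

if __name__ == "__main__":
    print(f"UNC={is_uncorrectable(er)} sectors={sc} LBA=0x{lba:08x} ({lba})")
```

The decoded value agrees with the "LBA = 0x096c63d8 = 158098392" that smartctl
reports for each of the five errors.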

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       136         -
# 2  Short offline       Completed without error       00%       133         -
# 3  Short offline       Completed: read failure       90%       133         167404513
# 4  Short offline       Completed: read failure       90%       133         165367582
# 5  Short offline       Completed: read failure       90%       133         160274282
# 6  Short offline       Completed: read failure       90%       133         158157715
# 7  Short offline       Completed: read failure       90%       133         158098392
# 8  Extended offline    Completed: read failure       90%       131         158098392
# 9  Extended offline    Completed: read failure       90%       131         158098392
#10  Extended offline    Completed without error       00%        19         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


-- 
Best regards,
Sergey.

