PERC 5/I problems...

Bond Masuda Sun, 24 Oct 2010 19:06:15 -0700

Hello List,

This past month, I've seen some unusual activity on a PERC 5/i with 8x SATA
drives (500GB enterprise drives) in RAID-5 (vd0). Earlier this month, I
noticed drive 0 (pd0) got marked as "failed" and vd0 became degraded.
Strangely, although the array was only marked as "degraded", vd0 was no
longer accessible. The controller was in some critical state and locked up;
fortunately the OS is not installed on the PERC 5/i. The drive itself
appeared to be fine, so after a reboot into the PERC BIOS I had it rebuild
itself, which completed fine, and vd0 was back to 'optimal'. At the time, I
looked at the SMART data from pd0 and didn't see any errors at all, so I
didn't look further into it.


Today, pd2 got marked as failed and vd0 was degraded again. Once again, the
controller went into a critical state and vd0 was no longer accessible even
though omreport showed it only in 'degraded' state. Again, after a reboot I
set pd2 to rebuild itself and that succeeded without issue. vd0 is now
'optimal' again. Since this happened twice in one month, I decided to look
closer this time. I exported the log from the controller and have pasted
some of the relevant info below.

It appears that during patrol reads and consistency checks (scheduled to run
at the beginning of each month), I am getting medium errors on pd4. I've
looked further back in the logs for September and August and, similarly,
I've seen some errors on pd4 that were recovered.

Here is one of the errors on pd4 during the monthly consistency check:

10/01/10  3:51:29: EVT#22851-10/01/10  3:51:29:  65=Consistency Check
progress on VD 00/0 is 74.86%(6735s)
10/01/10  3:51:40: DEV_REC:Medium Error DevId[4] Tgt 4 retires=0
10/01/10  3:51:40: ErrLBAOffset (31) LBA(2b968f00) BadLba=2b968f31
10/01/10  3:51:40: EVT#22852-10/01/10  3:51:40: 113=Unexpected sense: PD
04(e0/s4), CDB: 28 00 2b 96 8f 00 00 01 00 00, Sense: f0 00 03 2b 96 8f 31
0a 00 00 00 00 11 00 00 00 00 0
10/01/10  3:51:40: EVT#22853-10/01/10  3:51:40:  57=Consistency Check
corrected medium error (VD 00/0 at 2b968f31, PD 04(e0/s4) at 2b968f31)
10/01/10  3:51:56: EVT#22854-10/01/10  3:51:56:  65=Consistency Check
progress on VD 00/0 is 75.11%(6762s)
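
For anyone who wants to check the "Sense:" bytes in event 113 by hand, here is a minimal Python sketch. It assumes the bytes are standard fixed-format SCSI sense data (response code 0x70/0x71), which the leading f0 suggests. The function name and the small sense-key table are mine, for illustration only, not anything from the controller or its tools.

```python
# Sketch: decode the "Sense:" bytes from a controller event 113 line.
# Assumption: standard fixed-format SCSI sense data (response code 0x70/0x71).

SENSE_KEYS = {
    0x0: "NO SENSE",
    0x1: "RECOVERED ERROR",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
}


def decode_fixed_sense(sense_text):
    """Return the main fields of a fixed-format sense buffer given as hex text."""
    # Parse token by token: the log truncates the last byte to a single hex
    # digit, which bytes.fromhex() would reject.
    b = [int(tok, 16) for tok in sense_text.split()]
    if len(b) < 14:
        raise ValueError("need at least 14 sense bytes")
    if (b[0] & 0x7F) not in (0x70, 0x71):
        raise ValueError("not fixed-format sense data")
    key = b[2] & 0x0F
    return {
        "valid": bool(b[0] & 0x80),            # information field is meaningful
        "sense_key": SENSE_KEYS.get(key, hex(key)),
        "information": int.from_bytes(bytes(b[3:7]), "big"),  # LBA for disks
        "asc": b[12],                          # additional sense code
        "ascq": b[13],                         # additional sense code qualifier
    }


# Sense bytes copied from EVT#22852 above.
sense = decode_fixed_sense("f0 00 03 2b 96 8f 31 0a 00 00 00 00 11 00 00 00 00 0")
print(sense["sense_key"], hex(sense["information"]), hex(sense["asc"]), hex(sense["ascq"]))
```

For EVT#22852 this gives sense key MEDIUM ERROR, an information field of 0x2b968f31 (the same value the log prints as BadLba), and ASC/ASCQ 11h/00h, which is the code for an unrecovered read error.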


Below are the messages from a patrol read that also had a problem with pd4:


10/03/10 10:43:58: EVT#23025-10/03/10 10:43:58:  94=Patrol Read progress on
PD 01(e0/s1) is 70.00%(6533s)
10/03/10 10:52:13: DEV_REC:Medium Error DevId[4] Tgt 4 retires=0
10/03/10 10:52:13: ErrLBAOffset (2616) LBA(2b968000) BadLba=2b96a616
10/03/10 10:52:13: prCallback: Medium Error on pd=04, StartLba=2b968000,
ErrLba=2b96a616
10/03/10 10:52:13: prRecQueue: starting pd=04 recovery - blocking host
commands
10/03/10 10:52:13: EVT#23026-10/03/10 10:52:13: 113=Unexpected sense: PD
04(e0/s4), CDB: 2f 00 2b 96 80 00 00 80 00 00, Sense: f0 00 03 2b 96 a6 16
0a 00 00 00 00 11 00 00 00 00 0
10/03/10 10:52:14: prRecGo: Ready to attempt recovery errLBA=2b96a616 on
pd=04
10/03/10 10:52:14: prGetLDInfo: MediaErr in ld=0, span=0, arm=4
10/03/10 10:52:14: prRecGo: dataErr found on ld 0 span 0 arm 4
10/03/10 10:52:14: prRecGo: data NOT in cache; cacheLn=ffffffff, row=2b96a6,
stripe=1311e8c, refBlk=16, type=0
10/03/10 10:52:14: prRecGo: R5-get cacheLn=f4, c_ptr=a1429700 mem=a13e6150 &
setup cInx=49a c=a0dbcfe0
10/03/10 10:52:14: RtnFrmPrcRcyRd for arm=1 mem=a140aa14 stripe=2b96a6
type=0 i=1 status=0
10/03/10 10:52:14: RtnFrmPrcRcyRd for arm=2 mem=a140aa28 stripe=1311e8a
type=0 i=0 status=1
10/03/10 10:52:14: RtnFrmPrcRcyRd for arm=3 mem=a140aa3c stripe=1311e8b
type=0 i=0 status=2
10/03/10 10:52:14: RtnFrmPrcRcyRd for arm=5 mem=a140aa64 stripe=1311e8d
type=0 i=0 status=4
10/03/10 10:52:14: RtnFrmPrcRcyRd for arm=6 mem=a140aa78 stripe=1311e8e
type=0 i=0 status=5
10/03/10 10:52:14: RtnFrmPrcRcyRd for arm=7 mem=a140aa8c stripe=1311e8f
type=0 i=0 status=6
10/03/10 10:52:14: RtnFrmPrcRcyRd for arm=0 mem=a140aa00 stripe=1311e90
type=0 i=0 status=7
10/03/10 10:52:14: Issuing write verify pd=04, arm=4, span=0, blk=2b96a616
10/03/10 10:52:14: EVT#23027-10/03/10 10:52:14: 110=Corrected medium error
during recovery on PD 04(e0/s4) at 2b96a616
10/03/10 10:52:14: EVT#23028-10/03/10 10:52:14:  93=Patrol Read corrected
medium error on PD 04(e0/s4) at 2b96a616
10/03/10 10:58:51: EVT#23029-10/03/10 10:58:51:  94=Patrol Read progress on
PD 05(e0/s5) is 79.99%(7426s)
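
The "CDB:" bytes in the same events can be read the same way. The sketch below assumes they are ordinary 10-byte SCSI CDBs (opcode, then a 4-byte big-endian LBA in bytes 2-5 and a 2-byte block count in bytes 7-8), and it checks that the start LBA plus the logged ErrLBAOffset gives the logged BadLba. The helper names are hypothetical.

```python
# Sketch: decode a 10-byte CDB from the log and recompute the bad LBA.
# Assumption: standard 10-byte CDB layout; only the three opcodes that
# appear in the pasted log are named here.

OPCODES = {
    0x28: "READ(10)",
    0x2E: "WRITE AND VERIFY(10)",
    0x2F: "VERIFY(10)",
}


def decode_cdb10(cdb_text):
    """Return opcode name, start LBA and block count of a 10-byte CDB."""
    b = [int(tok, 16) for tok in cdb_text.split()]
    if len(b) != 10:
        raise ValueError("expected a 10-byte CDB")
    return {
        "opcode": OPCODES.get(b[0], hex(b[0])),
        "lba": int.from_bytes(bytes(b[2:6]), "big"),
        "blocks": int.from_bytes(bytes(b[7:9]), "big"),
    }


def bad_lba(cdb, err_offset):
    """Start LBA plus the logged ErrLBAOffset, checked against the length."""
    if not 0 <= err_offset < cdb["blocks"]:
        raise ValueError("offset lies outside the command's block range")
    return cdb["lba"] + err_offset


# CDB and ErrLBAOffset copied from the 10/03/10 10:52:13 patrol read entries.
cdb = decode_cdb10("2f 00 2b 96 80 00 00 80 00 00")
print(cdb["opcode"], hex(cdb["lba"]), cdb["blocks"], hex(bad_lba(cdb, 0x2616)))
```

For the 10/03/10 entry this decodes as a VERIFY(10) of 0x8000 blocks starting at LBA 0x2b968000, and 0x2b968000 + 0x2616 matches the logged BadLba of 0x2b96a616.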


Here's another error on pd4, but then we find problems on pd0 and pd2.
Eventually PD0 is marked as failed (this was the first of this month's two
incidents), VD0 becomes degraded, and the controller goes critical and locks
up.


10/10/10 11:05:48: EVT#23129-10/10/10 11:05:48:  94=Patrol Read progress on
PD 04(e0/s4) is 70.00%(6360s)
10/10/10 11:14:58: DEV_REC:Medium Error DevId[4] Tgt 4 retires=0
10/10/10 11:14:58: ErrLBAOffset (36b9) LBA(2b968000) BadLba=2b96b6b9
10/10/10 11:14:58: EVT#23130-10/10/10 11:14:58: 113=Unexpected sense: PD
04(e0/s4), CDB: 2f 00 2b 96 80 00 00 80 00 00, Sense: f0 00 03 2b 96 b6 b9
0a 00 00 00 00 11 00 00 00 00 0
10/10/10 11:14:58: prCallback: Medium Error on pd=04, StartLba=2b968000,
ErrLba=2b96b6b9
10/10/10 11:14:58: prRecQueue: starting pd=04 recovery - blocking host
commands
10/10/10 11:14:58: prRecGo: Ready to attempt recovery errLBA=2b96b6b9 on
pd=04
10/10/10 11:14:59: prGetLDInfo: MediaErr in ld=0, span=0, arm=4
10/10/10 11:14:59: prRecGo: dataErr found on ld 0 span 0 arm 4
10/10/10 11:14:59: prRecGo: data NOT in cache; cacheLn=ffffffff, row=2b96b6,
stripe=1311efc, refBlk=b9, type=0
10/10/10 11:14:59: prRecGo: R5-get cacheLn=541, c_ptr=a145d0c0 mem=a1405550
& setup cInx=562 c=a0dc65e0
10/10/10 11:14:59: RtnFrmPrcRcyRd for arm=1 mem=a13ef214 stripe=2b96b6
type=0 i=1 status=0
10/10/10 11:14:59: RtnFrmPrcRcyRd for arm=2 mem=a13ef228 stripe=1311efa
type=0 i=0 status=1
10/10/10 11:14:59: RtnFrmPrcRcyRd for arm=3 mem=a13ef23c stripe=1311efb
type=0 i=0 status=2
10/10/10 11:14:59: RtnFrmPrcRcyRd for arm=5 mem=a13ef264 stripe=1311efd
type=0 i=0 status=4
10/10/10 11:14:59: RtnFrmPrcRcyRd for arm=6 mem=a13ef278 stripe=1311efe
type=0 i=0 status=5
10/10/10 11:14:59: RtnFrmPrcRcyRd for arm=7 mem=a13ef28c stripe=1311eff
type=0 i=0 status=6
10/10/10 11:14:59: RtnFrmPrcRcyRd for arm=0 mem=a13ef200 stripe=1311f00
type=0 i=0 status=7
10/10/10 11:14:59: Issuing write verify pd=04, arm=4, span=0, blk=2b96b6b9
10/10/10 11:14:59: EVT#23131-10/10/10 11:14:59: 110=Corrected medium error
during recovery on PD 04(e0/s4) at 2b96b6b9
10/10/10 11:14:59: EVT#23132-10/10/10 11:14:59:  93=Patrol Read corrected
medium error on PD 04(e0/s4) at 2b96b6b9
10/10/10 11:22:22: EVT#23133-10/10/10 11:22:22:  94=Patrol Read progress on
PD 06(e0/s6) is 79.99%(7354s)
10/10/10 11:23:32: EVT#23134-10/10/10 11:23:32:  94=Patrol Read progress on
PD 02(e0/s2) is 79.99%(7424s)
10/10/10 11:23:55: EVT#23135-10/10/10 11:23:55:  94=Patrol Read progress on
PD 07(e0/s7) is 79.99%(7447s)
10/10/10 11:24:24: EVT#23136-10/10/10 11:24:24:  94=Patrol Read progress on
PD 05(e0/s5) is 79.99%(7476s)
10/10/10 11:24:33: EVT#23137-10/10/10 11:24:33:  94=Patrol Read progress on
PD 00(e0/s0) is 79.99%(7485s)
10/10/10 11:24:38: EVT#23138-10/10/10 11:24:38:  94=Patrol Read progress on
PD 03(e0/s3) is 79.99%(7490s)
10/10/10 11:25:12: EVT#23139-10/10/10 11:25:12:  94=Patrol Read progress on
PD 04(e0/s4) is 79.99%(7524s)
10/10/10 11:25:18: EVT#23140-10/10/10 11:25:18:  94=Patrol Read progress on
PD 01(e0/s1) is 79.99%(7530s)
10/10/10 11:43:19: EVT#23141-10/10/10 11:43:19:  94=Patrol Read progress on
PD 06(e0/s6) is 89.99%(8611s)
10/10/10 11:44:42: EVT#23142-10/10/10 11:44:42:  94=Patrol Read progress on
PD 02(e0/s2) is 89.99%(8694s)
10/10/10 11:45:28: EVT#23143-10/10/10 11:45:28:  94=Patrol Read progress on
PD 05(e0/s5) is 89.99%(8740s)
10/10/10 11:45:34: EVT#23144-10/10/10 11:45:34:  94=Patrol Read progress on
PD 07(e0/s7) is 89.99%(8746s)
10/10/10 11:45:37: EVT#23145-10/10/10 11:45:37:  94=Patrol Read progress on
PD 00(e0/s0) is 89.99%(8749s)
10/10/10 11:46:25: EVT#23146-10/10/10 11:46:25:  94=Patrol Read progress on
PD 03(e0/s3) is 89.99%(8797s)
10/10/10 11:46:53: EVT#23147-10/10/10 11:46:53:  94=Patrol Read progress on
PD 01(e0/s1) is 89.99%(8825s)
10/10/10 11:47:10: EVT#23148-10/10/10 11:47:10:  94=Patrol Read progress on
PD 04(e0/s4) is 89.99%(8842s)
10/10/10 11:57:31: DEV_REC:Medium Error DevId[2] Tgt 2 retires=0
10/10/10 11:57:31: ErrLBAOffset (0) LBA(37c50000) BadLba=37c50000
10/10/10 11:57:31: prCallback: Medium Error on pd=02, StartLba=37c50000,
ErrLba=37c50000
10/10/10 11:57:31: prRecQueue: starting pd=02 recovery - blocking host
commands
10/10/10 11:57:31: EVT#23149-10/10/10 11:57:31: 113=Unexpected sense: PD
02(e0/s2), CDB: 2f 00 37 c5 00 00 00 80 00 00, Sense: f0 00 03 37 c5 00 00
0a 00 00 00 00 11 00 00 00 00 0
10/10/10 11:57:31: prRecGo: Ready to attempt recovery errLBA=37c50000 on
pd=02
10/10/10 11:57:31: prGetLDInfo: MediaErr in ld=0, span=0, arm=2
10/10/10 11:57:31: prRecGo: dataErr found on ld 0 span 0 arm 2
10/10/10 11:57:31: prRecGo: data NOT in cache; cacheLn=ffffffff, row=37c500,
stripe=1866302, refBlk=0, type=0
10/10/10 11:57:31: prRecGo: R5-get cacheLn=a24, c_ptr=a1497b00 mem=a13fdfa8
& setup cInx=533 c=a0dc42a0
10/10/10 11:57:31: RtnFrmPrcRcyRd for arm=7 mem=a13fb30c stripe=37c500
type=0 i=1 status=0
10/10/10 11:57:31: RtnFrmPrcRcyRd for arm=0 mem=a13fb280 stripe=1866300
type=0 i=0 status=1
10/10/10 11:57:31: RtnFrmPrcRcyRd for arm=1 mem=a13fb294 stripe=1866301
type=0 i=0 status=2
10/10/10 11:57:31: RtnFrmPrcRcyRd for arm=3 mem=a13fb2bc stripe=1866303
type=0 i=0 status=4
10/10/10 11:57:31: RtnFrmPrcRcyRd for arm=4 mem=a13fb2d0 stripe=1866304
type=0 i=0 status=5
10/10/10 11:57:31: RtnFrmPrcRcyRd for arm=5 mem=a13fb2e4 stripe=1866305
type=0 i=0 status=6
10/10/10 11:57:31: RtnFrmPrcRcyRd for arm=6 mem=a13fb2f8 stripe=1866306
type=0 i=0 status=7
10/10/10 11:57:39: DEV_REC:Medium Error DevId[0] Tgt 0 retires=0
10/10/10 11:57:39: ErrLBAOffset (0) LBA(37c50000) BadLba=37c50000
10/10/10 11:57:39: EVT#23150-10/10/10 11:57:39: 113=Unexpected sense: PD
00(e0/s0), CDB: 28 00 37 c5 00 00 00 00 01 00, Sense: f0 00 03 37 c5 00 00
0a 00 00 00 00 11 00 00 00 00 0
10/10/10 11:57:39: org resgnBlk=0
10/10/10 11:57:39: EVT#23151-10/10/10 11:57:39:  95=Patrol Read found an
uncorrectable medium error on PD 02(e0/s2) at 37c50000
10/10/10 11:57:55: DEV_REC:Medium Error DevId[0] Tgt 0 retires=0
10/10/10 11:57:55: ErrLBAOffset (0) LBA(379d0000) BadLba=379d0000
10/10/10 11:57:55: prCallback: Medium Error on pd=00, StartLba=379d0000,
ErrLba=379d0000
10/10/10 11:57:55: prRecQueue: starting pd=00 recovery - blocking host
commands
10/10/10 11:57:55: EVT#23152-10/10/10 11:57:55: 113=Unexpected sense: PD
00(e0/s0), CDB: 2f 00 37 9d 00 00 00 80 00 00, Sense: f0 00 03 37 9d 00 00
0a 00 00 00 00 11 00 00 00 00 0
10/10/10 11:57:56: DEV_REC:Medium Error DevId[2] Tgt 2 retires=0
10/10/10 11:57:56: ErrLBAOffset (0) LBA(37c58001) BadLba=37c58001
10/10/10 11:57:56: EVT#23153-10/10/10 11:57:56: 113=Unexpected sense: PD
02(e0/s2), CDB: 2f 00 37 c5 80 01 00 80 00 00, Sense: f0 00 03 37 c5 80 01
0a 00 00 00 00 11 00 00 00 00 0
10/10/10 11:57:56: prCallback: Medium Error on pd=02, StartLba=37c58001,
ErrLba=37c58001
10/10/10 11:57:56: prRecQueue: adding pd=02 to recovery wait queue
10/10/10 11:57:56: prRecGo: Ready to attempt recovery errLBA=379d0000 on
pd=00
10/10/10 11:57:56: prGetLDInfo: MediaErr in ld=0, span=0, arm=0
10/10/10 11:57:56: prRecGo: dataErr found on ld 0 span 0 arm 0
10/10/10 11:57:56: prRecGo: data NOT in cache; cacheLn=ffffffff, row=379d00,
stripe=1854b00, refBlk=0, type=0
10/10/10 11:57:56: prRecGo: R5-get cacheLn=105, c_ptr=a142a3c0 mem=a13fdd00
& setup cInx=532 c=a0dc41e0
10/10/10 11:57:56: RtnFrmPrcRcyRd for arm=7 mem=a140940c stripe=379d00
type=0 i=1 status=0
10/10/10 11:57:56: RtnFrmPrcRcyRd for arm=1 mem=a1409394 stripe=1854b01
type=0 i=0 status=2
10/10/10 11:57:56: RtnFrmPrcRcyRd for arm=2 mem=a14093a8 stripe=1854b02
type=0 i=0 status=3
10/10/10 11:57:56: RtnFrmPrcRcyRd for arm=3 mem=a14093bc stripe=1854b03
type=0 i=0 status=4
10/10/10 11:57:56: RtnFrmPrcRcyRd for arm=4 mem=a14093d0 stripe=1854b04
type=0 i=0 status=5
10/10/10 11:57:56: RtnFrmPrcRcyRd for arm=5 mem=a14093e4 stripe=1854b05
type=0 i=0 status=6
10/10/10 11:57:56: RtnFrmPrcRcyRd for arm=6 mem=a14093f8 stripe=1854b06
type=0 i=0 status=7
10/10/10 11:57:56: Issuing write verify pd=00, arm=0, span=0, blk=379d0000
10/10/10 11:57:56: EVT#23154-10/10/10 11:57:56: 110=Corrected medium error
during recovery on PD 00(e0/s0) at 379d0000
10/10/10 11:58:03: DEV_REC:Medium Error DevId[0] Tgt 0 retires=0
10/10/10 11:58:03: ErrLBAOffset (0) LBA(379d0000) BadLba=379d0000
10/10/10 11:58:03: Write MED ERR!!! ErrLBA(0) LBA(379d0000)
10/10/10 11:58:03: EVT#23155-10/10/10 11:58:03: 113=Unexpected sense: PD
00(e0/s0), CDB: 2e 00 37 9d 00 00 00 00 01 00, Sense: f0 00 03 37 9d 00 00
0a 00 00 00 00 11 00 00 00 00 0
10/10/10 11:58:03: EVT#23156-10/10/10 11:58:03: 108=Reassign write operaiton
failed on PD 00(e0/s0) at 9d370000
10/10/10 11:58:03: EVT#23157-10/10/10 11:58:03:  87=Error on PD 00(e0/s0)
(Error 02)
10/10/10 11:58:03: EVT#23158-10/10/10 11:58:03:  81=State change on VD 00/0
from OPTIMAL(3) to DEGRADED(2)
10/10/10 11:58:03: EVT#23159-10/10/10 11:58:03: 251=VD 00/0 is now DEGRADED
10/10/10 11:58:03: EVT#23160-10/10/10 11:58:03: 114=State change on PD
00(e0/s0) from ONLINE(18) to FAILED(11)
10/10/10 11:58:03: EVT#23161-10/10/10 11:58:03: 108=Reassign write operaiton
failed on PD 00(e0/s0) at 379d0000
10/10/10 11:58:03: EVT#23162-10/10/10 11:58:03:  93=Patrol Read corrected
medium error on PD 00(e0/s0) at 379d0000
10/10/10 11:58:03: prRecGo: Ready to attempt recovery errLBA=37c58001 on
pd=02
10/10/10 11:58:03: prGetLDInfo: MediaErr in ld=0, span=0, arm=2
10/10/10 11:58:03: prRecGo: dataErr found on ld 0 span 0 arm 2
10/10/10 11:58:03: prRecGo: no correction for degraded LD 0 - continue
10/10/10 11:58:03: prDiskCheckOkToRun: PR cannot run on this pd=0 not a
spare and online
10/10/10 11:58:04: DM_ProcessMsg: DevState UnKnown DevId 0 Flags f0400005
Rdm a04a2400 
10/10/10 11:58:04:  MPT_ProcessIo: SMP/STP Completed without ReplyFrame Rdm
a04a2400 Cmd 9 
10/10/10 17:21:38: EVT#23193-10/10/10 17:21:38:  44=Time established as
10/10/10 17:21:38; (18 seconds since power on)
10/10/10 17:21:38: LOAD section: src=9e974184, size=1a88, dst=0,
mode=1...done
10/10/10 17:21:38: LOAD section: src=9e940060, size=32808, dst=a0200000,
mode=1...done
10/10/10 17:21:39: LOAD section: src=9e972871, size=190a, dst=a02850c0,
mode=1...done
10/10/10 17:21:39: CTLR version 1.04-019A Date Aug 13 2007 Time 23:21:34
10/10/10 17:21:39: Vendor Id: 15 SubVId: 1f03 DevId: 1028 SubDevId: 1028
10/10/10 17:22:36: EVT#23194-10/10/10 17:22:36: 149=Battery temperature is
normal
10/10/10 17:22:36: EVT#23195-10/10/10 17:22:36: 147=Battery started charging
10/10/10 17:22:36: EVT#23196-10/10/10 17:22:36: 163=Current capacity of the
battery is above threshold
10/10/10 17:22:38: prDiskStart: starting Patrol Read on PD=01
10/10/10 17:22:38: prDiskStart: starting Patrol Read on PD=02
10/10/10 17:22:38: prDiskStart: starting Patrol Read on PD=03
10/10/10 17:22:38: prDiskStart: starting Patrol Read on PD=04
10/10/10 17:22:38: prDiskStart: starting Patrol Read on PD=05
10/10/10 17:22:38: prDiskStart: starting Patrol Read on PD=06
10/10/10 17:22:38: prDiskStart: starting Patrol Read on PD=07
10/10/10 17:22:38: EVT#23197-10/10/10 17:22:38:  38=Patrol Read resumed
10/10/10 17:22:54: EVT#23198-10/10/10 17:22:54:  94=Patrol Read progress on
PD 06(e0/s6) is 96.59%(16s)
10/10/10 17:22:54: EVT#23199-10/10/10 17:22:54:  94=Patrol Read progress on
PD 04(e0/s4) is 95.28%(16s)
10/10/10 17:22:54: EVT#23200-10/10/10 17:22:54:  94=Patrol Read progress on
PD 03(e0/s3) is 95.57%(16s)
10/10/10 17:22:54: EVT#23201-10/10/10 17:22:54:  94=Patrol Read progress on
PD 07(e0/s7) is 95.98%(16s)
10/10/10 17:22:54: EVT#23202-10/10/10 17:22:54:  94=Patrol Read progress on
PD 05(e0/s5) is 95.93%(16s)
10/10/10 17:22:54: EVT#23203-10/10/10 17:22:54:  94=Patrol Read progress on
PD 02(e0/s2) is 96.02%(16s)
10/10/10 17:22:54: EVT#23204-10/10/10 17:22:54:  94=Patrol Read progress on
PD 01(e0/s1) is 95.39%(16s)
10/10/10 17:23:44: EVT#23205-10/10/10 17:23:44: 105=Rebuild started on PD
00(e0/s0)
10/10/10 17:23:44: EVT#23206-10/10/10 17:23:44: 114=State change on PD
00(e0/s0) from FAILED(11) to REBUILD(14)
10/10/10 17:23:44: prDiskCheckOkToRun: PR cannot run on this pd=6 this
array=0 is rebuilding
10/10/10 17:23:44: prCallback: PR being stopped for pd=06 - state changed
10/10/10 17:23:44: prDiskCheckOkToRun: PR cannot run on this pd=2 this
array=0 is rebuilding
10/10/10 17:23:44: prCallback: PR being stopped for pd=02 - state changed
10/10/10 17:23:44: prDiskCheckOkToRun: PR cannot run on this pd=3 this
array=0 is rebuilding
10/10/10 17:23:44: prCallback: PR being stopped for pd=03 - state changed
10/10/10 17:23:44: prDiskCheckOkToRun: PR cannot run on this pd=7 this
array=0 is rebuilding
10/10/10 17:23:44: prCallback: PR being stopped for pd=07 - state changed
10/10/10 17:23:44: prDiskCheckOkToRun: PR cannot run on this pd=1 this
array=0 is rebuilding
10/10/10 17:23:44: prCallback: PR being stopped for pd=01 - state changed
10/10/10 17:23:44: prDiskCheckOkToRun: PR cannot run on this pd=5 this
array=0 is rebuilding
10/10/10 17:23:44: prCallback: PR being stopped for pd=05 - state changed
10/10/10 17:23:44: prDiskCheckOkToRun: PR cannot run on this pd=4 this
array=0 is rebuilding
10/10/10 17:23:44: prCallback: PR being stopped for pd=04 - state changed
10/10/10 17:23:44: PR cycle complete
10/10/10 17:23:44: EVT#23207-10/10/10 17:23:44:  35=Patrol Read complete
10/10/10 17:23:44: Next PR scheduled to start at 10/17/10  9:19:48
10/10/10 17:24:34: EVT#23208-10/10/10 17:24:34: 103=Rebuild progress on PD
00(e0/s0) is 0.99%(50s)


You can see above where I started the rebuild of PD0. That succeeded without
an issue. Something similar to the above happened today with PD2. PD2 also
shows up with medium errors in the log above, except that back then PD0 was
the one that got marked failed. All of these incidents seem to happen soon
after a medium error recovery on PD4, and there are various errors in prior
months related to PD4.
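
To get a per-drive picture out of the exported log, something like the following Python sketch can tally the DEV_REC medium error lines by device ID and date. It matches only the single-line DEV_REC entries, because the EVT# lines wrap across two lines in this mail. The regular expression and function name are my own assumptions about the log format, based purely on the excerpts pasted here.

```python
# Sketch: tally "DEV_REC:Medium Error" lines per physical disk.
# Assumption: every such entry looks like the ones pasted in this mail.
import re
from collections import defaultdict

DEV_REC = re.compile(
    r"^(\d{2}/\d{2}/\d{2})\s+\d{1,2}:\d{2}:\d{2}: DEV_REC:Medium Error DevId\[(\d+)\]"
)


def medium_errors_by_pd(lines):
    """Map device ID -> list of dates on which a medium error was logged."""
    errors = defaultdict(list)
    for line in lines:
        m = DEV_REC.match(line)
        if m:
            errors[int(m.group(2))].append(m.group(1))
    return dict(errors)


# DEV_REC lines copied from the excerpts above.
SAMPLE_LOG = """\
10/01/10  3:51:40: DEV_REC:Medium Error DevId[4] Tgt 4 retires=0
10/03/10 10:52:13: DEV_REC:Medium Error DevId[4] Tgt 4 retires=0
10/10/10 11:14:58: DEV_REC:Medium Error DevId[4] Tgt 4 retires=0
10/10/10 11:57:31: DEV_REC:Medium Error DevId[2] Tgt 2 retires=0
10/10/10 11:57:39: DEV_REC:Medium Error DevId[0] Tgt 0 retires=0
10/10/10 11:57:55: DEV_REC:Medium Error DevId[0] Tgt 0 retires=0
10/10/10 11:57:56: DEV_REC:Medium Error DevId[2] Tgt 2 retires=0
10/10/10 11:58:03: DEV_REC:Medium Error DevId[0] Tgt 0 retires=0
"""

for pd, dates in sorted(medium_errors_by_pd(SAMPLE_LOG.splitlines()).items()):
    print(f"pd{pd}: {len(dates)} medium error(s) on {sorted(set(dates))}")
```

On just the excerpts pasted in this mail, that gives three medium errors for pd4 spread over 10/01, 10/03 and 10/10, versus three for pd0 and two for pd2, all of them on 10/10. Running it over the full exported log would show whether that pattern holds for the earlier months too.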

My question is, are the issues with PD0 or PD2 going into "failed" state
really a symptom of the issues going on with PD4? Looking at the SMART data
for PD4, I noticed that there are 5 sectors in "pending" state. I believe
this means they are pending reallocation? All other drives (0-3, 5-7) show
no such issues in their SMART data.
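
In case it helps with comparing the drives, here is a small Python sketch that pulls the pending-sector count out of a SMART attribute table. It assumes the table is in the ten-column layout printed by smartmontools' `smartctl -A`, where attribute 197 is Current_Pending_Sector and the raw value is the last column. The sample row is purely illustrative: only the raw value of 5 comes from what I saw on PD4, and the other columns are made up.

```python
# Sketch: read the Current_Pending_Sector raw value (SMART attribute 197)
# from `smartctl -A` style text. Assumption: the usual 10-column layout.


def pending_sectors(attr_table):
    """Return the raw value of attribute 197, or None if it is not listed."""
    for line in attr_table.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[0] == "197":
            return int(fields[9])
    return None


# Illustrative sample only: not copied from the real drive, except for the
# raw value 5 reported for PD4.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       5
"""

print(pending_sectors(SAMPLE))
```

Comparing this value before and after the rebuild would show whether the pending sectors were actually cleared or reallocated.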

My gut says the issue is really with PD4 and that if I can resolve that, I
won't have these mysterious FAILED drive events anymore. I'm in the process
of verifying that the last backup is good. Once I verify that, I plan to
force PD4 to rebuild in hopes that during the rebuild the 5 "pending" sectors
will get corrected and actually reallocated.

I know there are some guys on this list who know more than I do and have more
experience, so I just wanted to share my thoughts/suspicions about PD4 and
get your opinion. What do you think? Is PD4 the culprit, or should I beware
of PD0 and PD2 too?

TIA,
-Bond


_______________________________________________
Linux-PowerEdge mailing list
https://lists.us.dell.com/mailman/listinfo/linux-poweredge
Please read the FAQ at http://lists.us.dell.com/faq
