Hello List,

This past month I've seen some unusual activity on a PERC 5/i with 8x SATA drives (500GB enterprise drives) in RAID-5 (vd0). Earlier this month, I noticed that drive 0 (pd0) had been marked as "failed" and vd0 had become degraded. Strangely, although the array was only marked as "degraded", vd0 was no longer accessible. The controller was in some critical state and locked up; fortunately, the OS is not installed on the PERC 5/i. The drive itself appeared to be fine, so after a reboot into the PERC BIOS I had it rebuild, which completed fine, and vd0 was back to 'optimal'. I looked at the SMART data from pd0 and didn't see any errors at all, so I didn't look into it any further at the time.

Today, pd2 got marked as failed and vd0 was degraded again. And again, the controller went into a critical state and vd0 was no longer accessible, even though omreport showed it only in the 'degraded' state. As before, after a reboot I set pd2 to rebuild and that succeeded without issue. vd0 is now 'optimal' again.

Since this happened twice in one month, I decided to look closer this time. I exported the log from the controller and have pasted some of the relevant info below. It appears that during patrol reads and consistency checks (scheduled to run at the beginning of each month), I am getting medium errors on pd4. I've looked further back in the logs for September and August, and I've similarly seen some errors on pd4 that were recovered.

Here is one of the errors on pd4 during the monthly consistency check:

10/01/10 3:51:29: EVT#22851-10/01/10 3:51:29: 65=Consistency Check progress on VD 00/0 is 74.86%(6735s)
10/01/10 3:51:40: DEV_REC:Medium Error DevId[4] Tgt 4 retires=0
10/01/10 3:51:40: ErrLBAOffset (31) LBA(2b968f00) BadLba=2b968f31
10/01/10 3:51:40: EVT#22852-10/01/10 3:51:40: 113=Unexpected sense: PD 04(e0/s4), CDB: 28 00 2b 96 8f 00 00 01 00 00, Sense: f0 00 03 2b 96 8f 31 0a 00 00 00 00 11 00 00 00 00 0
10/01/10 3:51:40: EVT#22853-10/01/10 3:51:40: 57=Consistency Check corrected medium error (VD 00/0 at 2b968f31, PD 04(e0/s4) at 2b968f31)
10/01/10 3:51:56: EVT#22854-10/01/10 3:51:56: 65=Consistency Check progress on VD 00/0 is 75.11%(6762s)

Below are the messages from the patrol read that had a problem with pd4 too:

10/03/10 10:43:58: EVT#23025-10/03/10 10:43:58: 94=Patrol Read progress on PD 01(e0/s1) is 70.00%(6533s)
10/03/10 10:52:13: DEV_REC:Medium Error DevId[4] Tgt 4 retires=0
10/03/10 10:52:13: ErrLBAOffset (2616) LBA(2b968000) BadLba=2b96a616
10/03/10 10:52:13: prCallback: Medium Error on pd=04, StartLba=2b968000, ErrLba=2b96a616
10/03/10 10:52:13: prRecQueue: starting pd=04 recovery - blocking host commands
10/03/10 10:52:13: EVT#23026-10/03/10
10:52:13: 113=Unexpected sense: PD 04(e0/s4), CDB: 2f 00 2b 96 80 00 00 80 00 00, Sense: f0 00 03 2b 96 a6 16 0a 00 00 00 00 11 00 00 00 00 0
10/03/10 10:52:14: prRecGo: Ready to attempt recovery errLBA=2b96a616 on pd=04
10/03/10 10:52:14: prGetLDInfo: MediaErr in ld=0, span=0, arm=4
10/03/10 10:52:14: prRecGo: dataErr found on ld 0 span 0 arm 4
10/03/10 10:52:14: prRecGo: data NOT in cache; cacheLn=ffffffff, row=2b96a6, stripe=1311e8c, refBlk=16, type=0
10/03/10 10:52:14: prRecGo: R5-get cacheLn=f4, c_ptr=a1429700 mem=a13e6150 & setup cInx=49a c=a0dbcfe0
10/03/10 10:52:14: RtnFrmPrcRcyRd for arm=1 mem=a140aa14 stripe=2b96a6 type=0 i=1 status=0
10/03/10 10:52:14: RtnFrmPrcRcyRd for arm=2 mem=a140aa28 stripe=1311e8a type=0 i=0 status=1
10/03/10 10:52:14: RtnFrmPrcRcyRd for arm=3 mem=a140aa3c stripe=1311e8b type=0 i=0 status=2
10/03/10 10:52:14: RtnFrmPrcRcyRd for arm=5 mem=a140aa64 stripe=1311e8d type=0 i=0 status=4
10/03/10 10:52:14: RtnFrmPrcRcyRd for arm=6 mem=a140aa78 stripe=1311e8e type=0 i=0 status=5
10/03/10 10:52:14: RtnFrmPrcRcyRd for arm=7 mem=a140aa8c stripe=1311e8f type=0 i=0 status=6
10/03/10 10:52:14: RtnFrmPrcRcyRd for arm=0 mem=a140aa00 stripe=1311e90 type=0 i=0 status=7
10/03/10 10:52:14: Issuing write verify pd=04, arm=4, span=0, blk=2b96a616
10/03/10 10:52:14: EVT#23027-10/03/10 10:52:14: 110=Corrected medium error during recovery on PD 04(e0/s4) at 2b96a616
10/03/10 10:52:14: EVT#23028-10/03/10 10:52:14: 93=Patrol Read corrected medium error on PD 04(e0/s4) at 2b96a616
10/03/10 10:58:51: EVT#23029-10/03/10 10:58:51: 94=Patrol Read progress on PD 05(e0/s5) is 79.99%(7426s)

Here's another error on pd4, but then we find problems on pd0 and pd2. Eventually PD0 is marked as failed (this was the 1st event of the month) and VD0 becomes degraded and the controller goes critical and locks up.

10/10/10 11:05:48: EVT#23129-10/10/10 11:05:48: 94=Patrol Read progress on PD 04(e0/s4) is 70.00%(6360s)
10/10/10 11:14:58: DEV_REC:Medium Error DevId[4] Tgt 4 retires=0
10/10/10 11:14:58: ErrLBAOffset (36b9) LBA(2b968000) BadLba=2b96b6b9
10/10/10 11:14:58: EVT#23130-10/10/10 11:14:58: 113=Unexpected sense: PD 04(e0/s4), CDB: 2f 00 2b 96 80 00 00 80 00 00, Sense: f0 00 03 2b 96 b6 b9 0a 00 00 00 00 11 00 00 00 00 0
10/10/10 11:14:58: prCallback: Medium Error on pd=04, StartLba=2b968000, ErrLba=2b96b6b9
10/10/10 11:14:58: prRecQueue: starting pd=04 recovery - blocking host commands
10/10/10 11:14:58: prRecGo: Ready to attempt recovery errLBA=2b96b6b9 on pd=04
10/10/10 11:14:59: prGetLDInfo: MediaErr in ld=0, span=0, arm=4
10/10/10 11:14:59: prRecGo: dataErr found on ld 0 span 0 arm 4
10/10/10 11:14:59: prRecGo: data NOT in cache; cacheLn=ffffffff, row=2b96b6, stripe=1311efc, refBlk=b9, type=0
10/10/10 11:14:59: prRecGo: R5-get cacheLn=541, c_ptr=a145d0c0 mem=a1405550 & setup cInx=562 c=a0dc65e0
10/10/10 11:14:59: RtnFrmPrcRcyRd for arm=1 mem=a13ef214 stripe=2b96b6 type=0 i=1 status=0
10/10/10 11:14:59: RtnFrmPrcRcyRd for arm=2 mem=a13ef228 stripe=1311efa type=0 i=0 status=1
10/10/10 11:14:59: RtnFrmPrcRcyRd for arm=3 mem=a13ef23c stripe=1311efb type=0 i=0 status=2
10/10/10 11:14:59: RtnFrmPrcRcyRd for arm=5 mem=a13ef264 stripe=1311efd type=0 i=0 status=4
10/10/10 11:14:59: RtnFrmPrcRcyRd for arm=6 mem=a13ef278 stripe=1311efe type=0 i=0 status=5
10/10/10 11:14:59: RtnFrmPrcRcyRd for arm=7 mem=a13ef28c stripe=1311eff type=0 i=0 status=6
10/10/10 11:14:59: RtnFrmPrcRcyRd for arm=0 mem=a13ef200 stripe=1311f00 type=0 i=0 status=7
10/10/10 11:14:59: Issuing write verify pd=04, arm=4, span=0, blk=2b96b6b9
10/10/10 11:14:59: EVT#23131-10/10/10 11:14:59: 110=Corrected medium error during recovery on PD 04(e0/s4) at 2b96b6b9
10/10/10 11:14:59: EVT#23132-10/10/10 11:14:59: 93=Patrol Read corrected medium error on PD 04(e0/s4) at 2b96b6b9
10/10/10 11:22:22: EVT#23133-10/10/10
11:22:22: 94=Patrol Read progress on PD 06(e0/s6) is 79.99%(7354s)
10/10/10 11:23:32: EVT#23134-10/10/10 11:23:32: 94=Patrol Read progress on PD 02(e0/s2) is 79.99%(7424s)
10/10/10 11:23:55: EVT#23135-10/10/10 11:23:55: 94=Patrol Read progress on PD 07(e0/s7) is 79.99%(7447s)
10/10/10 11:24:24: EVT#23136-10/10/10 11:24:24: 94=Patrol Read progress on PD 05(e0/s5) is 79.99%(7476s)
10/10/10 11:24:33: EVT#23137-10/10/10 11:24:33: 94=Patrol Read progress on PD 00(e0/s0) is 79.99%(7485s)
10/10/10 11:24:38: EVT#23138-10/10/10 11:24:38: 94=Patrol Read progress on PD 03(e0/s3) is 79.99%(7490s)
10/10/10 11:25:12: EVT#23139-10/10/10 11:25:12: 94=Patrol Read progress on PD 04(e0/s4) is 79.99%(7524s)
10/10/10 11:25:18: EVT#23140-10/10/10 11:25:18: 94=Patrol Read progress on PD 01(e0/s1) is 79.99%(7530s)
10/10/10 11:43:19: EVT#23141-10/10/10 11:43:19: 94=Patrol Read progress on PD 06(e0/s6) is 89.99%(8611s)
10/10/10 11:44:42: EVT#23142-10/10/10 11:44:42: 94=Patrol Read progress on PD 02(e0/s2) is 89.99%(8694s)
10/10/10 11:45:28: EVT#23143-10/10/10 11:45:28: 94=Patrol Read progress on PD 05(e0/s5) is 89.99%(8740s)
10/10/10 11:45:34: EVT#23144-10/10/10 11:45:34: 94=Patrol Read progress on PD 07(e0/s7) is 89.99%(8746s)
10/10/10 11:45:37: EVT#23145-10/10/10 11:45:37: 94=Patrol Read progress on PD 00(e0/s0) is 89.99%(8749s)
10/10/10 11:46:25: EVT#23146-10/10/10 11:46:25: 94=Patrol Read progress on PD 03(e0/s3) is 89.99%(8797s)
10/10/10 11:46:53: EVT#23147-10/10/10 11:46:53: 94=Patrol Read progress on PD 01(e0/s1) is 89.99%(8825s)
10/10/10 11:47:10: EVT#23148-10/10/10 11:47:10: 94=Patrol Read progress on PD 04(e0/s4) is 89.99%(8842s)
10/10/10 11:57:31: DEV_REC:Medium Error DevId[2] Tgt 2 retires=0
10/10/10 11:57:31: ErrLBAOffset (0) LBA(37c50000) BadLba=37c50000
10/10/10 11:57:31: prCallback: Medium Error on pd=02, StartLba=37c50000, ErrLba=37c50000
10/10/10 11:57:31: prRecQueue: starting pd=02 recovery - blocking host commands
10/10/10 11:57:31: EVT#23149-10/10/10 11:57:31:
113=Unexpected sense: PD 02(e0/s2), CDB: 2f 00 37 c5 00 00 00 80 00 00, Sense: f0 00 03 37 c5 00 00 0a 00 00 00 00 11 00 00 00 00 0
10/10/10 11:57:31: prRecGo: Ready to attempt recovery errLBA=37c50000 on pd=02
10/10/10 11:57:31: prGetLDInfo: MediaErr in ld=0, span=0, arm=2
10/10/10 11:57:31: prRecGo: dataErr found on ld 0 span 0 arm 2
10/10/10 11:57:31: prRecGo: data NOT in cache; cacheLn=ffffffff, row=37c500, stripe=1866302, refBlk=0, type=0
10/10/10 11:57:31: prRecGo: R5-get cacheLn=a24, c_ptr=a1497b00 mem=a13fdfa8 & setup cInx=533 c=a0dc42a0
10/10/10 11:57:31: RtnFrmPrcRcyRd for arm=7 mem=a13fb30c stripe=37c500 type=0 i=1 status=0
10/10/10 11:57:31: RtnFrmPrcRcyRd for arm=0 mem=a13fb280 stripe=1866300 type=0 i=0 status=1
10/10/10 11:57:31: RtnFrmPrcRcyRd for arm=1 mem=a13fb294 stripe=1866301 type=0 i=0 status=2
10/10/10 11:57:31: RtnFrmPrcRcyRd for arm=3 mem=a13fb2bc stripe=1866303 type=0 i=0 status=4
10/10/10 11:57:31: RtnFrmPrcRcyRd for arm=4 mem=a13fb2d0 stripe=1866304 type=0 i=0 status=5
10/10/10 11:57:31: RtnFrmPrcRcyRd for arm=5 mem=a13fb2e4 stripe=1866305 type=0 i=0 status=6
10/10/10 11:57:31: RtnFrmPrcRcyRd for arm=6 mem=a13fb2f8 stripe=1866306 type=0 i=0 status=7
10/10/10 11:57:39: DEV_REC:Medium Error DevId[0] Tgt 0 retires=0
10/10/10 11:57:39: ErrLBAOffset (0) LBA(37c50000) BadLba=37c50000
10/10/10 11:57:39: EVT#23150-10/10/10 11:57:39: 113=Unexpected sense: PD 00(e0/s0), CDB: 28 00 37 c5 00 00 00 00 01 00, Sense: f0 00 03 37 c5 00 00 0a 00 00 00 00 11 00 00 00 00 0
10/10/10 11:57:39: org resgnBlk=0
10/10/10 11:57:39: EVT#23151-10/10/10 11:57:39: 95=Patrol Read found an uncorrectable medium error on PD 02(e0/s2) at 37c50000
10/10/10 11:57:55: DEV_REC:Medium Error DevId[0] Tgt 0 retires=0
10/10/10 11:57:55: ErrLBAOffset (0) LBA(379d0000) BadLba=379d0000
10/10/10 11:57:55: prCallback: Medium Error on pd=00, StartLba=379d0000, ErrLba=379d0000
10/10/10 11:57:55: prRecQueue: starting pd=00 recovery - blocking host commands
10/10/10 11:57:55:
EVT#23152-10/10/10 11:57:55: 113=Unexpected sense: PD 00(e0/s0), CDB: 2f 00 37 9d 00 00 00 80 00 00, Sense: f0 00 03 37 9d 00 00 0a 00 00 00 00 11 00 00 00 00 0
10/10/10 11:57:56: DEV_REC:Medium Error DevId[2] Tgt 2 retires=0
10/10/10 11:57:56: ErrLBAOffset (0) LBA(37c58001) BadLba=37c58001
10/10/10 11:57:56: EVT#23153-10/10/10 11:57:56: 113=Unexpected sense: PD 02(e0/s2), CDB: 2f 00 37 c5 80 01 00 80 00 00, Sense: f0 00 03 37 c5 80 01 0a 00 00 00 00 11 00 00 00 00 0
10/10/10 11:57:56: prCallback: Medium Error on pd=02, StartLba=37c58001, ErrLba=37c58001
10/10/10 11:57:56: prRecQueue: adding pd=02 to recovery wait queue
10/10/10 11:57:56: prRecGo: Ready to attempt recovery errLBA=379d0000 on pd=00
10/10/10 11:57:56: prGetLDInfo: MediaErr in ld=0, span=0, arm=0
10/10/10 11:57:56: prRecGo: dataErr found on ld 0 span 0 arm 0
10/10/10 11:57:56: prRecGo: data NOT in cache; cacheLn=ffffffff, row=379d00, stripe=1854b00, refBlk=0, type=0
10/10/10 11:57:56: prRecGo: R5-get cacheLn=105, c_ptr=a142a3c0 mem=a13fdd00 & setup cInx=532 c=a0dc41e0
10/10/10 11:57:56: RtnFrmPrcRcyRd for arm=7 mem=a140940c stripe=379d00 type=0 i=1 status=0
10/10/10 11:57:56: RtnFrmPrcRcyRd for arm=1 mem=a1409394 stripe=1854b01 type=0 i=0 status=2
10/10/10 11:57:56: RtnFrmPrcRcyRd for arm=2 mem=a14093a8 stripe=1854b02 type=0 i=0 status=3
10/10/10 11:57:56: RtnFrmPrcRcyRd for arm=3 mem=a14093bc stripe=1854b03 type=0 i=0 status=4
10/10/10 11:57:56: RtnFrmPrcRcyRd for arm=4 mem=a14093d0 stripe=1854b04 type=0 i=0 status=5
10/10/10 11:57:56: RtnFrmPrcRcyRd for arm=5 mem=a14093e4 stripe=1854b05 type=0 i=0 status=6
10/10/10 11:57:56: RtnFrmPrcRcyRd for arm=6 mem=a14093f8 stripe=1854b06 type=0 i=0 status=7
10/10/10 11:57:56: Issuing write verify pd=00, arm=0, span=0, blk=379d0000
10/10/10 11:57:56: EVT#23154-10/10/10 11:57:56: 110=Corrected medium error during recovery on PD 00(e0/s0) at 379d0000
10/10/10 11:58:03: DEV_REC:Medium Error DevId[0] Tgt 0 retires=0
10/10/10 11:58:03: ErrLBAOffset (0) LBA(379d0000)
BadLba=379d0000
10/10/10 11:58:03: Write MED ERR!!! ErrLBA(0) LBA(379d0000)
10/10/10 11:58:03: EVT#23155-10/10/10 11:58:03: 113=Unexpected sense: PD 00(e0/s0), CDB: 2e 00 37 9d 00 00 00 00 01 00, Sense: f0 00 03 37 9d 00 00 0a 00 00 00 00 11 00 00 00 00 0
10/10/10 11:58:03: EVT#23156-10/10/10 11:58:03: 108=Reassign write operaiton failed on PD 00(e0/s0) at 9d370000
10/10/10 11:58:03: EVT#23157-10/10/10 11:58:03: 87=Error on PD 00(e0/s0) (Error 02)
10/10/10 11:58:03: EVT#23158-10/10/10 11:58:03: 81=State change on VD 00/0 from OPTIMAL(3) to DEGRADED(2)
10/10/10 11:58:03: EVT#23159-10/10/10 11:58:03: 251=VD 00/0 is now DEGRADED
10/10/10 11:58:03: EVT#23160-10/10/10 11:58:03: 114=State change on PD 00(e0/s0) from ONLINE(18) to FAILED(11)
10/10/10 11:58:03: EVT#23161-10/10/10 11:58:03: 108=Reassign write operaiton failed on PD 00(e0/s0) at 379d0000
10/10/10 11:58:03: EVT#23162-10/10/10 11:58:03: 93=Patrol Read corrected medium error on PD 00(e0/s0) at 379d0000
10/10/10 11:58:03: prRecGo: Ready to attempt recovery errLBA=37c58001 on pd=02
10/10/10 11:58:03: prGetLDInfo: MediaErr in ld=0, span=0, arm=2
10/10/10 11:58:03: prRecGo: dataErr found on ld 0 span 0 arm 2
10/10/10 11:58:03: prRecGo: no correction for degraded LD 0 - continue
10/10/10 11:58:03: prDiskCheckOkToRun: PR cannot run on this pd=0 not a spare and online
10/10/10 11:58:04: DM_ProcessMsg: DevState UnKnown DevId 0 Flags f0400005 Rdm a04a2400
10/10/10 11:58:04: MPT_ProcessIo: SMP/STP Completed without ReplyFrame Rdm a04a2400 Cmd 9
10/10/10 17:21:38: EVT#23193-10/10/10 17:21:38: 44=Time established as 10/10/10 17:21:38; (18 seconds since power on)
10/10/10 17:21:38: LOAD section: src=9e974184, size=1a88, dst=0, mode=1...done
10/10/10 17:21:38: LOAD section: src=9e940060, size=32808, dst=a0200000, mode=1...done
10/10/10 17:21:39: LOAD section: src=9e972871, size=190a, dst=a02850c0, mode=1...done
10/10/10 17:21:39: CTLR version 1.04-019A Date Aug 13 2007 Time 23:21:34
10/10/10 17:21:39: Vendor Id: 15 SubVId:
1f03 DevId: 1028 SubDevId: 1028
10/10/10 17:22:36: EVT#23194-10/10/10 17:22:36: 149=Battery temperature is normal
10/10/10 17:22:36: EVT#23195-10/10/10 17:22:36: 147=Battery started charging
10/10/10 17:22:36: EVT#23196-10/10/10 17:22:36: 163=Current capacity of the battery is above threshold
10/10/10 17:22:38: prDiskStart: starting Patrol Read on PD=01
10/10/10 17:22:38: prDiskStart: starting Patrol Read on PD=02
10/10/10 17:22:38: prDiskStart: starting Patrol Read on PD=03
10/10/10 17:22:38: prDiskStart: starting Patrol Read on PD=04
10/10/10 17:22:38: prDiskStart: starting Patrol Read on PD=05
10/10/10 17:22:38: prDiskStart: starting Patrol Read on PD=06
10/10/10 17:22:38: prDiskStart: starting Patrol Read on PD=07
10/10/10 17:22:38: EVT#23197-10/10/10 17:22:38: 38=Patrol Read resumed
10/10/10 17:22:54: EVT#23198-10/10/10 17:22:54: 94=Patrol Read progress on PD 06(e0/s6) is 96.59%(16s)
10/10/10 17:22:54: EVT#23199-10/10/10 17:22:54: 94=Patrol Read progress on PD 04(e0/s4) is 95.28%(16s)
10/10/10 17:22:54: EVT#23200-10/10/10 17:22:54: 94=Patrol Read progress on PD 03(e0/s3) is 95.57%(16s)
10/10/10 17:22:54: EVT#23201-10/10/10 17:22:54: 94=Patrol Read progress on PD 07(e0/s7) is 95.98%(16s)
10/10/10 17:22:54: EVT#23202-10/10/10 17:22:54: 94=Patrol Read progress on PD 05(e0/s5) is 95.93%(16s)
10/10/10 17:22:54: EVT#23203-10/10/10 17:22:54: 94=Patrol Read progress on PD 02(e0/s2) is 96.02%(16s)
10/10/10 17:22:54: EVT#23204-10/10/10 17:22:54: 94=Patrol Read progress on PD 01(e0/s1) is 95.39%(16s)
10/10/10 17:23:44: EVT#23205-10/10/10 17:23:44: 105=Rebuild started on PD 00(e0/s0)
10/10/10 17:23:44: EVT#23206-10/10/10 17:23:44: 114=State change on PD 00(e0/s0) from FAILED(11) to REBUILD(14)
10/10/10 17:23:44: prDiskCheckOkToRun: PR cannot run on this pd=6 this array=0 is rebuilding
10/10/10 17:23:44: prCallback: PR being stopped for pd=06 - state changed
10/10/10 17:23:44: prDiskCheckOkToRun: PR cannot run on this pd=2 this array=0 is rebuilding
10/10/10 17:23:44:
prCallback: PR being stopped for pd=02 - state changed
10/10/10 17:23:44: prDiskCheckOkToRun: PR cannot run on this pd=3 this array=0 is rebuilding
10/10/10 17:23:44: prCallback: PR being stopped for pd=03 - state changed
10/10/10 17:23:44: prDiskCheckOkToRun: PR cannot run on this pd=7 this array=0 is rebuilding
10/10/10 17:23:44: prCallback: PR being stopped for pd=07 - state changed
10/10/10 17:23:44: prDiskCheckOkToRun: PR cannot run on this pd=1 this array=0 is rebuilding
10/10/10 17:23:44: prCallback: PR being stopped for pd=01 - state changed
10/10/10 17:23:44: prDiskCheckOkToRun: PR cannot run on this pd=5 this array=0 is rebuilding
10/10/10 17:23:44: prCallback: PR being stopped for pd=05 - state changed
10/10/10 17:23:44: prDiskCheckOkToRun: PR cannot run on this pd=4 this array=0 is rebuilding
10/10/10 17:23:44: prCallback: PR being stopped for pd=04 - state changed
10/10/10 17:23:44: PR cycle complete
10/10/10 17:23:44: EVT#23207-10/10/10 17:23:44: 35=Patrol Read complete
10/10/10 17:23:44: Next PR scheduled to start at 10/17/10 9:19:48
10/10/10 17:24:34: EVT#23208-10/10/10 17:24:34: 103=Rebuild progress on PD 00(e0/s0) is 0.99%(50s)

You can see above where I started the rebuild of PD0. That succeeded without an issue. Something similar happened today with PD2, which had also shown errors in the log above, although at that time PD0 was the one that got marked failed. All of these incidents seem to happen soon after a medium error recovery on PD4, and there are various errors in prior months related to PD4.

My question is: are PD0 and PD2 going into the "failed" state really a symptom of the issues going on with PD4?

Looking at the SMART data for PD4, I noticed that there are 5 sectors in the "pending" state. I believe this means they are pending reallocation? All other drives (0-3, 5-7) show no such issues in their SMART data.

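For anyone who wants to cross-check the "Unexpected sense" events in the excerpts above, here is a minimal Python sketch that decodes them and tallies them per PD. It is only an illustration under stated assumptions: the regular expression is fitted to the log lines as pasted here, the byte offsets assume standard fixed-format SCSI sense data (sense key in byte 2, LBA in bytes 3-6, ASC/ASCQ in bytes 12-13), and the names `EVENT_RE`, `OPCODES`, `decode` and `SAMPLE` are made up for this example. The three sample lines are copied from the log above.

```python
import re
from collections import Counter

# Pattern fitted to the "113=Unexpected sense" lines pasted above (an assumption,
# not a documented log format).
EVENT_RE = re.compile(
    r"113=Unexpected sense: PD (?P<pd>\d+)\(e\d+/s\d+\), "
    r"CDB: (?P<cdb>(?:[0-9a-f]{2} ?)+), "
    r"Sense: (?P<sense>(?:[0-9a-f]{2} ?)+)"
)

# SCSI opcodes that appear in the excerpts.
OPCODES = {0x28: "READ(10)", 0x2E: "WRITE AND VERIFY(10)", 0x2F: "VERIFY(10)"}


def decode(line):
    """Return the decoded fields of one event line, or None if it does not match."""
    m = EVENT_RE.search(line)
    if m is None:
        return None
    cdb = bytes.fromhex(m["cdb"])
    sense = bytes.fromhex(m["sense"])  # the trailing half byte in the log is dropped
    return {
        "pd": int(m["pd"]),
        "op": OPCODES.get(cdb[0], hex(cdb[0])),
        "start_lba": int.from_bytes(cdb[2:6], "big"),   # 10-byte CDB: LBA in bytes 2-5
        "sense_key": sense[2] & 0x0F,                    # 3 = MEDIUM ERROR
        "bad_lba": int.from_bytes(sense[3:7], "big"),   # information field
        "asc": sense[12],                                # 0x11 = unrecovered read error
        "ascq": sense[13],
    }


SAMPLE = [
    "10/01/10 3:51:40: EVT#22852-10/01/10 3:51:40: 113=Unexpected sense: PD 04(e0/s4), "
    "CDB: 28 00 2b 96 8f 00 00 01 00 00, Sense: f0 00 03 2b 96 8f 31 0a 00 00 00 00 11 00 00 00 00 0",
    "10/10/10 11:57:31: EVT#23149-10/10/10 11:57:31: 113=Unexpected sense: PD 02(e0/s2), "
    "CDB: 2f 00 37 c5 00 00 00 80 00 00, Sense: f0 00 03 37 c5 00 00 0a 00 00 00 00 11 00 00 00 00 0",
    "10/10/10 11:58:03: EVT#23155-10/10/10 11:58:03: 113=Unexpected sense: PD 00(e0/s0), "
    "CDB: 2e 00 37 9d 00 00 00 00 01 00, Sense: f0 00 03 37 9d 00 00 0a 00 00 00 00 11 00 00 00 00 0",
]

events = [e for e in map(decode, SAMPLE) if e is not None]
per_pd = Counter(e["pd"] for e in events)

for e in events:
    print(
        f"PD {e['pd']}: {e['op']} from LBA {e['start_lba']:#x} failed at {e['bad_lba']:#x} "
        f"(sense key {e['sense_key']:#x}, ASC/ASCQ {e['asc']:02x}/{e['ascq']:02x})"
    )
print(dict(per_pd))
```

Fed the full exported log instead of `SAMPLE`, the same tally would show how the medium errors are distributed across the drives, which is the comparison I am trying to make between PD4 and PD0/PD2.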
My gut thinks the issue is really with PD4, and if I can resolve that, I won't have these mysterious FAILED drive events anymore. I'm in the process of verifying that the last backup is good. Once I verify that, I plan to force PD4 to rebuild in the hope that during the rebuild the 5 "pending" sectors will get corrected and actually reallocated.

I know there are some guys on this list who know more than me and have more experience, so I just wanted to share my thoughts/suspicions about PD4 and get your opinion. What do you think? Is PD4 the culprit, or should I beware of PD0 and PD2 too?

TIA,
-Bond
_______________________________________________
Linux-PowerEdge mailing list
[email protected]
https://lists.us.dell.com/mailman/listinfo/linux-poweredge
Please read the FAQ at http://lists.us.dell.com/faq
