Actually, a correction to my previous note: it looks like the Sense Key messages are coming from our external storage, but the I/O errors are from the internal disks. Still strange. The external storage is an MD1120 attached via a PERC 6 as well.
--Chris

From: [email protected] [mailto:[email protected]] On Behalf Of Chris Trainor
Sent: Tuesday, May 11, 2010 1:50 PM
To: [email protected]
Subject: Massive sense key & IO errors and eventual crashing. R900 with Perc 6i

Hi all,

Having some odd issues with our internal storage on our R900. We get hundreds of SCSI sense key errors reported on all the disks all day long. A few times a week we eventually get I/O errors, the drives go offline, and the system crashes. Here are some examples:

(just prior to crash)
May 10 02:05:57 mackey kernel: megasas: [20]waiting for 127 commands to complete
May 10 02:06:02 mackey kernel: megasas: [25]waiting for 127 commands to complete
May 10 02:06:07 mackey kernel: megasas: [30]waiting for 127 commands to complete
May 10 02:08:38 mackey kernel: megasas[0]: Frame addr :0xbfaa4800 : <3>megasas[0]: frame count : 0x1, Cmd : 0x2, Tgt id : 0x1, lba lo : 0x63c01bf, lba_hi : 0x0, sense_buf addr : 0x37f47b00,sge count : 0x1
May 10 02:08:38 mackey kernel:
May 10 02:08:38 mackey kernel: megasas[0]: Frame addr :0xbfaa4c00 : <3>megasas[0]: frame count : 0x1, Cmd : 0x2, Tgt id : 0x1, lba lo : 0x63a991f, lba_hi : 0x0, sense_buf addr : 0x37f47b80,sge count : 0x1
May 10 02:08:38 mackey kernel: megasas[0]: Pending Internal cmds in FW :
May 10 02:08:38 mackey kernel: megasas[0]: Dumping Done.
May 10 02:08:38 mackey kernel:
May 10 02:08:38 mackey kernel: megasas: failed to do reset
May 10 02:08:38 mackey kernel: sd 0:2:1:0: megasas: RESET -264942026 cmd=2a retries=0
May 10 02:08:38 mackey kernel: megasas: cannot recover from previous reset failures
May 10 02:08:38 mackey kernel: sd 0:2:1:0: megasas: RESET -264942026 cmd=2a retries=0
May 10 02:08:38 mackey kernel: sd 0:2:0:0: scsi: Device offlined - not ready after error recovery
May 10 02:08:38 mackey kernel: sd 0:2:1:0: scsi: Device offlined - not ready after error recovery
May 10 02:08:38 mackey last message repeated 106 times
May 10 02:08:38 mackey kernel: sd 0:2:1:0: timing out command, waited 360s
May 10 02:08:38 mackey kernel: sd 0:2:1:0: SCSI error: return code = 0x06000000
May 10 02:08:38 mackey kernel: end_request: I/O error, dev sdb, sector 111033375
May 10 02:08:38 mackey kernel: Buffer I/O error on device dm-5, logical block 13879116
May 10 02:08:38 mackey kernel: lost page write due to I/O error on dm-5
May 10 02:08:38 mackey kernel: Buffer I/O error on device dm-5, logical block 13879117
May 10 02:08:38 mackey kernel: lost page write due to I/O error on dm-5
May 10 02:08:38 mackey kernel: Buffer I/O error on device dm-5, logical block 13879118
(dozens more of the IO errors.... )
May 10 02:08:38 mackey kernel: sd 0:2:1:0: timing out command, waited 360s
May 10 02:08:38 mackey kernel: sd 0:2:1:0: SCSI error: return code = 0x06000000
May 10 02:08:38 mackey kernel: end_request: I/O error, dev sdb, sector 111033767
May 10 02:08:38 mackey kernel: sd 0:2:1:0: timing out command, waited 360s
May 10 02:08:38 mackey kernel: sd 0:2:0:0: rejecting I/O to offline device
May 10 02:08:38 mackey kernel: sd 0:2:0:0: rejecting I/O to offline device
May 10 02:08:38 mackey kernel: Aborting journal on device dm-6.
May 10 02:08:38 mackey kernel: sd 0:2:0:0: rejecting I/O to offline device
May 10 02:08:38 mackey kernel: EXT3-fs error (device dm-6): read_block_bitmap: Cannot read block bitmap - block_group = 53, block_bitmap = 1736704
May 10 02:08:38 mackey kernel: sd 0:2:1:0: rejecting I/O to offline device
May 10 02:08:38 mackey kernel: ext3_abort called.
May 10 02:08:38 mackey kernel: EXT3-fs error (device dm-6): ext3_journal_start_sb: Detected aborted journal
May 10 02:08:38 mackey kernel: Remounting filesystem read-only
May 10 02:08:38 mackey kernel: Aborting journal on device dm-5.
May 10 02:08:38 mackey kernel: sd 0:2:1:0: rejecting I/O to offline device
May 10 02:08:38 mackey kernel: __journal_remove_journal_head: freeing b_committed_data
May 10 02:08:38 mackey kernel: sd 0:2:0:0: rejecting I/O to offline device
May 10 02:08:38 mackey last message repeated 3 times

And the death throes just prior to crashing/reboot. (Obviously the clock is off here... need to fix that. :) )

May 10 02:08:38 mackey kernel: sd 0:2:1:0: timing out command, waited 360s
May 10 02:08:38 mackey kernel: sd 0:2:1:0: SCSI error: return code = 0x06000000
May 10 02:08:38 mackey kernel: end_request: I/O error, dev sdb, sector 104737343
May 10 02:08:38 mackey kernel: sd 0:2:1:0: timing out command, waited 360s
May 10 02:08:38 mackey kernel: sd 0:2:1:0: SCSI error: return code = 0x06000000
May 10 02:08:38 mackey kernel: end_request: I/O error, dev sdb, sector 104745543
May 10 02:08:44 mackey kernel: sd 0:2:1:0: rejecting I/O to offline device
May 10 02:08:44 mackey kernel: printk: 1724 messages suppressed.
May 10 02:08:44 mackey kernel: Buffer I/O error on device dm-5, logical block 13094162
May 10 02:08:44 mackey kernel: lost page write due to I/O error on dm-5
May 10 02:08:44 mackey kernel: sd 0:2:1:0: rejecting I/O to offline device
May 9 22:27:02 mackey syslogd 1.4.1: restart.
May 9 22:27:02 mackey kernel: klogd 1.4.1, log source = /proc/kmsg started.
May 9 22:27:02 mackey kernel: Linux version 2.6.18-128.1.10.el5 ([email protected]<mailto:[email protected]>) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-44)) #1 SMP Thu May 7 10:35:59 EDT 2009

(after reboot... this is what shows most of the day)
May 10 02:35:52 mackey Server Administrator: Storage Service EventID: 2095 SCSI sense data Sense key: B Sense code: 4B Sense qualifier: 4: Physical Disk 1:0:13 Controller 0, Connector 1
May 10 02:35:54 mackey Server Administrator: Storage Service EventID: 2095 SCSI sense data Sense key: B Sense code: 4B Sense qualifier: 4: Physical Disk 1:0:2 Controller 0, Connector 1
May 10 02:35:54 mackey Server Administrator: Storage Service EventID: 2095 SCSI sense data Sense key: B Sense code: 4B Sense qualifier: 4: Physical Disk 1:0:1 Controller 0, Connector 1

Any ideas what could be causing this? This is all internal disk, nothing external.

CentOS 5.4
kernel 2.6.18-128.1.10.el5 #1 SMP Thu May 7 10:35:59 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux

Thanks,
--Chris

Christopher M. Trainor
Manager, IT & Network Operations
Quick Hit, Inc.
o. 508.203.4857
w. www.quickhit.com
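
For anyone triaging logs like these, it may help to count the "EventID: 2095" sense messages per physical disk, to see whether they cluster on one connector or enclosure. Below is a minimal Python sketch, not anything Dell ships. It assumes the Server Administrator lines look exactly like the three quoted above, each on a single unwrapped line; the function name tally_sense_events is made up for illustration. (Sense key B is ABORTED COMMAND in the SCSI spec; the sense code and qualifier are best looked up in the T10 ASC/ASCQ table.)

```python
import re
from collections import Counter

# Regex modelled on the OMSA "EventID: 2095" lines quoted in this message.
# Assumption: the real syslog lines are not wrapped the way email wraps them.
SENSE_RE = re.compile(
    r"EventID: 2095 SCSI sense data "
    r"Sense key: (?P<key>[0-9A-Fa-f]+) "
    r"Sense code: (?P<asc>[0-9A-Fa-f]+) "
    r"Sense qualifier: (?P<ascq>[0-9A-Fa-f]+): "
    r"Physical Disk (?P<disk>[\d:]+) "
    r"Controller (?P<ctrl>\d+), Connector (?P<conn>\d+)"
)


def tally_sense_events(lines):
    """Count sense events per (disk, (key, code, qualifier)) tuple.

    Hypothetical helper: lines that do not match the pattern are skipped.
    """
    counts = Counter()
    for line in lines:
        m = SENSE_RE.search(line)
        if m:
            triple = (m["key"].upper(), m["asc"].upper(), m["ascq"].upper())
            counts[(m["disk"], triple)] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python tally_sense.py < /var/log/messages
    for (disk, triple), n in tally_sense_events(sys.stdin).most_common():
        print(f"Physical Disk {disk}: key/code/qualifier {'/'.join(triple)} x {n}")
```

If the counts are spread evenly across every disk behind one connector, that would point more toward the shared path (cable, enclosure, controller) than toward any single drive, though that is only a rule of thumb.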
_______________________________________________ Linux-PowerEdge mailing list [email protected] https://lists.us.dell.com/mailman/listinfo/linux-poweredge Please read the FAQ at http://lists.us.dell.com/faq
