A correction to my previous note: it looks like the Sense Key messages are 
coming from our external storage, but the I/O errors are from the internal 
storage. Still strange. The external storage is an MD1120, also attached via 
a PERC 6.

--Chris



From: [email protected] 
[mailto:[email protected]] On Behalf Of Chris Trainor
Sent: Tuesday, May 11, 2010 1:50 PM
To: [email protected]
Subject: Massive sense key & IO errors and eventual crashing. R900 with Perc 6i

Hi all,

We're having some odd issues with the internal storage on our R900. We get 
hundreds of SCSI sense key errors reported on all the disks, all day long. 
A few times a week we also get I/O errors, the drives go offline, and the 
system crashes.

Here are some examples:
(just prior to crash)

May 10 02:05:57 mackey kernel: megasas: [20]waiting for 127 commands to complete
May 10 02:06:02 mackey kernel: megasas: [25]waiting for 127 commands to complete
May 10 02:06:07 mackey kernel: megasas: [30]waiting for 127 commands to complete


May 10 02:08:38 mackey kernel: megasas[0]: Frame addr :0xbfaa4800 : <3>megasas[0]: frame count : 0x1, Cmd : 0x2, Tgt id : 0x1, lba lo : 0x63c01bf, lba_hi : 0x0, sense_buf addr : 0x37f47b00,sge count : 0x1
May 10 02:08:38 mackey kernel:
May 10 02:08:38 mackey kernel: megasas[0]: Frame addr :0xbfaa4c00 : <3>megasas[0]: frame count : 0x1, Cmd : 0x2, Tgt id : 0x1, lba lo : 0x63a991f, lba_hi : 0x0, sense_buf addr : 0x37f47b80,sge count : 0x1


May 10 02:08:38 mackey kernel: megasas[0]: Pending Internal cmds in FW :
May 10 02:08:38 mackey kernel: megasas[0]: Dumping Done.
May 10 02:08:38 mackey kernel:
May 10 02:08:38 mackey kernel: megasas: failed to do reset
May 10 02:08:38 mackey kernel: sd 0:2:1:0: megasas: RESET -264942026 cmd=2a 
retries=0
May 10 02:08:38 mackey kernel: megasas: cannot recover from previous reset 
failures
May 10 02:08:38 mackey kernel: sd 0:2:1:0: megasas: RESET -264942026 cmd=2a 
retries=0

May 10 02:08:38 mackey kernel: sd 0:2:0:0: scsi: Device offlined - not ready 
after error recovery
May 10 02:08:38 mackey kernel: sd 0:2:1:0: scsi: Device offlined - not ready 
after error recovery
May 10 02:08:38 mackey last message repeated 106 times


May 10 02:08:38 mackey kernel: sd 0:2:1:0: timing out command, waited 360s
May 10 02:08:38 mackey kernel: sd 0:2:1:0: SCSI error: return code = 0x06000000
May 10 02:08:38 mackey kernel: end_request: I/O error, dev sdb, sector 111033375
May 10 02:08:38 mackey kernel: Buffer I/O error on device dm-5, logical block 
13879116
May 10 02:08:38 mackey kernel: lost page write due to I/O error on dm-5
May 10 02:08:38 mackey kernel: Buffer I/O error on device dm-5, logical block 
13879117
May 10 02:08:38 mackey kernel: lost page write due to I/O error on dm-5
May 10 02:08:38 mackey kernel: Buffer I/O error on device dm-5, logical block 
13879118
(dozens more of the IO errors.... )

May 10 02:08:38 mackey kernel: sd 0:2:1:0: timing out command, waited 360s
May 10 02:08:38 mackey kernel: sd 0:2:1:0: SCSI error: return code = 0x06000000
May 10 02:08:38 mackey kernel: end_request: I/O error, dev sdb, sector 111033767
May 10 02:08:38 mackey kernel: sd 0:2:1:0: timing out command, waited 360s

May 10 02:08:38 mackey kernel: sd 0:2:0:0: rejecting I/O to offline device
May 10 02:08:38 mackey kernel: sd 0:2:0:0: rejecting I/O to offline device
May 10 02:08:38 mackey kernel: Aborting journal on device dm-6.
May 10 02:08:38 mackey kernel: sd 0:2:0:0: rejecting I/O to offline device
May 10 02:08:38 mackey kernel: EXT3-fs error (device dm-6): read_block_bitmap: Cannot read block bitmap - block_group = 53, block_bitmap = 1736704
May 10 02:08:38 mackey kernel: sd 0:2:1:0: rejecting I/O to offline device
May 10 02:08:38 mackey kernel: ext3_abort called.
May 10 02:08:38 mackey kernel: EXT3-fs error (device dm-6): 
ext3_journal_start_sb: Detected aborted journal
May 10 02:08:38 mackey kernel: Remounting filesystem read-only
May 10 02:08:38 mackey kernel: Aborting journal on device dm-5.
May 10 02:08:38 mackey kernel: sd 0:2:1:0: rejecting I/O to offline device
May 10 02:08:38 mackey kernel: __journal_remove_journal_head: freeing 
b_committed_data
May 10 02:08:38 mackey kernel: sd 0:2:0:0: rejecting I/O to offline device
May 10 02:08:38 mackey last message repeated 3 times
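
For anyone reading these logs, the "SCSI error: return code = 0x06000000" 
lines can be split into their component bytes. The following is a minimal 
sketch, assuming the result-word layout used by 2.6-era Linux kernels (status, 
message, host and driver bytes, from low to high). The function name 
decode_scsi_result and the lookup tables are illustrative and not part of any 
tool mentioned in this thread.

```python
# Sketch: split a Linux 2.6-era SCSI result word into its four bytes.
# Assumed layout: bits 0-7 status, 8-15 message, 16-23 host, 24-31 driver.

HOST_BYTE = {
    0x00: "DID_OK", 0x01: "DID_NO_CONNECT", 0x02: "DID_BUS_BUSY",
    0x03: "DID_TIME_OUT", 0x04: "DID_BAD_TARGET", 0x05: "DID_ABORT",
    0x06: "DID_PARITY", 0x07: "DID_ERROR", 0x08: "DID_RESET",
}

DRIVER_BYTE = {
    0x00: "DRIVER_OK", 0x01: "DRIVER_BUSY", 0x02: "DRIVER_SOFT",
    0x03: "DRIVER_MEDIA", 0x04: "DRIVER_ERROR", 0x05: "DRIVER_INVALID",
    0x06: "DRIVER_TIMEOUT", 0x07: "DRIVER_HARD", 0x08: "DRIVER_SENSE",
}


def decode_scsi_result(result):
    """Return the four fields of a SCSI result word as a dict."""
    status = result & 0xFF
    msg = (result >> 8) & 0xFF
    host = (result >> 16) & 0xFF
    driver = (result >> 24) & 0xFF
    return {
        "status": status,
        "msg": msg,
        "host": HOST_BYTE.get(host, hex(host)),
        # The low nibble carries the driver code; on these kernels the
        # high nibble may carry "suggest" bits, which are ignored here.
        "driver": DRIVER_BYTE.get(driver & 0x0F, hex(driver)),
    }


if __name__ == "__main__":
    print(decode_scsi_result(0x06000000))
```

Under that assumption, 0x06000000 has only the driver byte set, to 
DRIVER_TIMEOUT, which matches the "timing out command, waited 360s" lines 
that precede it.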

And the death throes just prior to the crash/reboot. (Obviously the clock is 
off here... need to fix that. :) )

May 10 02:08:38 mackey kernel: sd 0:2:1:0: timing out command, waited 360s
May 10 02:08:38 mackey kernel: sd 0:2:1:0: SCSI error: return code = 0x06000000
May 10 02:08:38 mackey kernel: end_request: I/O error, dev sdb, sector 104737343
May 10 02:08:38 mackey kernel: sd 0:2:1:0: timing out command, waited 360s
May 10 02:08:38 mackey kernel: sd 0:2:1:0: SCSI error: return code = 0x06000000
May 10 02:08:38 mackey kernel: end_request: I/O error, dev sdb, sector 104745543
May 10 02:08:44 mackey kernel: sd 0:2:1:0: rejecting I/O to offline device
May 10 02:08:44 mackey kernel: printk: 1724 messages suppressed.
May 10 02:08:44 mackey kernel: Buffer I/O error on device dm-5, logical block 
13094162
May 10 02:08:44 mackey kernel: lost page write due to I/O error on dm-5
May 10 02:08:44 mackey kernel: sd 0:2:1:0: rejecting I/O to offline device
May  9 22:27:02 mackey syslogd 1.4.1: restart.
May  9 22:27:02 mackey kernel: klogd 1.4.1, log source = /proc/kmsg started.
May  9 22:27:02 mackey kernel: Linux version 2.6.18-128.1.10.el5 ([email protected]) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-44)) #1 SMP Thu May 7 10:35:59 EDT 2009

(after reboot... this is what shows most of the day)


May 10 02:35:52 mackey Server Administrator: Storage Service EventID: 2095  SCSI sense data Sense key:  B Sense code: 4B Sense qualifier:  4:  Physical Disk 1:0:13 Controller 0, Connector 1
May 10 02:35:54 mackey Server Administrator: Storage Service EventID: 2095  SCSI sense data Sense key:  B Sense code: 4B Sense qualifier:  4:  Physical Disk 1:0:2 Controller 0, Connector 1
May 10 02:35:54 mackey Server Administrator: Storage Service EventID: 2095  SCSI sense data Sense key:  B Sense code: 4B Sense qualifier:  4:  Physical Disk 1:0:1 Controller 0, Connector 1
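
The sense triplet in these Server Administrator events can be looked up 
against the standard SCSI tables. Below is a rough sketch that pulls the three 
hex values out of one such log entry and names them. It assumes the entry has 
been rejoined onto a single line if the mail client wrapped it. The table is 
deliberately tiny and covers only a few 4Bh codes, and decode_sense is a 
made-up helper name, not an OMSA or kernel interface.

```python
import re

# Standard SCSI sense key names (subset).
SENSE_KEYS = {
    0x0: "NO SENSE", 0x1: "RECOVERED ERROR", 0x2: "NOT READY",
    0x3: "MEDIUM ERROR", 0x4: "HARDWARE ERROR", 0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION", 0x7: "DATA PROTECT", 0x8: "BLANK CHECK",
    0xB: "ABORTED COMMAND", 0xD: "VOLUME OVERFLOW", 0xE: "MISCOMPARE",
}

# Very small excerpt of the ASC/ASCQ assignments, 4Bh family only.
ASC_ASCQ = {
    (0x4B, 0x00): "DATA PHASE ERROR",
    (0x4B, 0x03): "ACK/NAK TIMEOUT",
    (0x4B, 0x04): "NAK RECEIVED",
}

# Matches the "Sense key: B Sense code: 4B Sense qualifier: 4" portion.
PATTERN = re.compile(
    r"Sense key:\s*([0-9A-Fa-f]+)\s+"
    r"Sense code:\s*([0-9A-Fa-f]+)\s+"
    r"Sense qualifier:\s*([0-9A-Fa-f]+)"
)


def decode_sense(line):
    """Return (sense key name, ASC/ASCQ name), or None if no match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    key, asc, ascq = (int(g, 16) for g in m.groups())
    return (
        SENSE_KEYS.get(key, "UNKNOWN"),
        ASC_ASCQ.get((asc, ascq), "not in this table"),
    )


if __name__ == "__main__":
    sample = ("SCSI sense data Sense key:  B Sense code: 4B "
              "Sense qualifier:  4:  Physical Disk 1:0:13 "
              "Controller 0, Connector 1")
    print(decode_sense(sample))
```

For the events above, B/4B/04 decodes to ABORTED COMMAND with NAK RECEIVED, 
which describes a transport-level condition rather than a media error.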




Any ideas what could be causing this? This is all internal disk, nothing 
external. The system runs CentOS 5.4, kernel 2.6.18-128.1.10.el5 #1 SMP 
Thu May 7 10:35:59 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux


Thanks,
--Chris


Christopher M. Trainor
Manager, IT & Network Operations
Quick Hit, Inc.
o.  508.203.4857
w.  www.quickhit.com



_______________________________________________
Linux-PowerEdge mailing list
[email protected]
https://lists.us.dell.com/mailman/listinfo/linux-poweredge
Please read the FAQ at http://lists.us.dell.com/faq
