Hi Dell Linux Folks,

Our site recently purchased a number of Dell R710s, each with two Perc6/E controllers and three MD1000s SAS-connected to each controller, for six MD1000s per R710.
We are running Scientific Linux 5.3, which is very close to RHEL5. We have recently seen two critical failures on different nodes, in which the megasas driver and Perc6/E controller become entirely unresponsive and all volumes associated with the controller go offline. I'm curious whether anyone on this list has seen anything of this sort before and has any suggestions for a fix. So far we've been able to duplicate it by applying a large volume of writes over the network into this storage. The only way to get out of this state is a power cycle, as the kernel won't even shut down cleanly once it gets there. A more complete description of the problem is below.

At some point under write load, the RAID controller or enclosure falls into a state where it fails to communicate correctly with the OS. This causes at least one virtual disk device (e.g. /dev/sde), and usually all other virtual disk devices on the controller, to fail. The failure is complete: no data can be read or written, and an ls -l shows a 'no available memory' error. Once we fall into this state there is no remedy except to reboot the whole server. After the reboot, the stack is reset and the machine works as expected.

Looking into the details of the driver and firmware version stack, I find the following:

We are running the latest firmware on these MD1000s: A.04
We are running the latest firmware on the Perc6/E: 6.2.0-0013
We are running a slightly older megasas driver than Dell recommends on their site: 00.00.04.08-RH2; Dell suggests 00.00.04.17. We have upgraded two nodes to the newer driver version and are testing with that configuration as well.

So far we have seen two of these failures, so more testing will be needed before we have a good idea of how often this problem can be expected.

Anyone else have any ideas or questions?
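For anyone comparing notes, here is a minimal sketch of how the driver and firmware details above can be collected. The exact tooling is an assumption on my part: modinfo and dmesg are standard, while omreport only exists if Dell OpenManage is installed.

```shell
#!/bin/sh
# Sketch: gather megaraid_sas driver and controller firmware details.
# Nothing here modifies state; tools may differ per system.

# Version of the loaded megaraid_sas driver (graceful fallback if absent)
modinfo -F version megaraid_sas 2>/dev/null || echo "megaraid_sas module not found"

# Lines the driver printed at load time (these include the FW version)
dmesg | grep -i megasas | head -n 20

# PERC firmware via Dell OpenManage, if it happens to be installed
command -v omreport >/dev/null 2>&1 && omreport storage controller \
    || echo "omreport not installed"
```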
Details of the symptoms of the failure on uct2-s8

The issue starts with this message:

  sd 2:2:2:0: megasas: RESET -12716530 cmd=8a retries=0

Basically we're trying to send a RESET across the SAS channel. This is a relatively common activity for SAS when dealing with a physical device timeout, but in this case it fails. Once we see this message:

  megasas: failed to do reset

we know we've failed and that we've got larger issues:

  sd 2:2:2:0: megasas: RESET -12716530 cmd=8a retries=0
  megasas: cannot recover from previous reset failures
  sd 2:2:2:0: megasas: RESET -12716530 cmd=8a retries=0
  megasas: cannot recover from previous reset failures
  sd 2:2:2:0: scsi: Device offlined - not ready after error recovery
  sd 2:2:2:0: scsi: Device offlined - not ready after error recovery

Once this happens, the filesystem is in the state known as "entirely broken":

  sd 2:2:2:0: timing out command, waited 360s
  sd 2:2:2:0: SCSI error: return code = 0x06000000
  end_request: I/O error, dev sdg, sector 34909388801
  sd 2:2:2:0: timing out command, waited 360s
  sd 2:2:2:0: SCSI error: return code = 0x06000000
  end_request: I/O error, dev sdg, sector 34909388808
  sd 2:2:2:0: rejecting I/O to offline device
  sd 2:2:2:0: rejecting I/O to offline device
  sd 2:2:2:0: rejecting I/O to offline device
  Device sdg, XFS metadata write error block 0x820c30001 in sdg

We also have a call trace which indicates some kind of low-level IRQ problem:

  Call Trace: <IRQ>
   [<ffffffff800bae01>] __report_bad_irq+0x30/0x7d
   [<ffffffff800bb034>] note_interrupt+0x1e6/0x227
   [<ffffffff800ba530>] __do_IRQ+0xbd/0x103
   [<ffffffff80012348>] __do_softirq+0x89/0x133
   [<ffffffff8006c9bf>] do_IRQ+0xe7/0xf5
   [<ffffffff8005d615>] ret_from_intr+0x0/0xa
   <EOI>
   [<ffffffff801983e7>] acpi_processor_idle_simple+0x17d/0x30e
   [<ffffffff801983e7>] acpi_processor_idle_simple+0x17d/0x30e
   [<ffffffff801983e7>] acpi_processor_idle_simple+0x17d/0x30e
   [<ffffffff80197b2d>] acpi_safe_halt+0x25/0x36
   [<ffffffff8019834a>] acpi_processor_idle_simple+0xe0/0x30e
   [<ffffffff8006b126>] __exit_idle+0x1c/0x2a
   [<ffffffff8019826a>] acpi_processor_idle_simple+0x0/0x30e
   [<ffffffff800494cc>] cpu_idle+0x95/0xb8
   [<ffffffff803fd7fd>] start_kernel+0x220/0x225
   [<ffffffff803fd22f>] _sinittext+0x22f/0x236
  handlers:
  [<ffffffff880b65ec>] (megasas_isr+0x0/0x45 [megaraid_sas])
  Disabling IRQ #122
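Since the trace ends with the kernel disabling the controller's interrupt line, one quick check once a node is wedged is whether the megasas handler still appears in /proc/interrupts and whether its per-CPU counts are still advancing. This is just a sketch of the idea; IRQ #122 is from our trace and will differ per machine.

```shell
#!/bin/sh
# Sketch: inspect the interrupt line the megaraid_sas handler is
# registered on. After "Disabling IRQ #122", the counters for that
# line stop advancing.

grep -i megasas /proc/interrupts || echo "no megasas IRQ registered"

# Sample again after a short delay; identical counts suggest the line
# was disabled by the spurious-IRQ detector.
sleep 1
grep -i megasas /proc/interrupts || true
```

Comparing the two samples by eye is enough: if the counts for the megasas line are frozen while other lines advance, the driver is no longer receiving interrupts.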
_______________________________________________
Linux-PowerEdge mailing list
[email protected]
https://lists.us.dell.com/mailman/listinfo/linux-poweredge
Please read the FAQ at http://lists.us.dell.com/faq
