Hi Dell Linux Folks,

Our site recently purchased a number of Dell R710s with 2 Perc6/E controllers, 
and 3 MD1000s SAS-connected to each controller, for 6 MD1000s per R710.

We are running Scientific Linux 5.3, which is very close to RHEL5.

We have recently seen two critical failures on different nodes, in which the 
megasas driver and Perc6/E controller became entirely unresponsive and all 
volumes associated with the controller went offline.  I'm curious whether 
anyone on this list has seen failures of this variety before, and has any 
suggestions for how to fix them.  So far we've been able to reproduce the 
problem by applying a large number of writes over the network into this 
storage.  The only way to recover from this state is a power cycle, as the 
kernel won't even shut down cleanly once it reaches this state.
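For reference, the write load we use to reproduce this is just sustained large 
sequential writes.  A minimal sketch is below; the target path and sizes are 
hypothetical placeholders (our real runs write far more data, over the network, 
onto the Perc6/E-backed volumes):

```shell
#!/bin/sh
# Hypothetical reproduction sketch: sustained sequential writes.
# In our real tests the target is an XFS volume on the Perc6/E storage
# and the write volume is much larger; the small size here is illustrative.
TARGET="${1:-/tmp/megasas-writetest.bin}"
dd if=/dev/zero of="$TARGET" bs=1M count=8 conv=fsync 2>/dev/null
size=$(stat -c %s "$TARGET")
echo "write pass complete: $size bytes"
rm -f "$TARGET"
```

In practice we run many such passes in parallel across the virtual disks until 
the controller wedges.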


A more complete description of the problem is below.


At some point during write load, the RAID controller or enclosure falls into a 
state where it fails to communicate correctly with the OS.  This causes at 
least one virtual disk device (e.g. /dev/sde), and usually all other virtual 
disk devices on the controller, to fail.  The failure is complete: no data can 
be read or written, and an ls -l shows a 'no available memory' error.

Once we fall into this state, there is no solution available except to reboot 
the whole server.  Once the server reboots, the stack is reset and the machine 
works as expected.

Looking into the details of the driver and firmware version stack I find the 
following:

We are running the latest firmware on these MD1000s: A.04
We are running the latest firmware on the Perc6/E: 6.2.0-0013
We are running a slightly older megasas driver than Dell recommends on their 
site: 00.00.04.08-RH2.  Dell suggests 00.00.04.17.  We have upgraded 2 nodes 
to the newer driver version and are testing with that configuration as well.

So far we have seen 2 of these failures, so more testing will be needed before 
we have a good estimate of how often this problem can be expected to occur.

Anyone else have any ideas or questions?


Details of the symptoms of the failure on uct2-s8

The issue starts with this message:
sd 2:2:2:0: megasas: RESET -12716530 cmd=8a retries=0
Basically, the driver is trying to send a RESET across the SAS channel.  This 
is a relatively common activity for SAS when dealing with a physical device 
timeout, but in this case it is failing.  Once we see this message:
megasas: failed to do reset
we know the reset has failed and that we have larger issues:
sd 2:2:2:0: megasas: RESET -12716530 cmd=8a retries=0
megasas: cannot recover from previous reset failures
sd 2:2:2:0: megasas: RESET -12716530 cmd=8a retries=0
megasas: cannot recover from previous reset failures
sd 2:2:2:0: scsi: Device offlined - not ready after error recovery
sd 2:2:2:0: scsi: Device offlined - not ready after error recovery

Once this happens, the filesystem is in a state best described as "entirely broken":
sd 2:2:2:0: timing out command, waited 360s
sd 2:2:2:0: SCSI error: return code = 0x06000000
end_request: I/O error, dev sdg, sector 34909388801
sd 2:2:2:0: timing out command, waited 360s
sd 2:2:2:0: SCSI error: return code = 0x06000000
end_request: I/O error, dev sdg, sector 34909388808
sd 2:2:2:0: rejecting I/O to offline device
sd 2:2:2:0: rejecting I/O to offline device
sd 2:2:2:0: rejecting I/O to offline device
Device sdg, XFS metadata write error block 0x820c30001 in sdg
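If it helps anyone watch for this, the failure has a clear signature in the 
kernel log.  A minimal detection sketch follows; the sample lines are embedded 
for illustration, and on a real host you would point the grep at dmesg output 
or /var/log/messages instead:

```shell
#!/bin/sh
# Sketch: detect the megasas reset-failure signature in kernel log text.
# Sample lines embedded for illustration; grep a real kernel log in practice.
sample='sd 2:2:2:0: megasas: RESET -12716530 cmd=8a retries=0
megasas: failed to do reset
megasas: cannot recover from previous reset failures'
if printf '%s\n' "$sample" | grep -q 'megasas: failed to do reset'; then
    status="reset failure signature present"
else
    status="no reset failure seen"
fi
echo "$status"
```

Catching the first "failed to do reset" line early at least lets us drain the 
node before the volumes go fully offline.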

We have a call trace which indicates some kind of low-level IRQ problem:

Call Trace:
<IRQ>  [<ffffffff800bae01>] __report_bad_irq+0x30/0x7d
[<ffffffff800bb034>] note_interrupt+0x1e6/0x227
[<ffffffff800ba530>] __do_IRQ+0xbd/0x103
[<ffffffff80012348>] __do_softirq+0x89/0x133
[<ffffffff8006c9bf>] do_IRQ+0xe7/0xf5
[<ffffffff8005d615>] ret_from_intr+0x0/0xa
<EOI>  [<ffffffff801983e7>] acpi_processor_idle_simple+0x17d/0x30e
[<ffffffff801983e7>] acpi_processor_idle_simple+0x17d/0x30e
[<ffffffff801983e7>] acpi_processor_idle_simple+0x17d/0x30e
[<ffffffff80197b2d>] acpi_safe_halt+0x25/0x36
[<ffffffff8019834a>] acpi_processor_idle_simple+0xe0/0x30e
[<ffffffff8006b126>] __exit_idle+0x1c/0x2a
[<ffffffff8019826a>] acpi_processor_idle_simple+0x0/0x30e
[<ffffffff800494cc>] cpu_idle+0x95/0xb8
[<ffffffff803fd7fd>] start_kernel+0x220/0x225
[<ffffffff803fd22f>] _sinittext+0x22f/0x236

handlers:
[<ffffffff880b65ec>] (megasas_isr+0x0/0x45 [megaraid_sas])
Disabling IRQ #122
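One thing worth checking after the "Disabling IRQ" message is whether the 
controller's interrupt line has gone quiet.  A sketch (output is entirely 
host-specific, and on a box without a Perc the grep simply finds nothing):

```shell
#!/bin/sh
# Sketch: find the megasas interrupt line in /proc/interrupts.  Once the
# kernel disables the IRQ, its per-CPU counts stop incrementing.
line="$(grep -i megasas /proc/interrupts 2>/dev/null || true)"
if [ -n "$line" ]; then
    echo "megasas IRQ line: $line"
else
    echo "no megasas interrupt line on this host"
fi
```

Sampling that line twice a few seconds apart, under load, shows whether the 
controller is still raising interrupts at all.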

_______________________________________________
Linux-PowerEdge mailing list
[email protected]
https://lists.us.dell.com/mailman/listinfo/linux-poweredge
Please read the FAQ at http://lists.us.dell.com/faq
