PERC5 (LSI MegaSAS) Patrol Read crashes

Sean McAfee Fri, 28 Sep 2007 14:23:38 -0700

We first became aware of this problem about a month ago. A database server was up but was completely unresponsive to anything other than pings. I power cycled it via the DRAC and, after we couldn't find anything suspicious in the logs, we figured it was a fluke.

Until the next day, when its twin did the exact same thing. This time, I was able to get a screenshot through the DRAC console. Using old daily outputs and that screenshot, we correlated the crashes to patrol reads. Since then, we've only seen it "in the wild" on one other machine, a 1950, but I've been trying to chase the problem down without much luck.

I'm fortunate to have three machines at my disposal for this testing, so I was able to try a variety of combinations:


Server 1:
Chassis:          2950 v1
System BIOS:      1.1.0
PERC firmware:    1.00.01-0088 PERC F/W (from the 5.0.1-0030 A00 package)
OS:               6.2-R_p7, 6-STABLE

Server 2:
Chassis:          2950 v1
System BIOS:      1.1.0
PERC firmware:    1.03.10-0216 PERC F/W (from the 5.1.1-0040 package)
OS:               6.2-R_p7, 6-STABLE

Server 3:
Chassis:          2950 v2
System BIOS:      1.5.1
PERC firmware:    1.03.10-0216 PERC F/W (from the 5.1.1-0040 package)
OS:               6.2-R_p7

They're all running amd64, and each combination was tried with and without the linux_mfi.ko patches found in PR-113232. For disks, they all have 2x36GB RAID1 and 4x73GB RAID10 (all SAS). We use linux_mfi.ko + linux-megacli for management.

The original problem occurred during automatic patrol reads coupled with heavy disk load. I've changed the delay interval for the automatic patrol reads and tried to reproduce it, but haven't had enough success to make it useful for troubleshooting. Since the automatic reads are meant to be as unobtrusive as possible, I've been running a manual patrol read (megacli -AdpPR -Start -a0), which triggers a crash regardless of what the I/O load is like.

The behavior has little to no variation: shortly after the read is started, disk writes cease (shown via an scp from another machine). After a minute, the console begins to fill up with lines such as:


mfi0: COMMAND 0xffffffff892bc998 TIMEOUT AFTER 45 SECONDS

The first eight hex digits never change. I bring that up because I suspect the problem has something to do with the enclosure, which is attached at 8, 255, or fffffff, depending on where you're looking.
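
For anyone wanting to check the same observation against their own console output, here is a minimal sketch in Python. It assumes the timeout lines have exactly the format quoted above; the function names are hypothetical, and the second sample line is made up purely for illustration.

```python
import re
from collections import Counter

# Matches console lines of the form quoted above; the device name is
# matched generically rather than hard-coded.
TIMEOUT_RE = re.compile(
    r"^(?P<dev>\w+): COMMAND 0x(?P<addr>[0-9a-fA-F]+) "
    r"TIMEOUT AFTER (?P<secs>\d+) SECONDS"
)


def parse_timeouts(lines):
    """Yield (device, address, seconds) for each matching line."""
    for line in lines:
        m = TIMEOUT_RE.match(line.strip())
        if m:
            yield m["dev"], int(m["addr"], 16), int(m["secs"])


def summarize(lines):
    """Count upper-32-bit address prefixes and the longest timeout per command."""
    prefixes = Counter()
    longest = {}
    for _dev, addr, secs in parse_timeouts(lines):
        prefixes[addr >> 32] += 1  # the first eight hex digits
        longest[addr] = max(secs, longest.get(addr, 0))
    return prefixes, longest


if __name__ == "__main__":
    sample = [
        "mfi0: COMMAND 0xffffffff892bc998 TIMEOUT AFTER 45 SECONDS",
        # Hypothetical follow-up line, not taken from a real log:
        "mfi0: COMMAND 0xffffffff892bc998 TIMEOUT AFTER 75 SECONDS",
    ]
    prefixes, longest = summarize(sample)
    for prefix, count in prefixes.items():
        print(f"prefix {prefix:08x}: {count} line(s)")
    for addr, secs in longest.items():
        print(f"command {addr:#x}: longest timeout {secs}s")
```

If every line lands under a single prefix, that matches what is described above. It does not by itself say anything about the cause.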

I've let it go up to 6000 seconds, but it eventually ends in a kernel panic.

That just seems to be a side effect of the original problem (processes with nowhere to write data), so I'm not too hung up on that.
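
The write stall itself was observed via scp, but it could also be timed locally. The following is only a sketch of one way to do that, not something used in the testing described here: a small Python heartbeat that appends and fsyncs a record at a fixed interval, plus a helper that reports gaps between completed writes. The function names and the file path are hypothetical.

```python
import os
import time


def find_stalls(timestamps, threshold):
    """Return (previous, current) pairs whose gap exceeds threshold seconds."""
    return [
        (prev, cur)
        for prev, cur in zip(timestamps, timestamps[1:])
        if cur - prev > threshold
    ]


def heartbeat(path, count, interval=1.0):
    """Append and fsync one small record per interval; return completion times."""
    done = []
    with open(path, "ab") as f:
        for i in range(count):
            f.write(b"%d\n" % i)
            f.flush()
            os.fsync(f.fileno())  # force the write through to the device
            stamp = time.monotonic()
            done.append(stamp)
            print(f"write {i} completed at {stamp:.1f}", flush=True)
            time.sleep(interval)
    return done


if __name__ == "__main__":
    # Hypothetical path; point it at a filesystem on the affected volume.
    times = heartbeat("/var/tmp/pr_heartbeat.log", count=600, interval=1.0)
    for prev, cur in find_stalls(times, threshold=5.0):
        print(f"stall of {cur - prev:.1f}s after t={prev:.1f}")
```

If a write never returns, the script blocks in fsync and the last printed line marks roughly when writes stopped.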

There's never anything pertaining to it in the controller's event log.

Besides the platform version differences I mentioned above, I've tried:
- Reducing the patrol read rate
- Pulling down and modifying the patches from PR-115133 (which seems to set an upper boundary at 0xffffffff)
- Invoking a0/aALL interchangeably
- Changing the cache flush interval
- Disabling disk coercion
- A bunch of other long-shot settings from megacli that aren't worth listing


Nothing has shown any appreciable difference in the behavior.

Does anyone have an idea about what could be going on or anything else we can try? For now, I'll probably just disable patrol reads and set them to auto/1 hour delay during outage windows only, but I'm hoping that someone is able to help with this. At the very least, maybe I can save someone a whole bunch of time.

Thanks in advance for any help.

--
Sean McAfee
Collaborative Fusion, Inc.
 [EMAIL PROTECTED]
 412-422-3463 x 4025

1710 Murray Avenue, Suite 320
Pittsburgh, PA 15217
