We first became aware of this problem about a month ago. A database server was up but completely unresponsive to anything other than pings. I power cycled it via the DRAC and, after we couldn't find anything suspicious in the logs, we figured it was a fluke. Until the next day, when its twin did exactly the same thing. This time, I was able to get a screenshot through the DRAC console. Using old daily outputs and that screenshot, we correlated the crashes with patrol reads. Since then, we've only seen it "in the wild" on one other machine, a 1950, but I've been trying to chase the problem down without much luck. I'm fortunate to have three machines at my disposal for this testing, so I was able to try a variety of combinations:

Server 1:
Chassis:          2950 v1
System BIOS:      1.1.0
PERC firmware:    1.00.01-0088 PERC F/W (from the 5.0.1-0030 A00 package)
OS:               6.2-R_p7, 6-STABLE

Server 2:
Chassis:          2950 v1
System BIOS:      1.1.0
PERC firmware:    1.03.10-0216 PERC F/W (from the 5.1.1-0040 package)
OS:               6.2-R_p7, 6-STABLE

Server 3:
Chassis:          2950 v2
System BIOS:      1.5.1
PERC firmware:    1.03.10-0216 PERC F/W (from the 5.1.1-0040 package)
OS:               6.2-R_p7

They're all running amd64, and each combination was tried with and without the linux_mfi.ko patches found in PR-113232. For disks, they all have 2x36GB in RAID1 and 4x73GB in RAID10 (all SAS). We use linux_mfi.ko + linux-megacli for management.

The original problem occurred during automatic patrol reads coupled with heavy disk load. I've changed the delay interval for the automatic patrol reads and tried to reproduce it that way, but haven't had enough success to make it useful for troubleshooting. Since the automatic reads are meant to be as unobtrusive as possible, I've been running a manual patrol read instead (megacli -AdpPR -Start -a0), which triggers a crash regardless of what the I/O load is like. The behavior has little to no variation: shortly after the read is started, disk writes cease entirely (observed via an scp from another machine). After about a minute, the console begins to fill up with lines such as:

mfi0: COMMAND 0xffffffff892bc998 TIMEOUT AFTER 45 SECONDS

The first 8 hex digits of the command address never change. I bring that up because I suspect the problem has something to do with the enclosure, which is attached at 8, 255, or fffffff, depending on where you're looking.
I've let the timeout counter run up to 6000 seconds, but it eventually ends in a kernel panic.
The panic just seems to be a side effect of the original problem (processes with nowhere to write data), so I'm not too hung up on it.
There's never anything pertaining to the timeouts in the controller's event log.
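
For anyone comparing their own console output against this, here is a minimal Python sketch that summarizes the timeout lines from a saved console capture. It assumes the lines follow the format shown above; the function names and the second command address in the usage note are made up for illustration and are not from our logs.

```python
import re
from collections import defaultdict

# Matches console lines of the form shown above, e.g.
#   mfi0: COMMAND 0xffffffff892bc998 TIMEOUT AFTER 45 SECONDS
# The device name is matched loosely so retyped logs still parse.
TIMEOUT_RE = re.compile(
    r"^(?P<dev>\w+):\s+COMMAND\s+(?P<addr>0x[0-9a-fA-F]+)"
    r"\s+TIMEOUT AFTER\s+(?P<secs>\d+)\s+SECONDS"
)


def summarize_timeouts(lines):
    """Return {command address: longest timeout seen, in seconds}."""
    worst = defaultdict(int)
    for line in lines:
        m = TIMEOUT_RE.match(line.strip())
        if m:
            addr = m.group("addr").lower()
            worst[addr] = max(worst[addr], int(m.group("secs")))
    return dict(worst)


def common_prefix(addresses):
    """Longest run of leading hex digits shared by all addresses (no '0x')."""
    digits = [a[2:] for a in addresses]
    if not digits:
        return ""
    prefix = digits[0]
    for d in digits[1:]:
        # Trim the candidate prefix until this address starts with it.
        while not d.startswith(prefix):
            prefix = prefix[:-1]
    return prefix
```

Feeding it the line above plus a hypothetical second command at 0xffffffff892bd2a0 would report one entry per stuck command and a shared prefix of ffffffff892b, which makes it easy to see whether the same commands are timing out on every run.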

Besides the platform version differences I mentioned above, I've tried:
- Reducing the patrol read rate
- Pulling down and modifying the patches from PR-115133 (which seems to set an upper boundary at 0xffffffff)
- Invoking megacli with -a0 and -aALL interchangeably
- Changing the cache flush interval
- Disabling disk coercion
- A bunch of other long-shot settings from megacli that aren't worth listing

Nothing has shown any appreciable difference in the behavior.

Does anyone have an idea about what could be going on, or anything else we can try? For now, I'll probably just disable patrol reads and set them to auto with a 1 hour delay during outage windows only, but I'm hoping that someone is able to help with this. At the very least, maybe I can save someone a whole bunch of time.
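
The workaround could be scripted roughly as follows. This is only a sketch: the only megacli invocation taken from this post is -AdpPR -Start -a0. The -Dsbl, -EnblAuto and -SetDelay spellings, and the assumption that the delay is given in hours, are from my recollection of the megacli help output, so check them against your installed version before relying on this. The function names are made up.

```python
import subprocess

MEGACLI = "megacli"  # binary name as used in this post; adjust for your install


def patrol_read_cmds(mode, adapter=0, delay_hours=None):
    """Build the argv lists needed to switch the patrol read mode.

    mode is "disable" (normal operation) or "auto" (outage windows).
    Option spellings are assumptions; verify with your megacli's help.
    """
    target = f"-a{adapter}"
    if mode == "disable":
        return [[MEGACLI, "-AdpPR", "-Dsbl", target]]
    if mode == "auto":
        cmds = [[MEGACLI, "-AdpPR", "-EnblAuto", target]]
        if delay_hours is not None:
            # Delay between patrol reads, assumed to be in hours.
            cmds.append(
                [MEGACLI, "-AdpPR", "-SetDelay", str(delay_hours), target]
            )
        return cmds
    raise ValueError(f"unknown patrol read mode: {mode!r}")


def apply(cmds, dry_run=True):
    """Print the commands, or run them when dry_run is False."""
    for argv in cmds:
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)
```

Calling apply(patrol_read_cmds("auto", delay_hours=1)) at the start of an outage window and apply(patrol_read_cmds("disable")) at the end only prints the commands; pass dry_run=False to actually run them.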
Thanks in advance for any help.

--
Sean McAfee
Collaborative Fusion, Inc.
 [EMAIL PROTECTED]
 412-422-3463 x 4025

1710 Murray Avenue, Suite 320
Pittsburgh, PA 15217

_______________________________________________
[email protected] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hardware
To unsubscribe, send any mail to "[EMAIL PROTECTED]"
