On Sun, May 01, 2011 at 14:42:21 +0400, Dmitry Morozovsky wrote:
> On Sat, 30 Apr 2011, Kenneth D. Merry wrote:
>
> KDM> On Fri, Apr 29, 2011 at 11:51:21 +0400, Dmitry Morozovsky wrote:
> KDM> > Dear Ken,
> KDM> >
> KDM> > I have SuperMicro Server with mps driver you managed, with 24 SATA disks under
> KDM> > SAS x36 expander with large ZFS
> KDM> >
> KDM> > Sometimes, under random disk load such as daily find, it lost all its devices:
> KDM> >
> KDM> > [-- MARK -- Fri Apr 29 03:00:00 2011]
> KDM> > mps0: IOC Fault 0x40005900, Resetting
> KDM> > (pass20:mps0:0:22:0): SCSI command timeout on device handle 0x0020 SMID 442
> KDM> > mps0: IOC Fault 0x40001500, Resetting
> KDM> > (da19:mps0:0:21:0): SCSI command timeout on device handle 0x001f SMID 172
> KDM> > (da19:mps0:0:21:0): SCSI command timeout on device handle 0x001f SMID 511
> KDM> > (da20:mps0:0:20:0): SCSI command timeout on device handle 0x001e SMID 240
> KDM> >
> KDM> > ..
> KDM> >
> KDM> > (da4:mps0:0:0:0): SCSI command timeout on device handle 0x000a SMID 844
> KDM> > (da22:mps0:0:23:0): SCSI command timeout on device handle 0x0021 SMID 713
> KDM> > (da18:mps0:0:22:0): SCSI command timeout on device handle 0x0020 SMID 603
> KDM> >
> KDM> > and hangs there forever (in zio state).
> KDM> >
> KDM> > I've prepared debugging kernel with DDB and would be glad to help catch the
> KDM> > situation.
> KDM>
> KDM> Hmm...
> KDM>
> KDM> Can you send full dmesg output?
>
> Attached

Thanks. It looks like you have a SAS2008, with the 4.0 firmware. I think
it would be worthwhile to upgrade to the 9.0 firmware. I know for sure
there are issues with the 2.0 firmware, and I know the 9.0 firmware works
fairly well. I don't know whether the 4.0 firmware has any severe issues,
but it would be good to eliminate firmware bugs before we chase driver
issues.

> KDM> What I'm most interested in is whether
> KDM> there is more kernel output before the IOC Fault that might shed some light
> KDM> on what is going on.
>
> Nope. I use boot_verbose, but none of mps-related debug options yet

Okay. If there's nothing before the IOC fault message, then we really
don't have any clues to what caused the fault...

The rest is just fallout from the IOC fault.

> KDM>
> KDM> Also, what brand (LSI, Maxim, etc.) and speed (3Gb, 6Gb) is the expander on
> KDM> the backplane?
>
> LSI 6G:

Okay.

> KDM> What model LSI controller do you have? How many lanes are connected
> KDM> between the controller and the backplane?
>
> 2x4 IIR. BTW, how can investigate real SASA topology?

So 8 lanes total? That's what I wanted to know.

The primary thing I'm getting at is to see how much lane contention we may
have. With 24 SATA disks, you can only talk to 8 at a time with 8 lanes
connected from the controller to the backplane.

I've run into issues with a lot of contention with SATA drives, but that
was with a 3Gb Maxim expander. In theory things should work better with an
LSI expander. (You would think that they test scenarios like yours.)

> KDM> What model disks do you have in the system? (dmesg will show that
> KDM> obviously.)
>
> 24 x WD RE4 2T

Ok. My SATA testing has been primarily with WD 2TB drives as well.

> KDM> Hopefully we can find some clues to point to the problem.
>
> /me too ;)
>
> Thank you very much!
>
> BTW, I have serial console, DDB kernel, so while this machine is in
> production, but not too heavy, and I can spend some time in kernel debugger if
> needed.

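To put rough numbers on the lane contention point above, here is a minimal
sketch of the arithmetic. It assumes each lane carries a connection to one
disk at a time, as described above; the function names are made up for
illustration, and nothing here queries the controller or the expander.

```python
def max_concurrent_disks(num_disks: int, num_lanes: int) -> int:
    """Upper bound on disks that can be talked to at once.

    Assumption for this sketch: one lane serves one disk at a time.
    """
    if num_disks < 0 or num_lanes <= 0:
        raise ValueError("need num_disks >= 0 and num_lanes > 0")
    return min(num_disks, num_lanes)


def lane_oversubscription(num_disks: int, num_lanes: int) -> float:
    """Average number of disks sharing each controller-to-expander lane."""
    if num_disks < 0 or num_lanes <= 0:
        raise ValueError("need num_disks >= 0 and num_lanes > 0")
    return num_disks / num_lanes


if __name__ == "__main__":
    disks = 24      # 24 SATA disks behind the expander
    lanes = 2 * 4   # two x4 links between controller and backplane
    print("concurrent disks:", max_concurrent_disks(disks, lanes))
    print("disks per lane:  ", lane_oversubscription(disks, lanes))
```

For this setup that works out to 8 disks reachable at once and 3 disks
sharing each lane on average. It says nothing about whether contention is
actually what triggers the IOC fault.
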
Well, I think the first thing to do is upgrade the firmware and see if that
fixes it. If not, we'll start instrumenting things and see how much
information we can get about the cause of the fault.

Ken
--
Kenneth Merry
[email protected]
_______________________________________________
[email protected] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "[email protected]"
