Chris Worley, on 09/19/2009 01:31 AM wrote:
On Mon, Sep 7, 2009 at 5:58 AM, Vladislav Bolkhovitin <v...@vlnb.net> wrote:
Chris Worley, on 09/06/2009 05:41 PM wrote:
On Sun, Sep 6, 2009 at 3:36 PM, Chris Worley<worl...@gmail.com> wrote:
On Sun, Sep 6, 2009 at 3:17 PM, Bart Van Assche<bart.vanass...@gmail.com> wrote:
On Fri, Sep 4, 2009 at 1:20 AM, Chris Worley <worl...@gmail.com> wrote:
On Thu, Sep 3, 2009 at 11:38 AM, Chris Worley<worl...@gmail.com> wrote:
I've used a couple of initiators (different systems) w/ different
OSes, w/ different IB cards (all QDR) and different IB stacks
(built-in vs. OFED) and can repeat the problem in all but the
RHEL5.2/OFED 1.4.1 target and initiator (but, if the initiator is
WinOF and the target is RHEL5.2/OFED1.4.1, then the problem does
repeat).
Here's a twist: I used the Ubuntu initiator w/ one of the RHEL
targets, and the RHEL initiator (same machine as was running WinOF
from the beginning of this thread) w/ one of the Ubuntu targets: in
both cases, the problem does not repeat.
That makes it sound like OFED is the cure on either side of the
connection, but does not explain the issue w/ WinOF (which does fail
w/ either Ubuntu or RHEL targets).
These results are strange. Regarding the Linux-only tests, I was
assuming failure of a single component (Ubuntu SRP initiator, OFED SRP
initiator, Ubuntu IB driver, OFED IB driver or SRP target), but for
each of these components there is at least one test that passes and at
least one test that fails. So either my assumption is wrong or one of
the above test results is not repeatable. Do you have the time to
repeat the Linux-only tests ?
Last night I was rerunning the RHEL5.2 initiator w/ Ubuntu target, and
the problem repeated; now, I can't repeat the case where it didn't
fail. Still, no errors, other than the eventual timeouts previously
shown; the target thinks all is fine, the initiator is stuck.
... and I haven't had any success w/ Ubuntu target and initiator, 8.10 or
9.04.
1. Try with the kernel parameter maxcpus=1. That should reduce, though not
eliminate, any races you are hitting.
I finally got around to this test... 1 CPU works very well, w/o hangs
(will test all night to see if this holds true), 2 or more don't.
This is dual-socket NHM, so I can't specify more than one processor
w/o getting more than one socket.
Where does 1 CPU work well: on the target or on the initiator? The race is
on the corresponding host.
I'd suggest you reproduce the problem with the latest SCST trunk, with
lockdep enabled on the suspected host (better on both) and the mgmt_minor
trace level enabled on the target. Then, after the hang, leave the system
alone for about half an hour and send Bart and me (privately, compressed)
the kernel logs from both systems, starting from the early boot messages.
If you only have dmesg output, please enable printk timestamps
(CONFIG_PRINTK_TIME).
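
Before starting a long reproduction run, it may help to confirm that the
host really booted with the settings discussed above. What follows is only a
rough Python sketch, not part of SCST or OFED. The function names are made
up for illustration, and it assumes that "lockdep enabled" corresponds to
CONFIG_PROVE_LOCKING=y in the kernel config. It takes the kernel command
line and the kernel config as plain text:

```python
import re


def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def config_enabled(config_text: str, option: str) -> bool:
    """Return True if `option` is built in (=y) in .config-style text."""
    pattern = rf"^{re.escape(option)}=y\s*$"
    return re.search(pattern, config_text, flags=re.MULTILINE) is not None


def debug_readiness(cmdline: str, config_text: str) -> dict:
    """Summarize the settings mentioned in this thread (hypothetical helper)."""
    params = parse_cmdline(cmdline)
    return {
        # Value of maxcpus= on the command line, or None if it is absent.
        "maxcpus": params.get("maxcpus"),
        # Timestamps in kernel log lines (CONFIG_PRINTK_TIME).
        "printk_timestamps": config_enabled(config_text, "CONFIG_PRINTK_TIME"),
        # Assumption: lockdep checking is represented by CONFIG_PROVE_LOCKING.
        "lockdep": config_enabled(config_text, "CONFIG_PROVE_LOCKING"),
    }
```

On a running system the command line can be read from /proc/cmdline. The
config text is often found in /boot/config-<kernel release>, but that
location depends on the distribution.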
Chris
2. Try different hardware, including the motherboard. You may be hitting
something like http://lkml.org/lkml/2007/7/31/558 (not exactly that, of course).
Chris
Chris
Bart.