It is an SDP bug.
In the test that found the problem, the "rmmod mlx4_ib" command would
hang in an uninterruptible sleep in cma_remove_one(). Any attempt to
unload ib_sdp would hang as well. The original V1 post of this patch
was very light on detail, so I took the V2 opportunity to explain.
Almost makes up for the stupid mistake in the first patch...
To reproduce the bug and verify the fix, use 3 nodes (one just for the
SM, to keep things simple) and run steps 0-7 below in numeric order.
The steps are listed grouped by the node they run on:
node0: (MLX4)
0) opensm started
node1: (MLX4) [With ib_sdp loaded and LD_PRELOAD setup]
1) netserver
3) /sbin/rmmod mlx4_ib && /sbin/modprobe mlx4_ib (in parallel to 2)
*** HANGS before fix; Works after ***
6) killall netserver
7) modprobe -r ib_sdp
*** HANGS before fix; works after ***
node2: (MLX4) [With ib_sdp loaded and LD_PRELOAD setup]
2) netperf -C -c -P 0 -t TCP_STREAM -H green_ib -l 120 -- -m 1000000
4) after failure ^C or just wait for netperf to end on its own with
"netperf: cannot shutdown tcp stream socket: Transport endpoint
is not connected"
5) /etc/init.d/openibd stop
*** WORKS before and after fix ***
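
For anyone who wants to script it, the same sequence sorted into numeric
order might look roughly like the sketch below. It is not the script used
for the original test. The host names node0/node1/node2, passwordless root
ssh, the run_on helper, the DRY_RUN switch, and the 5 second delay are all
assumptions made for illustration. DRY_RUN defaults to 1, so by default it
only prints what it would run.

```shell
#!/bin/sh
# Sketch of the reproduction sequence in numeric step order.
# Assumptions (not from the original setup): node0/node1/node2 resolve and
# accept passwordless root ssh; ib_sdp and the SDP LD_PRELOAD are already
# set up on node1 and node2; opensm is already running on node0 (step 0).
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = really run them

# run_on HOST COMMAND: hypothetical helper that prints the command and,
# when DRY_RUN=0, executes it on HOST over ssh.
run_on() {
    host="$1"; shift
    echo "[$host] $*"
    if [ "$DRY_RUN" = "0" ]; then
        ssh "$host" "$*"
    fi
}

repro_steps() {
    # Step 1: start the netperf server on node1.
    run_on node1 "netserver"

    # Step 2: start the stream from node2 in the background so that
    # step 3 can run in parallel with it.
    run_on node2 "netperf -C -c -P 0 -t TCP_STREAM -H green_ib -l 120 -- -m 1000000" &
    netperf_pid=$!

    # Give the stream a moment to get going (arbitrary delay, real runs only).
    if [ "$DRY_RUN" = "0" ]; then
        sleep 5
    fi

    # Step 3: remove and reload the HCA driver under load.
    # Before the fix this is where the first hang shows up.
    run_on node1 "/sbin/rmmod mlx4_ib && /sbin/modprobe mlx4_ib"

    # Step 4: wait for netperf to end on its own; it is expected to fail.
    wait "$netperf_pid" || true

    # Step 5: stop the IB stack on node2.
    run_on node2 "/etc/init.d/openibd stop"

    # Steps 6 and 7: stop netserver, then unload ib_sdp on node1.
    # Before the fix, step 7 is the second hang.
    run_on node1 "killall netserver"
    run_on node1 "modprobe -r ib_sdp"
}

repro_steps
```

Run it with DRY_RUN=0 to execute for real. On a kernel without the fix,
expect the script to block at step 3, since the remote rmmod never returns.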
Thanks,
Jim
Jim Mott
Mellanox Technologies Ltd.
mail: [EMAIL PROTECTED]
Phone: 512-294-5481
-----Original Message-----
From: Roland Dreier [mailto:[EMAIL PROTECTED]
Sent: Tuesday, November 06, 2007 4:56 PM
To: Jim Mott
Cc: [EMAIL PROTECTED]
Subject: Re: [ofa-general] [PATCH 1/1 V2] SDP - Fix reference count bug
that prevents mlx4_ib and ib_sdp unload
What does this have to do with mlx4? It seems it is just a bug in SDP
related to hot-removing any device, right?
- R.