Hi Jean-Francois,

On Tue, 2008-04-29 at 10:17 +0200, Jean-Francois.Neyroud wrote:
> If I attempt to query at the same time the performance counters on all
> nodes of a cluster (40 nodes),
> perfquery causes the kernel to be stuck in the ib_unregister_mad_agent() function.
>
> Impossible to send CTRL-C or CTRL-Z to perfquery, it is stuck in the kernel.
> # pgrep perfquery
> 27578
> # cat /proc/27578/wchan
> ib_unregister_mad_agent
>
> I have this problem with OFED-1.2.5 or 1.3 and with mthca or ConnectX,
> not tested with other HCAs and OFED versions.
>
> Reproducer with 2 nodes and without switch:
>
> # for i in `seq 1 100`; do perfquery >/dev/null 2>&1 & done
>
> # pgrep perfquery | while read pid; do echo "$pid: `cat /proc/$pid/wchan`";
> echo; done | dshbak -c
> ----------------
> [14936,14938-15029]
> ----------------
> 0
> ----------------
>
> ----------------
> ----------------
> 14937
> ----------------
> flush_cpu_workqueue
>
>
> Does anyone know this problem ?

This could be related to the lock dependency issue discussed in the
following thread:
http://lists.openfabrics.org/pipermail/general/2008-January/044723.html

You might want to look at the following for the actual fix:

commit 2fe7e6f7c9f55eac24c5b3cdf56af29ab9b0ca81
Author: Roland Dreier <[EMAIL PROTECTED]>
Date:   Fri Jan 25 14:15:42 2008 -0800

    IB/umad: Simplify and fix locking

    In addition to being overly complex, the locking in user_mad.c is
    broken: there were multiple reports of deadlocks and lockdep
    warnings.  In particular it seems that a single thread may end up
    trying to take the same rwsem for reading more than once, which is
    explicitly forbidden in the comments in <linux/rwsem.h>.

    To solve this, we change the locking to use plain mutexes instead of
    rwsems.  There is one mutex per open file, which protects the
    contents of the struct ib_umad_file, including the array of agents
    and list of queued packets; and there is one mutex per struct
    ib_umad_port, which protects the contents, including the list of
    open files.  We never hold the file mutex across calls to functions
    like ib_unregister_mad_agent(), which can call back into other
    ib_umad code to queue a packet, and we always hold the port mutex as
    long as we need to make sure that a device is not hot-unplugged from
    under us.

    This even makes things nicer for users of the -rt patch, since we
    remove calls to downgrade_write() (which is not implemented in -rt).

    Signed-off-by: Roland Dreier <[EMAIL PROTECTED]>

I don't think this change was incorporated into either OFED 1.2.5 or
1.3.

-- Hal

>
> Jean-Francois.
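
For illustration only, the rule from the commit message ("never hold the
file mutex across calls to functions like ib_unregister_mad_agent()") can
be sketched in userspace C with pthreads. This is not the code from
user_mad.c: struct umad_file_sketch, queue_packet(), unregister_agent()
and unreg_slot() are hypothetical stand-ins, and a pthread mutex stands in
for the kernel mutex. The point is only the ordering: take the agent out
of the table under the mutex, drop the mutex, and only then call the
function that may call back into code needing the same mutex.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_AGENTS 4                 /* hypothetical table size */

struct agent { int id; };            /* stand-in for a MAD agent */

struct umad_file_sketch {
    pthread_mutex_t mutex;           /* protects agents[] and queued */
    struct agent *agents[MAX_AGENTS];
    int queued;                      /* stand-in for the queued-packet list */
};

/* Callback path: queues a packet, so it must take the file mutex. */
static void queue_packet(struct umad_file_sketch *f)
{
    pthread_mutex_lock(&f->mutex);
    f->queued++;
    pthread_mutex_unlock(&f->mutex);
}

/* Stand-in for ib_unregister_mad_agent(): may call back to queue a packet. */
static void unregister_agent(struct umad_file_sketch *f, struct agent *a)
{
    (void)a;
    queue_packet(f);
}

/* Returns 0 on success, -1 if the slot is invalid or already empty. */
static int unreg_slot(struct umad_file_sketch *f, int slot)
{
    struct agent *a;

    assert(f != NULL);
    if (slot < 0 || slot >= MAX_AGENTS)
        return -1;

    /* Detach the agent from the table while holding the mutex... */
    pthread_mutex_lock(&f->mutex);
    a = f->agents[slot];
    f->agents[slot] = NULL;
    pthread_mutex_unlock(&f->mutex);

    if (a == NULL)
        return -1;

    /* ...and unregister only after the mutex has been dropped, so the
     * callback in unregister_agent() can take it without deadlocking. */
    unregister_agent(f, a);
    return 0;
}
```

If unregister_agent() were called with f->mutex still held, the default
(non-recursive) pthread mutex would block forever in queue_packet(), which
is the same shape of hang as a task stuck in ib_unregister_mad_agent().
The sketch should build as a translation unit with something like
"cc -pthread -c".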
_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
