On Jul 14, 2008, at 1:17 PM, Sean Hefty wrote:
>> I talked to Sean Hefty about it, but we never figured out a definitive
>> cause or solution. My best guess is that there is something wonky
>> about multiple processes simultaneously interacting with the IBCM
>> kernel driver from userspace; but I don't know jack about kernel
>> stuff, so that's a total SWAG.
>
> The only reason I can think of why ib_cm_listen() fails is if there's a
> conflict with the service_id and/or service_mask from multiple threads.
> What does OMPI pass in for these parameters?
The service ID that it uses is its PID and the mask is always 0.
There will only be one call to ib_cm_listen() per device per MPI
process.
Open MPI certainly could be buggy with IBCM, of course -- but it's
fishy that the same exact "mpirun ..." command line works one time and
fails the next (it's kinda random when the problem occurs).
--
Jeff Squyres
Cisco Systems