I dug into this a bit. The problem is in the SM BTL init, where each process waits for all of its peers to set seg_inited in shared memory (so that it knows everyone has reached that point). We loop, calling opal_progress() on each iteration, while waiting.
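For anyone not looking at the code, the wait boils down to roughly the pattern below. This is only a sketch, not the actual SM BTL source: the struct, the function names, and the callback type are all made up for illustration, and modeling seg_inited as a simple counter of arrived peers is my assumption. The callback stands in for opal_progress().

```c
#include <stddef.h>

/* Hypothetical stand-in for the shared-memory segment header; the real
 * SM BTL layout is not shown here. seg_inited is modeled as a counter of
 * peers that have reached this point (an assumption). */
typedef struct {
    volatile int seg_inited;
} sketch_seg_t;

/* Hypothetical callback type standing in for opal_progress(). */
typedef void (*sketch_progress_fn)(void *ctx);

/* Spin until all n_peers have checked in, calling progress between checks.
 * Returns the number of progress calls made. seg_inited is only re-read
 * after progress() returns, so a progress call that blocks stalls the
 * whole loop. */
static size_t sketch_wait_for_peers(sketch_seg_t *seg, int n_peers,
                                    sketch_progress_fn progress, void *ctx)
{
    size_t calls = 0;
    while (seg->seg_inited < n_peers) {
        progress(ctx);
        ++calls;
    }
    return calls;
}

/* Illustration helper: pretends one more peer checks in on every
 * progress call. */
static void sketch_peer_arrives(void *ctx)
{
    sketch_seg_t *seg = ctx;
    seg->seg_inited++;
}
```

The point of the sketch is just that the loop can only notice its peers arriving if each progress call comes back.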
The problem is that opal_progress() is not returning (!). It appears that libevent's poll_dispatch() function is somehow getting an infinite timeout: it *looks* like libevent determines that there are no timers active, so it sets an infinite timeout (i.e., blocks) when it calls poll(). Specifically, event.c:1524 calls timeout_next(), which sees that there are no timer events active and resets tv_p to NULL. We then call the underlying fd-checking backend with an infinite timeout. Bonk.

Does anyone more familiar with libevent's internals know why this is happening, or whether this is a change since the old version?

On Oct 25, 2010, at 6:07 PM, Jeff Squyres wrote:

> On Oct 25, 2010, at 3:21 PM, George Bosilca wrote:
>
>> So now we're in good shape, at least for compiling. IB and TCP seem to work,
>> but SM deadlock.
>
> Ugh.
>
> Are you debugging this, or are we? (i.e., me/Ralph)

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/