I dug into this a bit.  

The problem is in the SM BTL init, where each process waits for all of its 
peers to set seg_inited in shared memory (so that it knows everyone has hit 
that point).  We loop calling opal_progress() while waiting.
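
For anyone who hasn't looked at that code, here is a rough sketch of the 
shape of that wait loop.  This is not the real Open MPI source: struct 
seg_ctl, fake_progress(), and wait_for_peers() are made-up names, and I'm 
assuming seg_inited behaves like a counter that each peer bumps once.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for the shared-memory control segment. */
struct seg_ctl {
    atomic_int seg_inited;   /* assumed: count of peers that reached init */
};

/* Hypothetical stand-in for opal_progress().  Here it simply simulates one
 * more peer checking in.  If the real call blocks instead of returning,
 * the loop below never gets to re-read seg_inited. */
static void fake_progress(struct seg_ctl *ctl)
{
    atomic_fetch_add(&ctl->seg_inited, 1);
}

/* Spin until n_peers have checked in; returns how many times the
 * progress function was called. */
static int wait_for_peers(struct seg_ctl *ctl, int n_peers)
{
    int calls = 0;

    while (atomic_load(&ctl->seg_inited) < n_peers) {
        fake_progress(ctl);
        calls++;
    }
    return calls;
}
```

The point is only that the loop depends on the progress call returning 
promptly every time around.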

The problem is that opal_progress() is not returning (!).

It appears that libevent's poll_dispatch() function is somehow getting an 
infinite timeout -- it *looks* like libevent is determining that there are no 
timers active, so it decides to set an infinite timeout (i.e., block) when it 
calls poll().  Specifically, event.c:1524 calls timeout_next(), which sees that 
there are no timer events active and resets tv_p to NULL.  We then call the 
underlying fd-checking backend with an infinite timeout.  
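
As a sketch of what I think is going on (illustrative only: 
next_timeout() and poll_timeout_ms() are hypothetical helpers, not 
libevent's actual functions), the "no timers" case turns into a NULL 
timeout pointer, and a NULL timeout becomes -1 for poll(), which means 
"block until an fd is ready":

```c
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical stand-in for the timer check: when no timer events are
 * active, the caller's timeout pointer is reset to NULL. */
static void next_timeout(int n_active_timers, struct timeval **tv_p)
{
    if (n_active_timers == 0) {
        *tv_p = NULL;
    }
}

/* Hypothetical helper: convert an optional timeval into a poll() timeout
 * in milliseconds (rounding microseconds up).  NULL maps to -1, which
 * poll() treats as "wait indefinitely". */
static int poll_timeout_ms(const struct timeval *tv_p)
{
    if (tv_p == NULL) {
        return -1;
    }
    return (int)(tv_p->tv_sec * 1000 + (tv_p->tv_usec + 999) / 1000);
}
```

So if nothing ever becomes readable on the fds being polled, a caller that 
expected a quick return from the progress engine would sit in poll() 
forever.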

Bonk.

Does anyone more familiar with libevent's internals know why this is 
happening, or whether this behavior has changed since the old version?



On Oct 25, 2010, at 6:07 PM, Jeff Squyres wrote:

> On Oct 25, 2010, at 3:21 PM, George Bosilca wrote:
> 
>> So now we're in good shape, at least for compiling. IB and TCP seem to work, 
>> but SM deadlock.
> 
> Ugh.
> 
> Are you debugging this, or are we? (i.e., me/Ralph)
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

