On Jun 2, 2010, at 12:02 PM, Ashley Pittman wrote:

> > Ah, this is the key.  If I have one process (out of many) fail the 
> > create_cq() function, I get a segv during finalize.  I'll dig.
> 
> Is there an assumption that if process A claims to be able to communicate 
> with process B, then process B can also communicate with process A?  It almost 
> sounds like the code needs to do an allreduce on the bitmask returned by the 
> btls.

Actually, this is exactly the case (I just dug into the code and verified this).

In this case, we're already well beyond the point where we synchronized and 
decided who can connect to whom.  I.e., the modex is already done -- the openib 
BTL in process X has decided that it is available and has advertised its RDMACM 
CPC and OOB CPC contact info.

But then later in process X during the openib BTL add_procs, something fails.  
So the openib clears the connect bits and transparently fails over to TCP.  No 
problem.

The problem is the other peers who think that they can still connect to process 
X via the openib BTL.

1. In this case, the openib BTL was not finalized, so there was a stub still 
there listening on the RDMACM CPC.  When another process tried to connect to 
X's RDMACM CPC port, Bad Things happened (because it was only half set up) and 
we segv'ed.

Obviously, this should be fixed.  "Fixed" in this case probably means closing 
down the RDMACM CPC listening port.  But then that leads to another form of 
Badness.

2. If the openib BTL cleanly shuts down and is *not* still listening on its 
modex-advertised RDMACM CPC contact port, then if some other process tries to 
contact process X via the modex info, it'll fail.  This will then be judged to 
be a fatal error.  Failover in the BML will simply have delayed the job abort 
until someone tries to contact X via the openib BTL.
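
To make the two cases above concrete, here is a toy model of the state after 
the modex.  It is only a sketch and not Open MPI code: the struct 
`proc_state`, its fields, and `count_stale_connects()` are hypothetical names 
invented for illustration.  It counts the peers that would still try to reach 
a process over openib after that process has dropped it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model, not Open MPI code: one entry per process. */
struct proc_state {
    bool advertised_openib; /* what the modex told everyone (already frozen) */
    bool openib_usable;     /* what is actually true after add_procs */
};

/*
 * Count ordered pairs (src, dst) where src still uses openib itself and
 * trusts dst's modex info, but dst has since dropped openib.  Each such
 * pair is a connect attempt that lands on a half-set-up listener (case 1)
 * or on a closed port (case 2).
 */
size_t count_stale_connects(const struct proc_state *procs, size_t nprocs)
{
    size_t stale = 0;

    assert(procs != NULL);
    for (size_t src = 0; src < nprocs; ++src) {
        if (!procs[src].openib_usable) {
            continue; /* src failed over too, so it will not try openib */
        }
        for (size_t dst = 0; dst < nprocs; ++dst) {
            if (dst == src) {
                continue;
            }
            if (procs[dst].advertised_openib && !procs[dst].openib_usable) {
                ++stale;
            }
        }
    }
    return stale;
}
```

In this model, if every process fails the same way the count is zero, because 
nobody tries openib at all.  A nonzero count only shows up when some processes 
fail and others do not.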

I think that the majority of this discussion about the BML failure (or not) 
behavior assumed that *all* processes had the same failure (at least: *I* 
assumed this).  But if only *some* of the processes fail a given BTL add_procs, 
we have a problem because we're beyond the point of deciding who can connect to 
whom.  Shutting down a single BTL module at that point will create an 
inconsistency in the distributed data.

That seems wrong.

What to do?
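
One option, along the lines of the allreduce suggested in the quoted text, 
would be to agree on reachability after add_procs so that a link is used only 
if both ends still report it.  The sketch below simulates that agreement in a 
single process.  It is not Open MPI code: `symmetrize_reachability()`, 
`MAX_PROCS`, and the one-bitmask-per-process layout are assumptions made for 
illustration, and a real job would need a collective exchange to gather the 
masks.  Whether such an exchange is feasible at that point in startup is not 
something this sketch answers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PROCS 32 /* hypothetical limit: one bit per peer in a uint32_t */

/*
 * reach[i] has bit j set if process i's BTL reports that it can reach j.
 * On return, agreed[i] has bit j set only if i reports j AND j reports i,
 * so a process that dropped the BTL is dropped by all of its peers too.
 */
void symmetrize_reachability(const uint32_t *reach, uint32_t *agreed,
                             size_t nprocs)
{
    assert(nprocs <= MAX_PROCS);

    for (size_t i = 0; i < nprocs; ++i) {
        agreed[i] = 0;
        for (size_t j = 0; j < nprocs; ++j) {
            uint32_t i_to_j = (reach[i] >> j) & 1u;
            uint32_t j_to_i = (reach[j] >> i) & 1u;
            if (i_to_j & j_to_i) {
                agreed[i] |= (uint32_t)1 << j;
            }
        }
    }
}
```

With three processes where process 2 failed add_procs and reports an empty 
mask, processes 0 and 1 keep only the link to each other and stop treating 
process 2 as reachable over this BTL.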

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

