On Jun 2, 2010, at 12:18 , Jeff Squyres wrote:

> On Jun 2, 2010, at 12:02 PM, Ashley Pittman wrote:
> 
>>> Ah, this is the key.  If I have one process (out of many) fail the 
>>> create_cq() function, I get a segv during finalize.  I'll dig.
>> 
>> Is there an assumption that if process A claims to be able to communicate 
>> with process B, then process B can also communicate with process A?  It 
>> almost sounds like the code needs to do an allreduce on the bitmask returned 
>> by the btls.
> 
> Actually, this is exactly the case (I just dug into the code and verified 
> this).
> 
> In this case, we're already well beyond the point where we synchronized and 
> decided who can connect to whom.  I.e., the modex is already done -- the 
> openib BTL in process X has decided that it is available and has advertised 
> its RDMACM CPC and OOB CPC contact info.
> 
> But then later in process X during the openib BTL add_procs, something fails. 
>  So the openib clears the connect bits and transparently fails over to TCP.  
> No problem.
> 
> The problem is the other peers who think that they can still connect to 
> process X via the openib BTL.
> 
> 1. In this case, the openib BTL was not finalized, so there was a stub still 
> there listening on the RDMACM CPC.  When another process tried to connect to 
> X's RDMACM CPC port, Bad Things happened (because it was only half set up) and 
> we segv'ed.
> 
> Obviously, this should be fixed.  "Fixed" in this case probably means closing 
> down the RDMACM CPC listening port.  But then that leads to another form of 
> Badness.

I wonder how this is possible. If process X fails to connect to Y, how can Y 
succeed in connecting to X? Please enlighten me ...

> 
> 2. If the openib BTL cleanly shuts down and is *not* still listening on its 
> modex-advertised RDMACM CPC contact port, then if some other process tries to 
> contact process X via the modex info, it'll fail.  This will then be judged 
> to be a fatal error.  Failover in the BML will simply have delayed the job 
> abort until someone tries to contact X via the openib BTL.

Isn't there any kind of timeout mechanism in the RDMACM CPC? If there is one 
and the connection fails, then the PML will automatically try to use the next 
available BTL, so it will eventually fail over to TCP (if available).
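
To make the failover idea concrete, here is a minimal sketch in C. It is not 
Open MPI code: struct transport, connect_with_failover() and the two fake_* 
functions are hypothetical names invented for illustration, and a "timeout" 
is modeled simply as a nonzero return value.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical connect callback: returns 0 on success,
 * nonzero on failure or timeout. Not an Open MPI type. */
typedef int (*connect_fn)(int peer);

/* Hypothetical transport descriptor, ordered by priority in an array. */
struct transport {
    const char *name;
    connect_fn  connect;
};

/* Try each transport in priority order. Return the index of the first
 * one that connects to `peer`, or -1 if every transport fails. */
int connect_with_failover(const struct transport *t, size_t n, int peer)
{
    assert(t != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (t[i].connect(peer) == 0)
            return (int)i;
    }
    return -1;
}

/* Stand-ins for real connection attempts (assumptions for the sketch). */
int fake_openib_connect(int peer) { (void)peer; return 1; } /* simulated timeout */
int fake_tcp_connect(int peer)    { (void)peer; return 0; } /* succeeds */
```

With the list {openib, tcp}, the call falls through to index 1 (tcp); with 
only openib in the list it returns -1, which corresponds to the case where no 
other BTL is available.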

> 
> I think that the majority of this discussion about the BML failure (or not) 
> behavior assumed that *all* processes had the same failure (at least: *I* 
> assumed this).  But if only *some* of the processes fail a given BTL 
> add_procs, we have a problem because we're beyond the point of deciding who 
> can connect to whom.  Shutting down a single BTL module at that point will 
> create an inconsistency of the distributed data.

We did assume that errors are at least symmetric, i.e. if A fails to 
connect to B, then B will fail when trying to connect to A. However, if other 
BTLs are available, the connection is supposed to move smoothly over to one 
of them. As an example, in the MX BTL, if two nodes have MX support but do 
not share the same mapper, add_procs will silently fail.
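
The symmetry assumption could be enforced rather than assumed. The sketch 
below is only an illustration in C, not what the BML or any BTL does today: 
NPROCS, the reach[] array and symmetric_mask() are hypothetical, and the code 
assumes every process already holds everyone's reachability mask (in a real 
job that would need a collective exchange, for example MPI_Allgather).

```c
#include <assert.h>
#include <stdint.h>

#define NPROCS 4  /* illustrative job size, an assumption for this sketch */

/* reach[i] has bit j set when process i believes it can reach process j
 * over a given BTL. Return process `me`'s mask with every bit cleared
 * for which the peer does not also claim to reach `me`. */
uint32_t symmetric_mask(const uint32_t reach[NPROCS], int me)
{
    assert(me >= 0 && me < NPROCS);
    uint32_t out = 0;
    for (int peer = 0; peer < NPROCS; peer++) {
        int me_to_peer = (int)((reach[me] >> peer) & 1u);
        int peer_to_me = (int)((reach[peer] >> me) & 1u);
        if (me_to_peer && peer_to_me)
            out |= (uint32_t)1 << peer;
    }
    return out;
}
```

If process 2 fails its add_procs and only keeps itself in its mask, every 
other process drops bit 2 from its own mask, so nobody tries to reach process 
2 over that BTL.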

  george.

> 
> That seems wrong.
> 
> What to do?
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

