On Jun 2, 2010, at 12:02 PM, Ashley Pittman wrote:

> > Ah, this is the key.  If I have one process (out of many) fail the
> > create_cq() function, I get a segv during finalize.  I'll dig.
>
> Is there an assumption that if process A claims to be able to communicate
> with process B that process B can also communicate with process A.  It almost
> sounds like the code needs to do a allreduce on the bitmask returned by the
> btls.

Actually, this is exactly the case (I just dug into the code and
verified this).

In this case, we're already well beyond the point where we synchronized
and decided who can connect to whom.  I.e., the modex is already done:
the openib BTL in process X has decided that it is available and has
advertised its RDMACM CPC and OOB CPC contact info.

But then later in process X, during the openib BTL add_procs, something
fails.  So the openib BTL clears the connect bits and transparently
fails over to TCP.  No problem.

The problem is the other peers, which think that they can still connect
to process X via the openib BTL.

1. In this case, the openib BTL was not finalized, so there was a stub
   still there listening on the RDMACM CPC.  When another process tried
   to connect to X's RDMACM CPC port, Bad Things happened (because it
   was only half set up) and we segv'ed.

   Obviously, this should be fixed.  "Fixed" in this case probably means
   closing down the RDMACM CPC listening port.  But then that leads to
   another form of Badness.

2. If the openib BTL cleanly shuts down and is *not* still listening on
   its modex-advertised RDMACM CPC contact port, then if some other
   process tries to contact process X via the modex info, it'll fail.
   This will then be judged to be a fatal error.  Failover in the BML
   will simply have delayed the job abort until someone tries to contact
   X via the openib BTL.

I think that the majority of this discussion about the BML failure (or
not) behavior assumed that *all* processes had the same failure (at
least: *I* assumed this).  But if only *some* of the processes fail a
given BTL add_procs, we have a problem because we're beyond the point of
deciding who can connect to whom.  Shutting down a single BTL module at
that point will make the distributed connectivity data inconsistent:
X's peers still hold modex info saying that X is reachable via openib.

That seems wrong.

What to do?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
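
For illustration only, here is a minimal sketch of the allreduce idea
raised in the quoted question, simulated in plain Python rather than
written against Open MPI.  Every name in it (allreduce_bor,
local_contribution, usable_peers) is hypothetical, and it assumes that a
collective step could run after add_procs, which this thread does not
establish.  Each rank contributes one bit saying whether its own openib
setup succeeded; after a bitwise-OR reduction every rank holds the same
mask and can avoid peers whose bit is clear.

```python
from functools import reduce


def allreduce_bor(contributions):
    """Simulate an allreduce with bitwise OR: every rank gets the same mask."""
    combined = reduce(lambda a, b: a | b, contributions, 0)
    return [combined for _ in contributions]


def local_contribution(rank, openib_ok):
    """A rank sets only its own bit, and only if its local setup succeeded."""
    return (1 << rank) if openib_ok else 0


def usable_peers(rank, agreed_mask, nprocs):
    """Peers this rank may reach via openib: both ends must have their bit set."""
    if not (agreed_mask >> rank) & 1:
        return []  # this rank failed locally, so it uses another transport
    return [p for p in range(nprocs) if p != rank and (agreed_mask >> p) & 1]


# Four ranks; rank 2 plays the role of "process X", whose add_procs failed
# (an assumed scenario for this sketch).
status = [True, True, False, True]
masks = allreduce_bor([local_contribution(r, ok) for r, ok in enumerate(status)])
peers = [usable_peers(r, masks[r], len(status)) for r in range(len(status))]
```

With this agreed mask, no rank lists rank 2 as an openib peer, so nobody
tries the stale contact info.  The cost is an extra synchronization
point, and the sketch says nothing about failures that happen after it.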