I wasted a bunch of time today debugging a scenario where openib->add_procs() was (legitimately) failing during MPI_INIT. Specifically: an openib BTL module had successfully been initialized, but then was failing during add_procs(). I say "legitimately" failing because something external was causing add_procs to fail (i.e., a misconfiguration on my cluster). By "fail", I mean add_procs() returned != OMPI_SUCCESS.

The problem is that OMPI does not handle this situation gracefully; every MPI process dumped core.

My question is: what exactly should happen when BTL add_procs() fails? Is the BTL expected to recover? What if the BTL has no procs as a result of this failure; should the PML (or BML) remove it from progress loops? Or should the BTL be able to cope with progress being called on its component anyway (which seems kinda wasteful)?

Or should the failure of add_procs() be a fatal error? If so, what should the BTL do? The PML's error_cb has not yet been registered, and returning != OMPI_SUCCESS does not [currently] cause the PML to abort. This fact seems to indicate to me that the PML/BTL designers envisioned that the MPI process should be able to continue. But I'm not sure that I agree with that assessment: we have a successfully initialized BTL module, but an error occurred during add_procs(). Shouldn't we gracefully abort?

My $0.02:

- if the BTL returns != OMPI_SUCCESS from add_procs(), the PML should gracefully abort.
- if a BTL fails add_procs() in a non-fatal way, it can set all reachable bits to 0 and return OMPI_SUCCESS. The PML will therefore effectively ignore it.
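
To make the two rules above concrete, here is a rough sketch in C. It is a toy model only, not the real BTL/PML interface: toy_btl_t, toy_btl_add_procs(), toy_pml_add_procs(), the SKETCH_* return codes, and the plain bool array standing in for the reachable bitmap are all made-up names for illustration. The intent is just to show the contract: a non-success return tells the caller to abort, while a success return with no reachable bits set means the BTL is simply left out of the progress loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PROCS 8

/* Hypothetical stand-ins for OMPI_SUCCESS and an OMPI error code. */
enum { SKETCH_SUCCESS = 0, SKETCH_ERROR = -1 };

/* Outcomes the toy BTL can be told to simulate. */
typedef enum { BTL_OK, BTL_SOFT_FAIL, BTL_HARD_FAIL } btl_mode_t;

typedef struct {
    btl_mode_t mode;        /* which outcome add_procs should simulate */
    bool in_progress_loop;  /* would the PML call progress on this BTL? */
} toy_btl_t;

/* Toy add_procs: fills in reachable[] and returns a status code. */
int toy_btl_add_procs(const toy_btl_t *btl, size_t nprocs, bool reachable[])
{
    assert(nprocs <= MAX_PROCS);

    /* Start with every peer marked unreachable. */
    for (size_t i = 0; i < nprocs; i++) {
        reachable[i] = false;
    }

    switch (btl->mode) {
    case BTL_HARD_FAIL:
        /* Rule 1: a fatal problem is reported through the return code. */
        return SKETCH_ERROR;
    case BTL_SOFT_FAIL:
        /* Rule 2: non-fatal failure, so report success with no peers. */
        return SKETCH_SUCCESS;
    case BTL_OK:
    default:
        for (size_t i = 0; i < nprocs; i++) {
            reachable[i] = true;
        }
        return SKETCH_SUCCESS;
    }
}

/* Toy PML: returns an error when the caller should abort gracefully;
 * otherwise records which BTLs are worth progressing. */
int toy_pml_add_procs(toy_btl_t *btls, size_t nbtls, size_t nprocs)
{
    for (size_t b = 0; b < nbtls; b++) {
        bool reachable[MAX_PROCS];
        int rc = toy_btl_add_procs(&btls[b], nprocs, reachable);
        if (rc != SKETCH_SUCCESS) {
            return rc;  /* caller is expected to abort cleanly */
        }

        /* A BTL that can reach nobody is ignored from here on. */
        bool any = false;
        for (size_t i = 0; i < nprocs; i++) {
            any = any || reachable[i];
        }
        btls[b].in_progress_loop = any;
    }
    return SKETCH_SUCCESS;
}
```

With this shape, the BTL never has to tolerate progress being called on a module with no peers, and the PML never has to guess whether a failure was fatal: the return code carries that decision.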

Comments? I'd like to fix the openib btl's add_procs() one way or another for v1.3.

--
Jeff Squyres
Cisco Systems
