I wasted a bunch of time today debugging a scenario where
openib->add_procs() was (legitimately) failing during MPI_INIT.
Specifically: an openib BTL module had successfully been initialized,
but then was failing during add_procs(). I say "legitimately" failing
because something external was causing add_procs to fail (i.e., a
misconfiguration on my cluster). By "fail", I mean add_procs()
returned != OMPI_SUCCESS.
The problem is that OMPI does not handle this situation gracefully;
every MPI process dumped core.
My question is: what exactly should happen when BTL add_procs()
fails? Is the BTL expected to recover? What if the BTL has no procs
as a result of this failure; should the PML (or BML) remove it from
progress loops? Or should the BTL be able to cope with progress being
called on its component even though it has no procs? (That seems kinda
wasteful.)
Or should the failure of add_procs() be a fatal error? If so, what
should the BTL do? The PML's error_cb has not yet been registered,
and returning != OMPI_SUCCESS does not [currently] cause the PML to
abort. This fact seems to indicate to me that the PML/BTL designers
envisioned that the MPI process should be able to continue. But I'm
not sure that I agree with that assessment: we have a successfully
initialized BTL module, but an error occurred during add_procs().
Shouldn't we gracefully abort?
My $0.02:
- if the BTL returns != OMPI_SUCCESS from add_procs(), the PML should
gracefully abort.
- if a BTL fails add_procs() in a non-fatal way, it can set all
reachable bits to 0 and return OMPI_SUCCESS. The PML will therefore
effectively ignore it.
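
To make the two rules above concrete, here is a rough, self-contained sketch of the policy as I'd imagine it on the PML side. This is not the real OMPI code: every sketch_* name, the simplified add_procs signature, and the plain bool array standing in for the reachable bitmap are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Stand-ins for the real return codes; values are illustrative only. */
enum { SKETCH_SUCCESS = 0, SKETCH_ERROR = -1 };
#define SKETCH_MAX_PROCS 64

typedef struct sketch_btl sketch_btl_t;

/* Hypothetical, simplified add_procs: sets reachable[i] for each usable peer. */
typedef int (*sketch_add_procs_fn)(sketch_btl_t *btl, size_t nprocs,
                                   bool *reachable);

struct sketch_btl {
    const char *name;
    sketch_add_procs_fn add_procs;
};

/* What the caller (PML/BML in this sketch) should do with the BTL. */
typedef enum { SKETCH_BTL_USE, SKETCH_BTL_IGNORE, SKETCH_BTL_ABORT } sketch_action_t;

sketch_action_t sketch_pml_add_btl(sketch_btl_t *btl, size_t nprocs)
{
    bool reachable[SKETCH_MAX_PROCS];

    assert(nprocs <= SKETCH_MAX_PROCS);
    memset(reachable, 0, sizeof(reachable));

    /* Rule 1: any non-success return is fatal; caller aborts gracefully. */
    if (btl->add_procs(btl, nprocs, reachable) != SKETCH_SUCCESS) {
        return SKETCH_BTL_ABORT;
    }

    /* Rule 2: success with no reachable peers means "ignore this BTL",
     * i.e. never add it to the progress loops. */
    for (size_t i = 0; i < nprocs; i++) {
        if (reachable[i]) {
            return SKETCH_BTL_USE;
        }
    }
    return SKETCH_BTL_IGNORE;
}

/* Example BTL behaviors (all hypothetical). */

/* Healthy BTL: every peer is reachable. */
int sketch_add_procs_ok(sketch_btl_t *btl, size_t nprocs, bool *reachable)
{
    (void)btl;
    for (size_t i = 0; i < nprocs; i++) {
        reachable[i] = true;
    }
    return SKETCH_SUCCESS;
}

/* Non-fatal failure: clear all reachable bits, still return success. */
int sketch_add_procs_soft_fail(sketch_btl_t *btl, size_t nprocs, bool *reachable)
{
    (void)btl;
    for (size_t i = 0; i < nprocs; i++) {
        reachable[i] = false;
    }
    return SKETCH_SUCCESS;
}

/* Fatal failure (e.g. a misconfigured cluster): report an error. */
int sketch_add_procs_fatal(sketch_btl_t *btl, size_t nprocs, bool *reachable)
{
    (void)btl; (void)nprocs; (void)reachable;
    return SKETCH_ERROR;
}
```

The point of the split is that the return code carries only "something is badly wrong", while the reachable bits carry "don't bother with me". The caller never has to guess which one the BTL meant.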
Comments? I'd like to fix the openib btl's add_procs() one way or
another for v1.3.
--
Jeff Squyres
Cisco Systems