My thought is that if add_procs fails, then that BTL should be removed
(as if init failed) and things should continue on. If that BTL was
the only way to reach another process, we'll catch that later and abort.
There are always going to be errors that can't be detected until the
device is actually used, so I think that add_procs errors should be
treated the same as init errors. The error_cb is a red herring, as
that's supposed to be used in situations where an error can't directly
be returned to the upper layers (like the progress function). In this
case, we can directly return an error, so we should do so (and I
believe we do; it's the BML/PML that's the problem).
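
A minimal sketch of that policy, assuming a heavily simplified model of the BML: every name here (sketch_btl_t, sketch_bml_add_procs, the SKETCH_* codes) is hypothetical and is not the real Open MPI interface. A module whose add_procs fails is deactivated as if its init had failed, and the unreachable-peer check happens only after every module has been tried.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical return codes and limits; not the real OMPI_* constants. */
#define SKETCH_SUCCESS       0
#define SKETCH_ERROR        (-1)
#define SKETCH_UNREACHABLE  (-2)
#define SKETCH_MAX_PROCS     8

/* Hypothetical stand-in for a BTL module; not mca_btl_base_module_t. */
typedef struct sketch_btl {
    const char *name;
    bool active;  /* false once the module has been dropped */
    /* Sets reachable[i] for every peer this module can talk to. */
    int (*add_procs)(struct sketch_btl *btl, size_t nprocs, bool *reachable);
} sketch_btl_t;

/* Example module whose add_procs always fails (e.g. a misconfiguration). */
int sketch_add_procs_fail(sketch_btl_t *btl, size_t nprocs, bool *reachable)
{
    (void)btl; (void)nprocs; (void)reachable;
    return SKETCH_ERROR;
}

/* Example module that can reach every peer. */
int sketch_add_procs_all(sketch_btl_t *btl, size_t nprocs, bool *reachable)
{
    (void)btl;
    for (size_t p = 0; p < nprocs; ++p) {
        reachable[p] = true;
    }
    return SKETCH_SUCCESS;
}

/* BML-side policy: drop failing modules, then verify every peer is covered. */
int sketch_bml_add_procs(sketch_btl_t *btls, size_t nbtls, size_t nprocs)
{
    bool covered[SKETCH_MAX_PROCS] = { false };

    if (nprocs > SKETCH_MAX_PROCS) {
        return SKETCH_ERROR;
    }

    for (size_t b = 0; b < nbtls; ++b) {
        bool reachable[SKETCH_MAX_PROCS] = { false };

        if (!btls[b].active) {
            continue;
        }
        if (btls[b].add_procs(&btls[b], nprocs, reachable) != SKETCH_SUCCESS) {
            /* Treat the failure like a failed init: remove and move on. */
            btls[b].active = false;
            continue;
        }
        for (size_t p = 0; p < nprocs; ++p) {
            if (reachable[p]) {
                covered[p] = true;
            }
        }
    }

    /* Only now decide whether losing a module was actually fatal. */
    for (size_t p = 0; p < nprocs; ++p) {
        if (!covered[p]) {
            return SKETCH_UNREACHABLE;
        }
    }
    return SKETCH_SUCCESS;
}
```

With one failing and one working module, sketch_bml_add_procs() returns SKETCH_SUCCESS and only the failing module is marked inactive; with the failing module alone, it returns SKETCH_UNREACHABLE, which is where the caller would abort.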
Brian

On Aug 1, 2008, at 8:03 PM, Jeff Squyres wrote:
I wasted a bunch of time today debugging a scenario where
openib->add_procs() was (legitimately) failing during MPI_INIT.
Specifically: an openib BTL module had successfully been
initialized, but then was failing during add_procs(). I say
"legitimately" failing because something external was causing
add_procs to fail (i.e., a misconfiguration on my cluster). By
"fail", I mean add_procs() returned != OMPI_SUCCESS.
The problem is that OMPI does not handle this situation gracefully;
every MPI process dumped core.
My question is: what exactly should happen when BTL add_procs()
fails? Is the BTL expected to recover? What if the BTL has no
procs as a result of this failure; should the PML (or BML) remove it
from progress loops? Or should the BTL be able to handle progress
being called on its component (which seems kinda wasteful)?
Or should the failure of add_procs() be a fatal error? If so, what
should the BTL do? The PML's error_cb has not yet been registered,
and returning != OMPI_SUCCESS does not [currently] cause the PML to
abort. This fact seems to indicate to me that the PML/BTL designers
envisioned that the MPI process should be able to continue. But I'm
not sure that I agree with that assessment: we have a successfully
initialized BTL module, but an error occurred during add_procs().
Shouldn't we gracefully abort?
My $0.02:
- if the BTL returns != OMPI_SUCCESS from add_procs(), the PML
should gracefully abort.
- if a BTL fails add_procs() in a non-fatal way, it can set all
reachable bits to 0 and return OMPI_SUCCESS. The PML will therefore
effectively ignore it.
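
The two bullets above can be sketched as follows. This is an illustration under stated assumptions only: the reachability bitmap is modeled as a plain bool array, and demo_btl_add_procs, demo_pml_should_abort, and the DEMO_* names are hypothetical, not the real BTL or PML API.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical return codes; not the real OMPI_* constants. */
#define DEMO_SUCCESS  0
#define DEMO_ERROR   (-1)

/* Hypothetical outcome of the BTL's internal per-peer setup. */
typedef enum {
    DEMO_SETUP_OK,         /* connections to all peers are usable */
    DEMO_SETUP_SOFT_FAIL,  /* BTL is unusable, but the job may continue */
    DEMO_SETUP_FATAL       /* unrecoverable; the job should abort */
} demo_setup_t;

/* BTL side: a non-fatal failure reports "no peers reachable" and success. */
int demo_btl_add_procs(demo_setup_t outcome, size_t nprocs, bool *reachable)
{
    /* Start from an all-zero reachability map in every case. */
    for (size_t p = 0; p < nprocs; ++p) {
        reachable[p] = false;
    }

    switch (outcome) {
    case DEMO_SETUP_OK:
        for (size_t p = 0; p < nprocs; ++p) {
            reachable[p] = true;
        }
        return DEMO_SUCCESS;
    case DEMO_SETUP_SOFT_FAIL:
        /* All bits stay 0, so the upper layer effectively ignores this BTL. */
        return DEMO_SUCCESS;
    case DEMO_SETUP_FATAL:
    default:
        return DEMO_ERROR;
    }
}

/* PML side: any non-success return from add_procs means abort gracefully. */
bool demo_pml_should_abort(int add_procs_rc)
{
    return add_procs_rc != DEMO_SUCCESS;
}
```

Under this split the return code carries only "fatal or not", and the reachability map carries "usable or not", so the PML needs no extra channel to tell the two cases apart.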
Comments? I'd like to fix the openib btl's add_procs() one way or
another for v1.3.
--
Jeff Squyres
Cisco Systems