[OMPI devel] if btl->add_procs() fails...?

Jeff Squyres Fri, 1 Aug 2008 22:03:10 -0400

I wasted a bunch of time today debugging a scenario where openib->add_procs() was (legitimately) failing during MPI_INIT. Specifically: an openib BTL module had successfully been initialized, but then was failing during add_procs(). I say "legitimately" failing because something external was causing add_procs() to fail (i.e., a misconfiguration on my cluster). By "fail", I mean add_procs() returned != OMPI_SUCCESS.

The problem is that OMPI does not handle this situation gracefully; every MPI process dumped core.

My question is: what exactly should happen when BTL add_procs() fails? Is the BTL expected to recover? What if the BTL has no procs as a result of this failure; should the PML (or BML) remove it from progress loops? Or should the BTL be able to handle progress being called on its component anyway? (which seems kinda wasteful)

Or should the failure of add_procs() be a fatal error? If so, what should the BTL do? The PML's error_cb has not yet been registered, and returning != OMPI_SUCCESS does not [currently] cause the PML to abort. This fact seems to indicate to me that the PML/BTL designers envisioned that the MPI process should be able to continue. But I'm not sure that I agree with that assessment: we have a successfully initialized BTL module, but an error occurred during add_procs(). Shouldn't we gracefully abort?


My $0.02:

- if the BTL returns != OMPI_SUCCESS from add_procs(), the PML should gracefully abort.
- if a BTL fails add_procs() in a non-fatal way, it can set all reachable bits to 0 and return OMPI_SUCCESS. The PML will therefore effectively ignore it.
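
To make the two rules above concrete, here is a rough, self-contained sketch in C. It is a toy model only, not the actual OMPI code: every name in it (toy_btl_t, toy_reachable_t, toy_add_procs(), toy_pml_add_procs(), TOY_SUCCESS, TOY_ERROR) is made up for illustration, and the real BTL/PML interfaces and bitmap type look different. It just shows the intended control flow: a non-success return is treated as fatal by the caller, while a BTL that returns success with no reachable bits set is silently skipped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Toy stand-ins; none of these are real Open MPI types or constants. */
enum { TOY_SUCCESS = 0, TOY_ERROR = -1 };
#define TOY_MAX_PROCS 8

typedef struct {
    bool bits[TOY_MAX_PROCS];  /* bits[i] set => peer i is reachable via this BTL */
} toy_reachable_t;

typedef enum { TOY_OK, TOY_SOFT_FAILURE, TOY_HARD_FAILURE } toy_outcome_t;

typedef struct {
    const char   *name;
    toy_outcome_t outcome;     /* what the simulated add_procs() will do */
} toy_btl_t;

/* Simulated BTL add_procs(): reports reachability through the bitmap. */
static int toy_add_procs(const toy_btl_t *btl, size_t nprocs,
                         toy_reachable_t *reachable)
{
    size_t i;

    assert(btl != NULL && reachable != NULL);

    /* Start from "nobody is reachable". */
    for (i = 0; i < TOY_MAX_PROCS; ++i) {
        reachable->bits[i] = false;
    }

    switch (btl->outcome) {
    case TOY_HARD_FAILURE:
        /* Rule 1: a fatal problem is reported through the return code. */
        return TOY_ERROR;
    case TOY_SOFT_FAILURE:
        /* Rule 2: non-fatal problem; leave all bits at 0, report success. */
        return TOY_SUCCESS;
    case TOY_OK:
    default:
        for (i = 0; i < nprocs && i < TOY_MAX_PROCS; ++i) {
            reachable->bits[i] = true;
        }
        return TOY_SUCCESS;
    }
}

/* Simulated PML side: returns an error as soon as any BTL fails hard (the
 * caller would then abort gracefully), and counts only BTLs that can reach
 * at least one peer. */
static int toy_pml_add_procs(const toy_btl_t *btls, size_t nbtls,
                             size_t nprocs, size_t *nusable)
{
    size_t b, i;

    *nusable = 0;
    for (b = 0; b < nbtls; ++b) {
        toy_reachable_t reachable;
        size_t nreach = 0;
        int rc = toy_add_procs(&btls[b], nprocs, &reachable);

        if (rc != TOY_SUCCESS) {
            fprintf(stderr, "BTL %s: add_procs failed (%d); aborting\n",
                    btls[b].name, rc);
            return rc;
        }
        for (i = 0; i < TOY_MAX_PROCS; ++i) {
            if (reachable.bits[i]) {
                ++nreach;
            }
        }
        if (nreach == 0) {
            continue;  /* no reachable peers: effectively ignore this BTL */
        }
        ++*nusable;
    }
    return TOY_SUCCESS;
}
```

With one soft-failing BTL and one healthy BTL, toy_pml_add_procs() returns TOY_SUCCESS and reports a single usable BTL; with a hard-failing BTL it returns TOY_ERROR, which is the point where a real PML would abort cleanly instead of dumping core later.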

Comments? I'd like to fix the openib btl's add_procs() one way or another for v1.3.


--
Jeff Squyres
Cisco Systems
