Re: [OMPI devel] if btl->add_procs() fails...?

Jeff Squyres Mon, 4 Aug 2008 09:41:01 -0400

On Aug 2, 2008, at 2:34 PM, Brian Barrett wrote:

I am curious how all of the above affects client/server or spawned jobs. If you finalize a BTL then do a connect to a process that would use that BTL would it reinitialize itself?
To deal with all the dynamics issues, I wouldn't finalize the BTL. The BML should handle the progress stuff, just as if the add_procs succeeded but returned no active peers. But I'd guess that's part of the bit that doesn't work today. I would further suspect that a BTL will need to have a working progress function in the face of add_procs failures to cope with all the dynamics options. I'm travelling this weekend, so I can't verify any of this at the moment.

This seems a little different than the rest of the code base -- we're talking about having the BTL return an error but have the upper level not treat it as a fatal error.

I think we actually have a few different situations ("fail" means "not returning OMPI_SUCCESS"):

1. btl component init fails (only during MPI_INIT): the API supports no notion of failure -- it either returns modules or not (i.e., valid pointers or NULL). If NULL is returned, the component is ignored and unloaded.

2. btl add_procs during MPI_INIT fails: this is under debate
3. btl add_procs during MPI-2 dynamics fails: this is under debate

For #2 and #3, I suspect that only the BTL knows if it can continue or not. For example, a failure in #3 may cause the entire BTL to be hosed such that it can no longer communicate with procs that it previously successfully added (e.g., in MPI_INIT). So we really need add_procs to be able to return multiple things:


A. OMPI_SUCCESS / all was good

B. a non-fatal error occurred such that this BTL cannot communicate with the desired peers, but the upper level PML can continue

C. a fatal error has occurred such that the upper level should abort (or, more specifically, do whatever the error manager says)

I think that for B in both #2 and #3, we can just have the BTL set all the reachability bits to 0 and return OMPI_SUCCESS. But for C, the BTL should return != OMPI_SUCCESS. The PML should treat it as a fatal error and therefore call the error manager.
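
To make the A/B/C convention concrete, here is a minimal, self-contained C sketch. It is not Open MPI code: every name in it (sketch_add_procs, sketch_pml_handle, the SKETCH_* constants, and the "outcome" argument that simulates what happened inside the BTL) is hypothetical, and a plain bool array stands in for the real reachability bitmap. It only illustrates the idea that "no peers reachable" is reported as success with all bits cleared, while a non-success return code is reserved for fatal errors.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical return codes; stand-ins for OMPI_SUCCESS and an error code. */
enum { SKETCH_SUCCESS = 0, SKETCH_ERR_FATAL = -1 };

/* Hypothetical: simulates what happened inside the BTL during add_procs. */
typedef enum {
    SKETCH_OUTCOME_OK,        /* case A: all was good                      */
    SKETCH_OUTCOME_NONFATAL,  /* case B: cannot reach the desired peers    */
    SKETCH_OUTCOME_FATAL      /* case C: the BTL is unusable               */
} sketch_outcome_t;

/* Hypothetical: what the upper level decides to do with this BTL. */
typedef enum {
    SKETCH_USE_BTL,      /* at least one peer is reachable          */
    SKETCH_SKIP_BTL,     /* success, but no peer is reachable       */
    SKETCH_CALL_ERRMGR   /* fatal: hand off to the error manager    */
} sketch_action_t;

/* BTL side: cases A and B both return success; only C returns an error. */
static int sketch_add_procs(sketch_outcome_t outcome, size_t nprocs,
                            bool *reachable)
{
    assert(reachable != NULL || nprocs == 0);

    /* Start with every reachability bit cleared. */
    for (size_t i = 0; i < nprocs; ++i) {
        reachable[i] = false;
    }

    if (outcome == SKETCH_OUTCOME_FATAL) {
        return SKETCH_ERR_FATAL;           /* case C */
    }
    if (outcome == SKETCH_OUTCOME_OK) {
        for (size_t i = 0; i < nprocs; ++i) {
            reachable[i] = true;           /* case A */
        }
    }
    /* Case B falls through with all bits still 0. */
    return SKETCH_SUCCESS;
}

/* Upper-level side: any non-success return code is treated as fatal. */
static sketch_action_t sketch_pml_handle(int rc, const bool *reachable,
                                         size_t nprocs)
{
    if (rc != SKETCH_SUCCESS) {
        return SKETCH_CALL_ERRMGR;
    }
    for (size_t i = 0; i < nprocs; ++i) {
        if (reachable[i]) {
            return SKETCH_USE_BTL;
        }
    }
    return SKETCH_SKIP_BTL;
}
```

In this sketch the upper level never has to interpret different error codes: a BTL that simply cannot reach anyone looks the same as a BTL whose add_procs succeeded with no active peers.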


I think that this is in line with Brian's original comments, right?

--
Jeff Squyres
Cisco Systems
