WHAT:   Decide how to handle MPI applications where one or more
        processes exit without calling MPI_Finalize

WHY:    Some applications terminate via a plain exit call rather than
        MPI_Abort, typically because a library (or something else)
        calls exit. This situation is outside the user's control, so
        they cannot fix it.

WHERE:  Refer to ticket #1144 - code changes are TBD

WHEN:   Up to the group

TIMEOUT: N/A

============================================================
A user has reported (see ticket #1144) a situation where their Fortran
library can call "exit" on a process in their application. This causes Open
MPI to "hang" when the remaining processes reach MPI_Finalize, because we
have a barrier at the beginning of that procedure and the exited process
never joins it.

There are several possible ways we could resolve this problem, all of which
have their own issues - here are two that immediately come to mind:

(a) the RTE could detect that a process exited without calling MPI_Finalize
and instigate an abort sequence. This would require two things: (1)
inserting something into MPI_Init that notifies the RTE "I am an MPI
process" so the RTE knows to look for finalize; and (2) inserting a call to
notify the RTE that we have indeed called MPI_Finalize. We have both of
these right now in ORTE, but we had agreed to reduce the RTE's involvement
in MPI - hence, the revised ORTE no longer has such detailed knowledge of an
MPI process' state. I could reinsert it, of course - but that does seem to
run counter to what the MPI community here had requested. It also introduces
possible race conditions, though we may be able to control those to some
extent, and we couldn't provide that coverage in all environments (e.g.,
Cray).
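
To make the bookkeeping in (a) concrete, here is a minimal sketch in C of
the decision the RTE would have to make when a process exits. Every name in
it (proc_state_t, rte_note_init, rte_note_finalize, rte_note_exit) is
hypothetical and made up for illustration - none of this is ORTE's actual
API or data structure. It also assumes the notifications arrive in order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-process record kept by the runtime.
 * Illustration only; not ORTE's real data structure. */
typedef struct {
    bool is_mpi;     /* process reported "I am an MPI process" from MPI_Init */
    bool finalized;  /* process reported that it has called MPI_Finalize */
    bool exited;     /* runtime observed the process terminate */
} proc_state_t;

typedef enum { JOB_CONTINUE, JOB_ABORT } job_action_t;

/* Step (1): notification sent from inside MPI_Init. */
static void rte_note_init(proc_state_t *p)
{
    assert(p != NULL);
    p->is_mpi = true;
}

/* Step (2): notification sent from inside MPI_Finalize. */
static void rte_note_finalize(proc_state_t *p)
{
    assert(p != NULL);
    p->finalized = true;
}

/* Called when the runtime sees a process terminate. An MPI process that
 * exits before reporting finalize triggers an abort of the whole job;
 * non-MPI processes and cleanly finalized ones do not. */
static job_action_t rte_note_exit(proc_state_t *p)
{
    assert(p != NULL);
    p->exited = true;
    if (p->is_mpi && !p->finalized) {
        return JOB_ABORT;
    }
    return JOB_CONTINUE;
}
```

Note that the sketch sidesteps the race mentioned above: if the exit is
observed before the finalize notification has been delivered, this logic
would abort a job that actually shut down cleanly.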

(b) we could remove the barrier in MPI_Finalize. While this would resolve
this particular user's cited problem, I'm not convinced it would really
solve the overall problem. For example, if one proc calls exit and the
others enter a collective operation, I believe we will still hang. In
addition, it was my understanding that the barrier in finalize was required
to ensure that this exact scenario did not occur - that all procs remained
alive until everyone was done just to ensure that a collective operation
would not hang. Is this not true?

Does the general community feel we should do anything here, or is this a
"bug" that should be fixed by the entity calling "exit"? I should note that
it actually is bad behavior (IMHO) for any library to call "exit" - but
then, we do that in some situations too, so perhaps we shouldn't cast
stones!

Any suggested solutions or comments on whether or not we should do anything
would be appreciated.

Ralph

