WHAT: Decide how to handle MPI applications where one or more processes exit without calling MPI_Finalize
WHY: Some applications can abort via an exit call instead of calling MPI_Abort when a library (or something else) calls exit. This situation is outside a user's control, so they cannot fix it.

WHERE: Refer to ticket #1144 - code changes are TBD

WHEN: Up to the group

TIMEOUT: N/A

============================================================

A user has reported (see ticket #1144) a situation where their Fortran library can call "exit" on a process in their application. This causes Open MPI to "hang" when the remaining processes reach MPI_Finalize, because we have a barrier at the beginning of that procedure.

There are several possible ways we could resolve this problem, all of which have their own issues. Here are two that immediately come to mind:

(a) The RTE could detect that a process exited without calling finalize, and instigate an abort sequence. This would require two things: (1) inserting something into mpi_init that notifies the RTE "I am an MPI process" so the RTE knows to look for finalize; and (2) inserting a call to notify the RTE that we have indeed called finalize.

We have both of these right now in ORTE, but we had agreed to reduce the RTE's involvement in MPI. Hence, the revised ORTE no longer has such detailed knowledge of an MPI process's state. I could reinsert it, of course, but that does seem to run counter to what the MPI community here had requested. It also introduces possible race conditions, though we may be able to control those to some extent, and we couldn't provide that coverage in all environments (e.g., Cray).

(b) We could remove the barrier in MPI_Finalize. While this would resolve this particular user's cited problem, I'm not convinced it would really solve the overall problem. For example, if one proc calls exit and the others enter a collective operation, I believe we will still hang.
In addition, it was my understanding that the barrier in finalize was required to ensure that this exact scenario did not occur: all procs remain alive until everyone is done, so that a collective operation cannot hang. Is this not true?

Does the general community feel we should do anything here, or is this a "bug" that should be fixed by the entity calling "exit"? I should note that it actually is bad behavior (IMHO) for any library to call "exit" - but then, we do that in some situations too, so perhaps we shouldn't cast stones!

Any suggested solutions or comments on whether or not we should do anything would be appreciated.

Ralph