WHAT: Fix missing code in MPI_Abort
WHY: MPI_Abort is missing logic to ask for termination of the process group defined by the communicator
WHERE: Mostly orte/mca/errmgr
WHEN: Open MPI trunk
TIMEOUT: Tuesday, June 14, 2011 (after teleconf)

Details:
-------------------------------------------
A bitbucket branch is available here (last sync to r24757 of trunk):
  https://bitbucket.org/jjhursey/ompi-abort/

In the MPI Standard (v2.2) Section 8.7, after the introduction of MPI_Abort, it states:
"This routine makes a best attempt to abort all tasks in the group of comm."

Open MPI currently only calls orte_errmgr.abort() to abort the calling process itself. The code to ask for the abort of the other processes in the group defined by the communicator is commented out. Since one process calling abort currently causes all processes in the job to abort, it has not been a big deal. However, as the group starts exploring better resilience in the OMPI layer (with further support from the ORTE layer), this aspect of MPI_Abort will become more necessary to get right.

This branch adds back the logic necessary for a single process calling MPI_Abort to request, from the ORTE errmgr, that a defined subgroup of processes be aborted. Once the request is sent to the HNP, the local process then calls abort on itself. The HNP requests that the defined subgroup of processes be terminated using the existing plm mechanisms for doing so.

This change has no effect on the current default behavior of MPI_Abort as experienced by the user.

--
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey