I have no issue with uncommenting the code. However, I do see a future littered with lots of zombied processes and complaints over poor cleanup again....
On Jun 9, 2011, at 6:08 PM, Joshua Hursey wrote:

> Ah I see what you are getting at now.
>
> The construction of the list of connected processes is something I,
> intentionally, did not modify from the current Open MPI code. The list is
> calculated based on the locally known set of local and remote process groups
> attached to the communicator. So this is the set of directly connected
> processes in the specified communicator known to the calling process at the
> OMPI level.
>
> ORTE is asked to abort this defined set of processes. Once those processes
> are terminated then ORTE needs to eventually inform all of the processes (in
> the jobid(s) specified - maybe other jobids too?) that these processes have
> failed/aborted. Upon notification of the failed/aborted processes the local
> process (at the OMPI level) needs to determine if that process loss is
> critical based upon the error handlers attached to communicators that it
> shares with the failed/aborted processes. That should be handled in the
> callback from the errmgr at the OMPI level, since connectedness is an MPI
> construct. If the process failure/abort is critical to the local process,
> then upon notification the local process can call abort on the communicator
> affected.
>
> So this has the possibility for a rolling abort effect [the abort of one
> communicator triggers the abort of another due to MPI_ERRORS_ARE_FATAL]. From
> which (depending upon the error handlers at the user level) the system will
> eventually converge to either some stable subset of processes or all
> processes aborting, resulting in job termination.
>
> The rolling abort effect relies heavily upon the ability of the runtime to
> make sure that all process failures/aborts are eventually known to all alive
> processes. Since all alive processes will know of the failure/abort, they can
> then determine if they are transitively affected by the failure based upon
> the local list of communicators and associated error handlers.
> But to complete this aspect of the abort procedure, we do need the callback
> mechanism from the runtime - but since ORTE (today) will kill the job for
> OMPI then it is not a big deal for end users since the job will terminate
> anyway. Once we have the callback, then we can finish tightening up the OMPI
> layer code.
>
> It is not perfect, but I think it does address the transitive nature of the
> connectivity of MPI processes by relying on the runtime to provide uniform
> notification of failures. I figure that we will need to look over this code
> again and verify that the implementation of MPI_Comm_disconnect and
> associated underpinnings do the 'right thing' with regard to updating the
> communicator structures. But I think that is best addressed as a second set
> of patches.
>
>
> The goal of this patch is to put back in functionality that was commented out
> during the last reorganization of the errmgr. What will likely follow, once
> we have notification of failure/abort at the OMPI level, is a cleanup of the
> connected groups code paths.
>
>
> -- Josh
>
>
> On Jun 9, 2011, at 6:13 PM, George Bosilca wrote:
>
>> What I'm saying is that there is no reason to have any other type of
>> MPI_Abort if we are not able to compute the set of connected processes.
>>
>> With this RFC the processes on the communicator passed to MPI_Abort will
>> abort. Then the other processes in the same MPI_COMM_WORLD (in fact jobid)
>> will be notified (if we suppose that the ORTE will not make a difference
>> between aborted and faulty). As a result the entire MPI_COMM_WORLD will be
>> aborted, if we consider a sane application where everyone uses the same type
>> of error handler. However, this is not enough.
>> We have to distribute the abort signal
>> to every other process "connected", and I don't see how we can compute this
>> list of connected processes in Open MPI today. It is not that I don't see it
>> in your patch, it is that the definition of the connectivity in the MPI
>> standard is transitive and relies heavily on a correct implementation of
>> MPI_Comm_disconnect.
>>
>> george.
>>
>> On Jun 9, 2011, at 16:59 , Josh Hursey wrote:
>>
>>> On Thu, Jun 9, 2011 at 4:47 PM, George Bosilca <bosi...@eecs.utk.edu> wrote:
>>>> If this changes the behavior of MPI_Abort to only abort processes on the
>>>> specified communicator, how does this not affect the default user
>>>> experience (when today it aborts everything)?
>>>
>>> Open MPI does abort everything by default - decided by the runtime at
>>> the moment (but addressed in your RFC). So it does not matter if one
>>> process aborts or if many do. So the behavior of MPI_Abort experienced
>>> by the user will not change. Effectively the only change is an extra
>>> message in the runtime before the process actually calls
>>> errmgr.abort().
>>>
>>> This branch just makes the implementation complete by first telling
>>> ORTE that a group of processes, defined by the communicator, should be
>>> terminated along with the calling process. Currently ORTE notices that
>>> there was an abort, and terminates the job. Once your RFC goes through
>>> then this may no longer be the case, and OMPI can determine what to do
>>> when it receives a process failure notification.
>>>
>>>>
>>>> If we accept the fact that MPI_Abort will only abort the processes in the
>>>> current communicator, what happens with the other processes in the same
>>>> MPI_COMM_WORLD (but not on the communicator that has been used by
>>>> MPI_Abort)?
>>>
>>> Currently, ORTE will abort them as well. When your RFC goes through
>>> then the OMPI layer will be notified of the error and can take the
>>> appropriate action, as determined by the MPI standard.
>>>
>>>> What about all the other connected processes (based on the connectivity as
>>>> defined in the MPI standard in Section 10.5.4)? Do they see this as a
>>>> fault?
>>>
>>> They are informed of the fault via the ORTE errmgr callback routine
>>> (that we have an RFC for), and then can take the appropriate action
>>> based on MPI semantics. So we are pushing the decision of the
>>> implication of the fault to the OMPI layer - where it should be.
>>>
>>>
>>> The remainder of the OMPI layer logic for MPI_ERRORS_RETURN and other
>>> connected error management scenarios is not included in this patch
>>> since that depends on there being a callback to the OMPI layer - which
>>> does not exist just yet. So this is a small patch to wire in the ORTE
>>> piece to allow the OMPI layer to request a set of processes to be
>>> terminated - to more accurately support MPI_Abort semantics.
>>>
>>> Does that answer your questions?
>>>
>>> -- Josh
>>>
>>>
>>>>
>>>> george.
>>>>
>>>> On Jun 9, 2011, at 16:32 , Josh Hursey wrote:
>>>>
>>>>> WHAT: Fix missing code in MPI_Abort
>>>>>
>>>>> WHY: MPI_Abort is missing logic to ask for termination of the process
>>>>> group defined by the communicator
>>>>>
>>>>> WHERE: Mostly orte/mca/errmgr
>>>>>
>>>>> WHEN: Open MPI trunk
>>>>>
>>>>> TIMEOUT: Tuesday, June 14, 2011 (after teleconf)
>>>>>
>>>>> Details:
>>>>> -------------------------------------------
>>>>> A bitbucket branch is available here (last sync to r24757 of trunk)
>>>>> https://bitbucket.org/jjhursey/ompi-abort/
>>>>>
>>>>> In the MPI Standard (v2.2) Section 8.7 after the introduction of
>>>>> MPI_Abort, it states:
>>>>> "This routine makes a best attempt to abort all tasks in the group of
>>>>> comm."
>>>>>
>>>>> Open MPI currently only calls orte_errmgr.abort() to abort the calling
>>>>> process itself. The code to ask for the abort of the other processes
>>>>> in the group defined by the communicator is commented out.
>>>>> Since one process calling abort currently causes all processes in the
>>>>> job to abort, it has not been a big deal. However as the group starts
>>>>> exploring better resilience in the OMPI layer (with further support
>>>>> from the ORTE layer) this aspect of MPI_Abort will become more
>>>>> necessary to get right.
>>>>>
>>>>> This branch adds back the logic necessary for a single process calling
>>>>> MPI_Abort to request, from ORTE errmgr, that a defined subgroup of
>>>>> processes be aborted. Once the request is sent to the HNP, the local
>>>>> process then calls abort on itself. The HNP requests that the defined
>>>>> subgroup of processes be terminated using the existing plm mechanisms
>>>>> for doing so.
>>>>>
>>>>> This change has no effect on the current default user-experienced
>>>>> behavior of MPI_Abort.
>>>>>
>>>>> --
>>>>> Joshua Hursey
>>>>> Postdoctoral Research Associate
>>>>> Oak Ridge National Laboratory
>>>>> http://users.nccs.gov/~jjhursey
>>>>> _______________________________________________
>>>>> devel mailing list
>>>>> de...@open-mpi.org
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
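
For anyone following the "rolling abort" argument quoted above, here is a
minimal Python sketch of how the set of aborted processes might converge.
It is a toy model, not Open MPI or ORTE code. The function name
rolling_abort, the (members, fatal) tuples and the example ranks are all
made up for illustration. It assumes the runtime eventually tells every
survivor about every failure, and that a communicator has a single error
handler shared by all of its members (in real MPI each process sets its
own).

```python
def rolling_abort(communicators, initially_aborted):
    """Return the set of processes that end up aborted.

    communicators: list of (members, fatal) pairs, where members is a set of
        process ids and fatal is True when the communicator's error handler
        behaves like MPI_ERRORS_ARE_FATAL (hypothetical simplification).
    initially_aborted: the group of the communicator given to MPI_Abort.
    """
    aborted = set(initially_aborted)
    changed = True
    while changed:  # iterate until no further communicator is pulled in
        changed = False
        for members, fatal in communicators:
            if not fatal:
                continue  # survivors on this communicator just see an error
            # A fatal communicator that lost a member takes down the rest.
            if members & aborted and not members <= aborted:
                aborted |= members
                changed = True
    return aborted


# Hypothetical job: ranks 0-3 share a "world", ranks 4-5 are a second job.
comms_return = [
    ({0, 1, 2, 3}, True),   # world of job A, errors are fatal
    ({0, 1}, True),         # sub-communicator used in the abort call
    ({3, 4}, False),        # bridge between jobs, errors are returned
    ({4, 5}, True),         # world of job B, errors are fatal
]
# Same layout, but the bridge communicator also treats errors as fatal.
comms_fatal = [(m, True) for m, _ in comms_return]

stable = rolling_abort(comms_return, {0, 1})
everyone = rolling_abort(comms_fatal, {0, 1})
```

In the first layout the abort stops at the bridge communicator, so ranks 4
and 5 form the stable surviving subset. In the second layout the abort rolls
across the bridge and the whole job goes down.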