Just a reminder for those not on the call that this RFC is scheduled to go in later today.
-- Josh

On Fri, Jun 10, 2011 at 8:53 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
> On Jun 10, 2011, at 6:48 AM, Josh Hursey wrote:
>
>> Why would this patch result in zombied processes and poor cleanup?
>> When ORTE receives notification of a process terminating/aborting, it
>> triggers the termination of the job (without UTK's RFC), which should
>> ensure a clean shutdown. This patch just tells ORTE that a few other
>> processes should be the first to die, which will trigger the same
>> response in ORTE.
>>
>> I guess I'm unclear about this concern, since it would be a concern in
>> the current ORTE as well. I agree that it will be a concern once we
>> have the OMPI layer handling error management triggered off of a
>> callback, but that is a different RFC.
>
> My comment was to "the future" - i.e., looking to the point where we get
> layered, rolling aborts.
>
> I agree that this specific RFC won't change the current behavior, and as
> I said, I have no issue with it.
>
>> Something that might help those listening to this thread: the current
>> behavior of MPI_Abort in OMPI results in the semantics of:
>> --------------
>> internal_MPI_Abort(MPI_COMM_SELF, exit_code)
>> --------------
>> regardless of the communicator actually passed to MPI_Abort at the
>> application level. It should be:
>> --------------
>> internal_MPI_Abort(comm_provided, exit_code)
>> --------------
>>
>> Semantically, this patch just makes the group actually being aborted
>> match the communicator provided. In practice, the job will terminate
>> when any process in the job calls abort - so the result (in today's
>> codebase) will be the same.
>>
>> -- Josh
>>
>> On Fri, Jun 10, 2011 at 7:30 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>> I have no issue with uncommenting the code. However, I do see a future
>>> littered with lots of zombied processes and complaints over poor
>>> cleanup again....
>>>
>>> On Jun 9, 2011, at 6:08 PM, Joshua Hursey wrote:
>>>
>>>> Ah, I see what you are getting at now.
>>>>
>>>> The construction of the list of connected processes is something I,
>>>> intentionally, did not modify from the current Open MPI code. The
>>>> list is calculated based on the locally known set of local and remote
>>>> process groups attached to the communicator. So this is the set of
>>>> directly connected processes in the specified communicator known to
>>>> the calling process at the OMPI level.
>>>>
>>>> ORTE is asked to abort this defined set of processes. Once those
>>>> processes are terminated, ORTE needs to eventually inform all of the
>>>> processes (in the jobid(s) specified - maybe other jobids too?) that
>>>> these processes have failed/aborted. Upon notification of the
>>>> failed/aborted processes, the local process (at the OMPI level) needs
>>>> to determine whether that process loss is critical, based upon the
>>>> error handlers attached to communicators that it shares with the
>>>> failed/aborted processes. That should be handled in the callback from
>>>> the errmgr at the OMPI level, since connectedness is an MPI
>>>> construct. If the process failure/abort is critical to the local
>>>> process, then upon notification the local process can call abort on
>>>> the affected communicator.
>>>>
>>>> So this has the possibility for a rolling abort effect [the abort of
>>>> one communicator triggers the abort of another due to
>>>> MPI_ERRORS_ARE_FATAL], from which (depending upon the error handlers
>>>> at the user level) the system will eventually converge to either some
>>>> stable subset of processes, or all processes aborting, resulting in
>>>> job termination.
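Aside, for those following along: the error-handler distinction that the
rolling-abort scenario above hinges on looks roughly like this at the
application level. This is a minimal sketch, not code from the patch; the
function name check_peers and the shared_comm argument are made up, and
whether a peer failure is actually reported back (rather than the runtime
killing the whole job, as it does today) depends on the callback work this
thread discusses.
--------------
#include <mpi.h>
#include <stdio.h>

/* Sketch: whether the loss of a peer is "critical" to this process is,
 * in MPI terms, governed by the error handler attached to the
 * communicators it shares with that peer. */
void check_peers(MPI_Comm shared_comm)
{
    /* MPI_ERRORS_ARE_FATAL (the default) gives the rolling-abort case:
     * a reported failure on shared_comm aborts this process as well.
     * MPI_ERRORS_RETURN hands the error back to the application. */
    MPI_Comm_set_errhandler(shared_comm, MPI_ERRORS_RETURN);

    int rc = MPI_Barrier(shared_comm);
    if (MPI_SUCCESS != rc) {
        char msg[MPI_MAX_ERROR_STRING];
        int len = 0;
        MPI_Error_string(rc, msg, &len);
        fprintf(stderr, "error reported on shared_comm: %s\n", msg);
        /* Application-level choice: attempt recovery, or escalate with
         * MPI_Abort(shared_comm, rc). */
    }
}
--------------
The OMPI-level decision Joshua describes is roughly the internal analogue of
this: walk the communicators shared with the failed/aborted process and
check which error handler each one carries.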
>>>> The rolling abort effect relies heavily upon the ability of the
>>>> runtime to make sure that all process failures/aborts are eventually
>>>> known to all alive processes. Since every alive process will know of
>>>> the failure/abort, it can then determine whether it is transitively
>>>> affected by the failure, based upon its local list of communicators
>>>> and associated error handlers. But to complete this aspect of the
>>>> abort procedure, we do need the callback mechanism from the runtime -
>>>> though since ORTE (today) will kill the job for OMPI, it is not a big
>>>> deal for end users: the job will terminate anyway. Once we have the
>>>> callback, then we can finish tightening up the OMPI layer code.
>>>>
>>>> It is not perfect, but I think it does address the transitive nature
>>>> of the connectivity of MPI processes by relying on the runtime to
>>>> provide uniform notification of failures. I figure that we will need
>>>> to look over this code again and verify that the implementation of
>>>> MPI_Comm_disconnect and associated underpinnings do the 'right thing'
>>>> with regard to updating the communicator structures. But I think that
>>>> is best addressed as a second set of patches.
>>>>
>>>> The goal of this patch is to put back in functionality that was
>>>> commented out during the last reorganization of the errmgr. What will
>>>> likely follow, once we have notification of failure/abort at the OMPI
>>>> level, is a cleanup of the connected-groups code paths.
>>>>
>>>> -- Josh
>>>>
>>>> On Jun 9, 2011, at 6:13 PM, George Bosilca wrote:
>>>>
>>>>> What I'm saying is that there is no reason to have any other type of
>>>>> MPI_Abort if we are not able to compute the set of connected
>>>>> processes.
>>>>>
>>>>> With this RFC the processes on the communicator passed to MPI_Abort
>>>>> will abort. Then the other processes in the same MPI_COMM_WORLD (in
>>>>> fact jobid) will be notified (if we suppose that ORTE will not
>>>>> distinguish between aborted and faulty). As a result the entire
>>>>> MPI_COMM_WORLD will be aborted, if we consider a sane application
>>>>> where everyone uses the same type of error handler. However, this is
>>>>> not enough. We have to distribute the abort signal to every other
>>>>> "connected" process, and I don't see how we can compute this list of
>>>>> connected processes in Open MPI today. It is not that I don't see it
>>>>> in your patch; it is that the definition of connectivity in the MPI
>>>>> standard is transitive and relies heavily on a correct
>>>>> implementation of MPI_Comm_disconnect.
>>>>>
>>>>> george.
>>>>>
>>>>> On Jun 9, 2011, at 16:59 , Josh Hursey wrote:
>>>>>
>>>>>> On Thu, Jun 9, 2011 at 4:47 PM, George Bosilca <bosi...@eecs.utk.edu> wrote:
>>>>>>> If this changes the behavior of MPI_Abort to only abort processes
>>>>>>> on the specified communicator, how does this not affect the
>>>>>>> default user experience (when today it aborts everything)?
>>>>>>
>>>>>> Open MPI does abort everything by default - decided by the runtime
>>>>>> at the moment (but addressed in your RFC). So it does not matter
>>>>>> whether one process aborts or many do, and the behavior of
>>>>>> MPI_Abort experienced by the user will not change. Effectively, the
>>>>>> only change is an extra message in the runtime before the process
>>>>>> actually calls errmgr.abort().
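Aside: the "locally known set" the patch works from is just the
communicator's own group(s). A rough application-level analogue is sketched
below (the helper names are made up; it maps members to MPI_COMM_WORLD
ranks, whereas the internal code works on runtime process names, and it
deliberately stops short of the transitive "connected" set George is asking
about):
--------------
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Sketch: enumerate the group of `comm` (local group, plus the remote
 * group for an intercommunicator) in terms of MPI_COMM_WORLD ranks.
 * Members outside MPI_COMM_WORLD (e.g. spawned jobs) translate to
 * MPI_UNDEFINED - exactly where this local view falls short of the
 * standard's transitive connectivity. */
static void list_members(MPI_Group grp, MPI_Group world_grp, const char *label)
{
    int n, i;
    MPI_Group_size(grp, &n);
    int *in  = malloc(n * sizeof(int));
    int *out = malloc(n * sizeof(int));
    for (i = 0; i < n; i++) in[i] = i;
    MPI_Group_translate_ranks(grp, n, in, world_grp, out);
    for (i = 0; i < n; i++)
        printf("%s group member %d -> world rank %d\n", label, i, out[i]);
    free(in);
    free(out);
}

void print_abort_targets(MPI_Comm comm)
{
    MPI_Group grp, world_grp;
    int is_inter = 0;

    MPI_Comm_group(MPI_COMM_WORLD, &world_grp);

    MPI_Comm_group(comm, &grp);          /* local group */
    list_members(grp, world_grp, "local");
    MPI_Group_free(&grp);

    MPI_Comm_test_inter(comm, &is_inter);
    if (is_inter) {                      /* intercommunicator: remote side too */
        MPI_Comm_remote_group(comm, &grp);
        list_members(grp, world_grp, "remote");
        MPI_Group_free(&grp);
    }

    MPI_Group_free(&world_grp);
}
--------------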
>>>>>>
>>>>>> This branch just makes the implementation complete by first telling
>>>>>> ORTE that a group of processes, defined by the communicator, should
>>>>>> be terminated along with the calling process. Currently ORTE
>>>>>> notices that there was an abort, and terminates the job. Once your
>>>>>> RFC goes through this may no longer be the case, and OMPI can
>>>>>> determine what to do when it receives a process failure
>>>>>> notification.
>>>>>>
>>>>>>> If we accept the fact that MPI_Abort will only abort the processes
>>>>>>> in the current communicator, what happens with the other processes
>>>>>>> in the same MPI_COMM_WORLD (but not in the communicator that has
>>>>>>> been used by MPI_Abort)?
>>>>>>
>>>>>> Currently, ORTE will abort them as well. When your RFC goes
>>>>>> through, the OMPI layer will be notified of the error and can take
>>>>>> the appropriate action, as determined by the MPI standard.
>>>>>>
>>>>>>> What about all the other connected processes (based on the
>>>>>>> connectivity as defined in the MPI standard in Section 10.5.4)? Do
>>>>>>> they see this as a fault?
>>>>>>
>>>>>> They are informed of the fault via the ORTE errmgr callback routine
>>>>>> (that we have an RFC for), and can then take the appropriate action
>>>>>> based on MPI semantics. So we are pushing the decision about the
>>>>>> implication of the fault to the OMPI layer - where it should be.
>>>>>>
>>>>>> The remainder of the OMPI layer logic for MPI_ERRORS_RETURN and
>>>>>> other connected error-management scenarios is not included in this
>>>>>> patch, since that depends on there being a callback to the OMPI
>>>>>> layer - which does not exist just yet. So this is a small patch to
>>>>>> wire in the ORTE piece and allow the OMPI layer to request that a
>>>>>> set of processes be terminated - to more accurately support
>>>>>> MPI_Abort semantics.
>>>>>>
>>>>>> Does that answer your questions?
>>>>>>
>>>>>> -- Josh
>>>>>>
>>>>>>> george.
>>>>>>>
>>>>>>> On Jun 9, 2011, at 16:32 , Josh Hursey wrote:
>>>>>>>
>>>>>>>> WHAT: Fix missing code in MPI_Abort
>>>>>>>>
>>>>>>>> WHY: MPI_Abort is missing logic to ask for termination of the
>>>>>>>> process group defined by the communicator
>>>>>>>>
>>>>>>>> WHERE: Mostly orte/mca/errmgr
>>>>>>>>
>>>>>>>> WHEN: Open MPI trunk
>>>>>>>>
>>>>>>>> TIMEOUT: Tuesday, June 14, 2011 (after teleconf)
>>>>>>>>
>>>>>>>> Details:
>>>>>>>> -------------------------------------------
>>>>>>>> A bitbucket branch is available here (last sync to r24757 of trunk):
>>>>>>>> https://bitbucket.org/jjhursey/ompi-abort/
>>>>>>>>
>>>>>>>> In the MPI Standard (v2.2), Section 8.7 states, after the
>>>>>>>> introduction of MPI_Abort:
>>>>>>>> "This routine makes a best attempt to abort all tasks in the
>>>>>>>> group of comm."
>>>>>>>>
>>>>>>>> Open MPI currently only calls orte_errmgr.abort() to abort the
>>>>>>>> calling process itself. The code to ask for the abort of the
>>>>>>>> other processes in the group defined by the communicator is
>>>>>>>> commented out. Since one process calling abort currently causes
>>>>>>>> all processes in the job to abort, it has not been a big deal.
>>>>>>>> However, as the group starts exploring better resilience in the
>>>>>>>> OMPI layer (with further support from the ORTE layer), this
>>>>>>>> aspect of MPI_Abort will become more necessary to get right.
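Aside: to make the standard's wording concrete, the "group of comm" can be a
strict subset of the job. A minimal test program is sketched below
(illustrative only, not part of the branch; in today's Open MPI the whole
job still terminates, which is exactly why the RFC only changes the set
being requested):
--------------
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Split the world into two halves; `half` is the communicator whose
     * group MPI_Abort is asked to terminate.  Rank 0's half holds the
     * even world ranks. */
    MPI_Comm half;
    MPI_Comm_split(MPI_COMM_WORLD, world_rank % 2, world_rank, &half);

    if (0 == world_rank) {
        /* Per MPI-2.2 Section 8.7, this is a best-effort abort of the
         * processes in `half`'s group, not necessarily of every process
         * in the job.  In the current codebase the whole job terminates
         * anyway. */
        MPI_Abort(half, 1);
    }

    MPI_Comm_free(&half);
    MPI_Finalize();
    return 0;
}
--------------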
>>>>>>>>
>>>>>>>> This branch adds back the logic necessary for a single process
>>>>>>>> calling MPI_Abort to request, from the ORTE errmgr, that a
>>>>>>>> defined subgroup of processes be aborted. Once the request is
>>>>>>>> sent to the HNP, the local process then calls abort on itself.
>>>>>>>> The HNP requests that the defined subgroup of processes be
>>>>>>>> terminated using the existing plm mechanisms for doing so.
>>>>>>>>
>>>>>>>> This change has no effect on the current default user-experienced
>>>>>>>> behavior of MPI_Abort.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Joshua Hursey
>>>>>>>> Postdoctoral Research Associate
>>>>>>>> Oak Ridge National Laboratory
>>>>>>>> http://users.nccs.gov/~jjhursey

--
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey
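Aside: the connectivity defined in MPI-2.2 Section 10.5.4 - and severed by
MPI_Comm_disconnect, whose correctness the thread flags as follow-up work -
is established by the dynamic process operations. A minimal sketch, with a
made-up child executable name ("worker"):
--------------
#include <mpi.h>

void spawn_then_disconnect(void)
{
    MPI_Comm children;

    /* Spawning connects the parent and child jobs: an abort (or fatal
     * error) on either side is permitted to take down the other. */
    MPI_Comm_spawn("worker", MPI_ARGV_NULL, 2, MPI_INFO_NULL,
                   0, MPI_COMM_SELF, &children, MPI_ERRCODES_IGNORE);

    /* ... communicate over the intercommunicator ... */

    /* Disconnect severs that connection (collectively, once pending
     * communication on `children` has completed), so a later MPI_Abort
     * in the child job need not reach the parent.  Whether the runtime
     * honors this is exactly the follow-up work discussed above. */
    MPI_Comm_disconnect(&children);
}
--------------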