Committed in r24775. https://svn.open-mpi.org/trac/ompi/changeset/24775
Sorry for the delay on this, I got sidetracked yesterday.

-- Josh

On Tue, Jun 14, 2011 at 11:36 AM, Josh Hursey <jjhur...@open-mpi.org> wrote:
> Just a reminder for those not on the call that this RFC is scheduled to go in later today.
>
> -- Josh
>
> On Fri, Jun 10, 2011 at 8:53 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> On Jun 10, 2011, at 6:48 AM, Josh Hursey wrote:
>>
>>> Why would this patch result in zombied processes and poor cleanup? When ORTE receives notification of a process terminating/aborting, it triggers the termination of the job (without UTK's RFC), which should ensure a clean shutdown. This patch just tells ORTE that a few other processes should be the first to die, which will trigger the same response in ORTE.
>>>
>>> I guess I'm unclear about this concern, since it should be a concern in the current ORTE as well. I agree that it will be a concern once we have the OMPI layer handling error management triggered off of a callback, but that is a different RFC.
>>
>> My comment was to "the future" - i.e., looking to the point where we get layered, rolling aborts.
>>
>> I agree that this specific RFC won't change the current behavior, and as I said, I have no issue with it.
>>
>>> Something that might help those listening to this thread. The current behavior of MPI_Abort in OMPI results in the semantics of:
>>> --------------
>>> internal_MPI_Abort(MPI_COMM_SELF, exit_code)
>>> --------------
>>> regardless of the communicator actually passed to MPI_Abort at the application level. It should be:
>>> --------------
>>> internal_MPI_Abort(comm_provided, exit_code)
>>> --------------
>>>
>>> Semantically, this patch just makes the group actually being aborted match the communicator provided. In practice, the job will terminate when any process in the job calls abort - so the result (in today's codebase) will be the same.
>>>
>>> -- Josh
>>>
>>> On Fri, Jun 10, 2011 at 7:30 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>> I have no issue with uncommenting the code. However, I do see a future littered with lots of zombied processes and complaints over poor cleanup again....
>>>>
>>>> On Jun 9, 2011, at 6:08 PM, Joshua Hursey wrote:
>>>>
>>>>> Ah, I see what you are getting at now.
>>>>>
>>>>> The construction of the list of connected processes is something I, intentionally, did not modify from the current Open MPI code. The list is calculated based on the locally known set of local and remote process groups attached to the communicator. So this is the set of directly connected processes in the specified communicator known to the calling process at the OMPI level.
>>>>>
>>>>> ORTE is asked to abort this defined set of processes. Once those processes are terminated, ORTE needs to eventually inform all of the processes (in the jobid(s) specified - maybe other jobids too?) that these processes have failed/aborted. Upon notification of the failed/aborted processes, the local process (at the OMPI level) needs to determine if that process loss is critical based upon the error handlers attached to communicators that it shares with the failed/aborted processes. That should be handled in the callback from the errmgr at the OMPI level, since connectedness is an MPI construct. If the process failure/abort is critical to the local process, then upon notification the local process can call abort on the communicator affected.
>>>>>
>>>>> So this has the possibility for a rolling abort effect [the abort of one communicator triggers the abort of another due to MPI_ERRORS_ARE_FATAL]. From there (depending upon the error handlers at the user level), the system will eventually converge to either some stable subset of processes or all processes aborting, resulting in job termination.
>>>>>
>>>>> The rolling abort effect relies heavily upon the ability of the runtime to make sure that all process failures/aborts are eventually known to all alive processes. Since all alive processes will know of the failure/abort, they can then determine if they are transitively affected by the failure based upon the local list of communicators and associated error handlers. But to complete this aspect of the abort procedure, we do need the callback mechanism from the runtime - but since ORTE (today) will kill the job for OMPI, it is not a big deal for end users since the job will terminate anyway. Once we have the callback, then we can finish tightening up the OMPI layer code.
>>>>>
>>>>> It is not perfect, but I think it does address the transitive nature of the connectivity of MPI processes by relying on the runtime to provide uniform notification of failures. I figure that we will need to look over this code again and verify that the implementation of MPI_Comm_disconnect and associated underpinnings do the 'right thing' with regard to updating the communicator structures. But I think that is best addressed as a second set of patches.
>>>>>
>>>>> The goal of this patch is to put back in functionality that was commented out during the last reorganization of the errmgr. What will likely follow, once we have notification of failure/abort at the OMPI level, is a cleanup of the connected groups code paths.
>>>>>
>>>>> -- Josh
>>>>>
>>>>> On Jun 9, 2011, at 6:13 PM, George Bosilca wrote:
>>>>>
>>>>>> What I'm saying is that there is no reason to have any other type of MPI_Abort if we are not able to compute the set of connected processes.
>>>>>>
>>>>>> With this RFC the processes in the communicator passed to MPI_Abort will abort. Then the other processes in the same MPI_COMM_WORLD (in fact, jobid) will be notified (if we suppose that ORTE will not make a distinction between aborted and faulty processes). As a result the entire MPI_COMM_WORLD will be aborted, if we consider a sane application where everyone uses the same type of error handler. However, this is not enough. We have to distribute the abort signal to every other process "connected", and I don't see how we can compute this list of connected processes in Open MPI today. It is not that I don't see it in your patch, it is that the definition of connectivity in the MPI standard is transitive and relies heavily on a correct implementation of MPI_Comm_disconnect.
>>>>>>
>>>>>> george.
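To make the connectivity point above concrete, here is a minimal C sketch - illustrative only, not taken from the patch or the thread - of how the dynamic process operations establish the transitive connectivity defined in Section 10.5.4 and how MPI_Comm_disconnect severs it. It assumes the program can re-launch itself via argv[0].
--------------
/* Illustrative sketch only: MPI_Comm_spawn connects the parent's job to the
 * children's job, so an abort in either job may, per the standard, reach the
 * other one as well. MPI_Comm_disconnect is what removes that connection,
 * which is why computing the full connected set depends on it being
 * implemented correctly. Assumes argv[0] resolves back to this binary. */
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Comm parent, children;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);

    if (MPI_COMM_NULL == parent) {
        /* Parent side: spawning establishes connectivity to the new job. */
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 2, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &children, MPI_ERRCODES_IGNORE);
        /* Until both sides disconnect, the two jobs remain "connected"
         * in the Section 10.5.4 sense. */
        MPI_Comm_disconnect(&children);
    } else {
        /* Child side: disconnecting from the parent severs the link. */
        MPI_Comm_disconnect(&parent);
    }

    MPI_Finalize();
    return 0;
}
--------------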
>>>>>>
>>>>>> On Jun 9, 2011, at 16:59, Josh Hursey wrote:
>>>>>>
>>>>>>> On Thu, Jun 9, 2011 at 4:47 PM, George Bosilca <bosi...@eecs.utk.edu> wrote:
>>>>>>>> If this changes the behavior of MPI_Abort to only abort the processes on the specified communicator, how does this not affect the default user experience (when today it aborts everything)?
>>>>>>>
>>>>>>> Open MPI does abort everything by default - decided by the runtime at the moment (but addressed in your RFC). So it does not matter if one process aborts or if many do. So the behavior of MPI_Abort experienced by the user will not change. Effectively the only change is an extra message in the runtime before the process actually calls errmgr.abort().
>>>>>>>
>>>>>>> This branch just makes the implementation complete by first telling ORTE that a group of processes, defined by the communicator, should be terminated along with the calling process. Currently ORTE notices that there was an abort, and terminates the job. Once your RFC goes through, this may no longer be the case, and OMPI can determine what to do when it receives a process failure notification.
>>>>>>>
>>>>>>>> If we accept the fact that MPI_Abort will only abort the processes in the current communicator, what happens to the other processes in the same MPI_COMM_WORLD (but not in the communicator that has been used by MPI_Abort)?
>>>>>>>
>>>>>>> Currently, ORTE will abort them as well. When your RFC goes through, the OMPI layer will be notified of the error and can take the appropriate action, as determined by the MPI standard.
>>>>>>>
>>>>>>>> What about all the other connected processes (based on the connectivity as defined in the MPI standard in Section 10.5.4)? Do they see this as a fault?
>>>>>>>
>>>>>>> They are informed of the fault via the ORTE errmgr callback routine (that we have an RFC for), and can then take the appropriate action based on MPI semantics. So we are pushing the decision about the implications of the fault to the OMPI layer - where it should be.
>>>>>>>
>>>>>>> The remainder of the OMPI layer logic for MPI_ERRORS_RETURN and other connected error management scenarios is not included in this patch, since that depends on there being a callback to the OMPI layer - which does not exist just yet. So this is a small patch to wire in the ORTE piece that allows the OMPI layer to request that a set of processes be terminated - to more accurately support MPI_Abort semantics.
>>>>>>>
>>>>>>> Does that answer your questions?
>>>>>>>
>>>>>>> -- Josh
>>>>>>>
>>>>>>>> george.
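As a rough illustration of the error-handler point above, here is a minimal sketch, not from the patch, under plain MPI-2.2 assumptions where error handlers fire on errors raised by MPI calls; extending this mechanism to peer-failure notifications is exactly the OMPI-layer work being discussed.
--------------
/* Illustrative sketch only. The default handler, MPI_ERRORS_ARE_FATAL, aborts
 * the processes in the communicator when an error is raised; with
 * MPI_ERRORS_RETURN the error code comes back to the caller, which can then
 * decide what to do - the kind of per-communicator decision the OMPI layer
 * would make once it receives failure callbacks from the runtime. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, rc;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Ask MPI to hand errors on MPI_COMM_WORLD back to us instead of aborting. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    rc = MPI_Barrier(MPI_COMM_WORLD);
    if (MPI_SUCCESS != rc) {
        /* With MPI_ERRORS_RETURN we can handle (or escalate) the error locally. */
        fprintf(stderr, "rank %d: barrier failed with code %d\n", rank, rc);
    }

    MPI_Finalize();
    return 0;
}
--------------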
>>>>>>>>
>>>>>>>> On Jun 9, 2011, at 16:32, Josh Hursey wrote:
>>>>>>>>
>>>>>>>>> WHAT: Fix missing code in MPI_Abort
>>>>>>>>>
>>>>>>>>> WHY: MPI_Abort is missing logic to ask for termination of the process group defined by the communicator
>>>>>>>>>
>>>>>>>>> WHERE: Mostly orte/mca/errmgr
>>>>>>>>>
>>>>>>>>> WHEN: Open MPI trunk
>>>>>>>>>
>>>>>>>>> TIMEOUT: Tuesday, June 14, 2011 (after teleconf)
>>>>>>>>>
>>>>>>>>> Details:
>>>>>>>>> -------------------------------------------
>>>>>>>>> A bitbucket branch is available here (last sync to r24757 of trunk):
>>>>>>>>> https://bitbucket.org/jjhursey/ompi-abort/
>>>>>>>>>
>>>>>>>>> In the MPI Standard (v2.2), Section 8.7, after the introduction of MPI_Abort, it states:
>>>>>>>>> "This routine makes a best attempt to abort all tasks in the group of comm."
>>>>>>>>>
>>>>>>>>> Open MPI currently only calls orte_errmgr.abort() to abort the calling process itself. The code to ask for the abort of the other processes in the group defined by the communicator is commented out. Since one process calling abort currently causes all processes in the job to abort, it has not been a big deal. However, as the group starts exploring better resilience in the OMPI layer (with further support from the ORTE layer), this aspect of MPI_Abort will become more important to get right.
>>>>>>>>>
>>>>>>>>> This branch adds back the logic necessary for a single process calling MPI_Abort to request, from the ORTE errmgr, that a defined subgroup of processes be aborted. Once the request is sent to the HNP, the local process then calls abort on itself. The HNP requests that the defined subgroup of processes be terminated using the existing plm mechanisms for doing so.
>>>>>>>>>
>>>>>>>>> This change has no effect on the current default user-visible behavior of MPI_Abort.
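To illustrate the quoted Section 8.7 sentence from an application's point of view, a minimal sketch, not part of the RFC branch: with the restored logic, the abort request names the group of the communicator handed to MPI_Abort (the even ranks below), although today's runtime still ends up terminating the whole job.
--------------
/* Illustrative sketch only: what "abort all tasks in the group of comm"
 * means for the application. With the RFC's semantics, only the processes
 * in 'subcomm' (even ranks here) need to be aborted; in the current code
 * base the whole job is terminated regardless. */
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank;
    MPI_Comm subcomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Split MPI_COMM_WORLD into two groups: even ranks and odd ranks. */
    MPI_Comm_split(MPI_COMM_WORLD, rank % 2, rank, &subcomm);

    if (0 == rank) {
        /* Per MPI-2.2 Section 8.7, this is a best-effort abort of the
         * tasks in the group of 'subcomm', i.e. the even ranks. */
        MPI_Abort(subcomm, 1);
    }

    MPI_Comm_free(&subcomm);
    MPI_Finalize();
    return 0;
}
--------------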
--
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey