Committed in r24775. https://svn.open-mpi.org/trac/ompi/changeset/24775
Sorry for the delay on this, I got sidetracked yesterday.

-- Josh

On Tue, Jun 14, 2011 at 11:36 AM, Josh Hursey <jjhur...@open-mpi.org> wrote:
> Just a reminder for those not on the call that this RFC is scheduled to go in later today.
>
> -- Josh
>
> On Fri, Jun 10, 2011 at 8:53 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> On Jun 10, 2011, at 6:48 AM, Josh Hursey wrote:
>>
>>> Why would this patch result in zombied processes and poor cleanup? When ORTE receives notification of a process terminating/aborting, it triggers the termination of the job (without UTK's RFC), which should ensure a clean shutdown. This patch just tells ORTE that a few other processes should be the first to die, which will trigger the same response in ORTE.
>>>
>>> I guess I'm unclear about this concern, since it should be a concern in the current ORTE as well. I agree that it will be a concern once we have the OMPI layer handling error management triggered off of a callback, but that is a different RFC.
>>
>> My comment was to "the future" - i.e., looking to the point where we get layered, rolling aborts.
>>
>> I agree that this specific RFC won't change the current behavior, and as I said, I have no issue with it.
>>
>>> Something that might help those listening to this thread. The current behavior of MPI_Abort in OMPI results in the semantics of:
>>> --------------
>>> internal_MPI_Abort(MPI_COMM_SELF, exit_code)
>>> --------------
>>> regardless of the communicator actually passed to MPI_Abort at the application level. It should be:
>>> --------------
>>> internal_MPI_Abort(comm_provided, exit_code)
>>> --------------
>>>
>>> Semantically, this patch just makes the group actually being aborted match the communicator provided. In practice, the job will terminate when any process in the job calls abort - so the result (in today's codebase) will be the same.
>>>
>>> -- Josh
>>>
>>> On Fri, Jun 10, 2011 at 7:30 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>> I have no issue with uncommenting the code. However, I do see a future littered with lots of zombied processes and complaints over poor cleanup again....
>>>>
>>>> On Jun 9, 2011, at 6:08 PM, Joshua Hursey wrote:
>>>>
>>>>> Ah, I see what you are getting at now.
>>>>>
>>>>> The construction of the list of connected processes is something I, intentionally, did not modify from the current Open MPI code. The list is calculated based on the locally known set of local and remote process groups attached to the communicator. So this is the set of directly connected processes in the specified communicator known to the calling process at the OMPI level.
>>>>>
>>>>> ORTE is asked to abort this defined set of processes. Once those processes are terminated, ORTE needs to eventually inform all of the processes (in the jobid(s) specified - maybe other jobids too?) that these processes have failed/aborted. Upon notification of the failed/aborted processes, the local process (at the OMPI level) needs to determine if that process loss is critical based upon the error handlers attached to communicators that it shares with the failed/aborted processes. That should be handled in the callback from the errmgr at the OMPI level, since connectedness is an MPI construct. If the process failure/abort is critical to the local process, then upon notification the local process can call abort on the communicator affected.
>>>>>
>>>>> So this has the possibility for a rolling abort effect [the abort of one communicator triggers the abort of another due to MPI_ERRORS_ARE_FATAL]. From there (depending upon the error handlers at the user level), the system will eventually converge to either some stable subset of processes or all processes aborting, resulting in job termination.
>>>>>
>>>>> The rolling abort effect relies heavily upon the ability of the runtime to make sure that all process failures/aborts are eventually known to all alive processes. Since all alive processes will know of the failure/abort, they can then determine if they are transitively affected by the failure based upon the local list of communicators and associated error handlers. But to complete this aspect of the abort procedure, we do need the callback mechanism from the runtime - but since ORTE (today) will kill the job for OMPI, it is not a big deal for end users since the job will terminate anyway. Once we have the callback, then we can finish tightening up the OMPI layer code.
>>>>>
>>>>> It is not perfect, but I think it does address the transitive nature of the connectivity of MPI processes by relying on the runtime to provide uniform notification of failures. I figure that we will need to look over this code again and verify that the implementation of MPI_Comm_disconnect and associated underpinnings do the 'right thing' with regard to updating the communicator structures. But I think that is best addressed as a second set of patches.
>>>>>
>>>>> The goal of this patch is to put back in functionality that was commented out during the last reorganization of the errmgr. What will likely follow, once we have notification of failure/abort at the OMPI level, is a cleanup of the connected groups code paths.
>>>>>
>>>>> -- Josh
>>>>>
>>>>> On Jun 9, 2011, at 6:13 PM, George Bosilca wrote:
>>>>>
>>>>>> What I'm saying is that there is no reason to have any other type of MPI_Abort if we are not able to compute the set of connected processes.
>>>>>>
>>>>>> With this RFC the processes in the communicator passed to MPI_Abort will abort. Then the other processes in the same MPI_COMM_WORLD (in fact, jobid) will be notified (if we suppose that ORTE will not make a distinction between aborted and faulty processes). As a result the entire MPI_COMM_WORLD will be aborted, if we consider a sane application where everyone uses the same type of error handler. However, this is not enough. We have to distribute the abort signal to every other process "connected", and I don't see how we can compute this list of connected processes in Open MPI today. It is not that I don't see it in your patch, it is that the definition of connectivity in the MPI standard is transitive and relies heavily on a correct implementation of MPI_Comm_disconnect.
>>>>>>
>>>>>> george.
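To make the connectivity point above concrete, here is a minimal C sketch - illustrative only, not taken from the patch or the thread - of how the dynamic process operations establish the transitive connectivity defined in Section 10.5.4 and how MPI_Comm_disconnect severs it. It assumes the program can re-launch itself via argv[0].
--------------
/* Illustrative sketch only: MPI_Comm_spawn connects the parent's job to the
 * children's job, so an abort in either job may, per the standard, reach the
 * other one as well. MPI_Comm_disconnect is what removes that connection,
 * which is why computing the full connected set depends on it being
 * implemented correctly. Assumes argv[0] resolves back to this binary. */
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Comm parent, children;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);

    if (MPI_COMM_NULL == parent) {
        /* Parent side: spawning establishes connectivity to the new job. */
        MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 2, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &children, MPI_ERRCODES_IGNORE);
        /* Until both sides disconnect, the two jobs remain "connected"
         * in the Section 10.5.4 sense. */
        MPI_Comm_disconnect(&children);
    } else {
        /* Child side: disconnecting from the parent severs the link. */
        MPI_Comm_disconnect(&parent);
    }

    MPI_Finalize();
    return 0;
}
--------------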
>>>>>>
>>>>>> On Jun 9, 2011, at 16:59, Josh Hursey wrote:
>>>>>>
>>>>>>> On Thu, Jun 9, 2011 at 4:47 PM, George Bosilca <bosi...@eecs.utk.edu> wrote:
>>>>>>>> If this changes the behavior of MPI_Abort to only abort the processes on the specified communicator, how does this not affect the default user experience (when today it aborts everything)?
>>>>>>>
>>>>>>> Open MPI does abort everything by default - decided by the runtime at the moment (but addressed in your RFC). So it does not matter if one process aborts or if many do. So the behavior of MPI_Abort experienced by the user will not change. Effectively the only change is an extra message in the runtime before the process actually calls errmgr.abort().
>>>>>>>
>>>>>>> This branch just makes the implementation complete by first telling ORTE that a group of processes, defined by the communicator, should be terminated along with the calling process. Currently ORTE notices that there was an abort, and terminates the job. Once your RFC goes through, this may no longer be the case, and OMPI can determine what to do when it receives a process failure notification.
>>>>>>>
>>>>>>>> If we accept the fact that MPI_Abort will only abort the processes in the current communicator, what happens to the other processes in the same MPI_COMM_WORLD (but not in the communicator that has been used by MPI_Abort)?
>>>>>>>
>>>>>>> Currently, ORTE will abort them as well. When your RFC goes through, the OMPI layer will be notified of the error and can take the appropriate action, as determined by the MPI standard.
>>>>>>>
>>>>>>>> What about all the other connected processes (based on the connectivity as defined in the MPI standard in Section 10.5.4)? Do they see this as a fault?
>>>>>>>
>>>>>>> They are informed of the fault via the ORTE errmgr callback routine (that we have an RFC for), and can then take the appropriate action based on MPI semantics. So we are pushing the decision about the implications of the fault to the OMPI layer - where it should be.
>>>>>>>
>>>>>>> The remainder of the OMPI layer logic for MPI_ERRORS_RETURN and other connected error management scenarios is not included in this patch, since that depends on there being a callback to the OMPI layer - which does not exist just yet. So this is a small patch to wire in the ORTE piece that allows the OMPI layer to request that a set of processes be terminated - to more accurately support MPI_Abort semantics.
>>>>>>>
>>>>>>> Does that answer your questions?
>>>>>>>
>>>>>>> -- Josh
>>>>>>>
>>>>>>>> george.
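As a rough illustration of the error-handler point above, here is a minimal sketch, not from the patch, under plain MPI-2.2 assumptions where error handlers fire on errors raised by MPI calls; extending this mechanism to peer-failure notifications is exactly the OMPI-layer work being discussed.
--------------
/* Illustrative sketch only. The default handler, MPI_ERRORS_ARE_FATAL, aborts
 * the processes in the communicator when an error is raised; with
 * MPI_ERRORS_RETURN the error code comes back to the caller, which can then
 * decide what to do - the kind of per-communicator decision the OMPI layer
 * would make once it receives failure callbacks from the runtime. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, rc;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Ask MPI to hand errors on MPI_COMM_WORLD back to us instead of aborting. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    rc = MPI_Barrier(MPI_COMM_WORLD);
    if (MPI_SUCCESS != rc) {
        /* With MPI_ERRORS_RETURN we can handle (or escalate) the error locally. */
        fprintf(stderr, "rank %d: barrier failed with code %d\n", rank, rc);
    }

    MPI_Finalize();
    return 0;
}
--------------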
>>>>>>>>
>>>>>>>> On Jun 9, 2011, at 16:32, Josh Hursey wrote:
>>>>>>>>
>>>>>>>>> WHAT: Fix missing code in MPI_Abort
>>>>>>>>>
>>>>>>>>> WHY: MPI_Abort is missing logic to ask for termination of the process group defined by the communicator
>>>>>>>>>
>>>>>>>>> WHERE: Mostly orte/mca/errmgr
>>>>>>>>>
>>>>>>>>> WHEN: Open MPI trunk
>>>>>>>>>
>>>>>>>>> TIMEOUT: Tuesday, June 14, 2011 (after teleconf)
>>>>>>>>>
>>>>>>>>> Details:
>>>>>>>>> -------------------------------------------
>>>>>>>>> A bitbucket branch is available here (last sync to r24757 of trunk):
>>>>>>>>> https://bitbucket.org/jjhursey/ompi-abort/
>>>>>>>>>
>>>>>>>>> In the MPI Standard (v2.2), Section 8.7, after the introduction of MPI_Abort, it states:
>>>>>>>>> "This routine makes a best attempt to abort all tasks in the group of comm."
>>>>>>>>>
>>>>>>>>> Open MPI currently only calls orte_errmgr.abort() to abort the calling process itself. The code to ask for the abort of the other processes in the group defined by the communicator is commented out. Since one process calling abort currently causes all processes in the job to abort, it has not been a big deal. However, as the group starts exploring better resilience in the OMPI layer (with further support from the ORTE layer), this aspect of MPI_Abort will become more important to get right.
>>>>>>>>>
>>>>>>>>> This branch adds back the logic necessary for a single process calling MPI_Abort to request, from the ORTE errmgr, that a defined subgroup of processes be aborted. Once the request is sent to the HNP, the local process then calls abort on itself. The HNP requests that the defined subgroup of processes be terminated using the existing plm mechanisms for doing so.
>>>>>>>>>
>>>>>>>>> This change has no effect on the current default user-visible behavior of MPI_Abort.
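To illustrate the quoted Section 8.7 sentence from an application's point of view, a minimal sketch, not part of the RFC branch: with the restored logic, the abort request names the group of the communicator handed to MPI_Abort (the even ranks below), although today's runtime still ends up terminating the whole job.
--------------
/* Illustrative sketch only: what "abort all tasks in the group of comm"
 * means for the application. With the RFC's semantics, only the processes
 * in 'subcomm' (even ranks here) need to be aborted; in the current code
 * base the whole job is terminated regardless. */
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank;
    MPI_Comm subcomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Split MPI_COMM_WORLD into two groups: even ranks and odd ranks. */
    MPI_Comm_split(MPI_COMM_WORLD, rank % 2, rank, &subcomm);

    if (0 == rank) {
        /* Per MPI-2.2 Section 8.7, this is a best-effort abort of the
         * tasks in the group of 'subcomm', i.e. the even ranks. */
        MPI_Abort(subcomm, 1);
    }

    MPI_Comm_free(&subcomm);
    MPI_Finalize();
    return 0;
}
--------------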
--
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey