Just a reminder for those not on the call that this RFC is scheduled to go in later today.
-- Josh

On Fri, Jun 10, 2011 at 8:53 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
> On Jun 10, 2011, at 6:48 AM, Josh Hursey wrote:
>
>> Why would this patch result in zombied processes and poor cleanup?
>> When ORTE receives notification of a process terminating/aborting, it
>> triggers the termination of the job (without UTK's RFC), which should
>> ensure a clean shutdown. This patch just tells ORTE that a few other
>> processes should be the first to die, which will trigger the same
>> response in ORTE.
>>
>> I guess I'm unclear about this concern, since it would be a concern in
>> the current ORTE as well. I agree that it will be a concern once we
>> have the OMPI layer handling error management triggered off of a
>> callback, but that is a different RFC.
>
> My comment was to "the future" - i.e., looking to the point where we get
> layered, rolling aborts.
>
> I agree that this specific RFC won't change the current behavior, and as
> I said, I have no issue with it.
>
>> Something that might help those listening to this thread: the current
>> behavior of MPI_Abort in OMPI results in the semantics of:
>> --------------
>> internal_MPI_Abort(MPI_COMM_SELF, exit_code)
>> --------------
>> regardless of the communicator actually passed to MPI_Abort at the
>> application level. It should be:
>> --------------
>> internal_MPI_Abort(comm_provided, exit_code)
>> --------------
>>
>> Semantically, this patch just makes the group actually being aborted
>> match the communicator provided. In practice, the job will terminate
>> when any process in the job calls abort - so the result (in today's
>> codebase) will be the same.
>>
>> -- Josh
>>
>> On Fri, Jun 10, 2011 at 7:30 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>> I have no issue with uncommenting the code. However, I do see a future
>>> littered with lots of zombied processes and complaints over poor
>>> cleanup again....
>>>
>>> On Jun 9, 2011, at 6:08 PM, Joshua Hursey wrote:
>>>
>>>> Ah, I see what you are getting at now.
>>>>
>>>> The construction of the list of connected processes is something I,
>>>> intentionally, did not modify from the current Open MPI code. The
>>>> list is calculated based on the locally known set of local and remote
>>>> process groups attached to the communicator. So this is the set of
>>>> directly connected processes in the specified communicator known to
>>>> the calling process at the OMPI level.
>>>>
>>>> ORTE is asked to abort this defined set of processes. Once those
>>>> processes are terminated, ORTE needs to eventually inform all of the
>>>> processes (in the jobid(s) specified - maybe other jobids too?) that
>>>> these processes have failed/aborted. Upon notification of the
>>>> failed/aborted processes, the local process (at the OMPI level) needs
>>>> to determine whether that process loss is critical, based upon the
>>>> error handlers attached to communicators that it shares with the
>>>> failed/aborted processes. That should be handled in the callback from
>>>> the errmgr at the OMPI level, since connectedness is an MPI
>>>> construct. If the process failure/abort is critical to the local
>>>> process, then upon notification the local process can call abort on
>>>> the affected communicator.
>>>>
>>>> So this has the possibility for a rolling abort effect [the abort of
>>>> one communicator triggers the abort of another due to
>>>> MPI_ERRORS_ARE_FATAL], from which (depending upon the error handlers
>>>> at the user level) the system will eventually converge to either some
>>>> stable subset of processes, or all processes aborting, resulting in
>>>> job termination.
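Aside, for those following along: the error-handler distinction that the
rolling-abort scenario above hinges on looks roughly like this at the
application level. This is a minimal sketch, not code from the patch; the
function name check_peers and the shared_comm argument are made up, and
whether a peer failure is actually reported back (rather than the runtime
killing the whole job, as it does today) depends on the callback work this
thread discusses.
--------------
#include <mpi.h>
#include <stdio.h>

/* Sketch: whether the loss of a peer is "critical" to this process is,
 * in MPI terms, governed by the error handler attached to the
 * communicators it shares with that peer. */
void check_peers(MPI_Comm shared_comm)
{
    /* MPI_ERRORS_ARE_FATAL (the default) gives the rolling-abort case:
     * a reported failure on shared_comm aborts this process as well.
     * MPI_ERRORS_RETURN hands the error back to the application. */
    MPI_Comm_set_errhandler(shared_comm, MPI_ERRORS_RETURN);

    int rc = MPI_Barrier(shared_comm);
    if (MPI_SUCCESS != rc) {
        char msg[MPI_MAX_ERROR_STRING];
        int len = 0;
        MPI_Error_string(rc, msg, &len);
        fprintf(stderr, "error reported on shared_comm: %s\n", msg);
        /* Application-level choice: attempt recovery, or escalate with
         * MPI_Abort(shared_comm, rc). */
    }
}
--------------
The OMPI-level decision Joshua describes is roughly the internal analogue of
this: walk the communicators shared with the failed/aborted process and
check which error handler each one carries.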
>>>> The rolling abort effect relies heavily upon the ability of the
>>>> runtime to make sure that all process failures/aborts are eventually
>>>> known to all alive processes. Since every alive process will know of
>>>> the failure/abort, it can then determine whether it is transitively
>>>> affected by the failure, based upon its local list of communicators
>>>> and associated error handlers. But to complete this aspect of the
>>>> abort procedure, we do need the callback mechanism from the runtime -
>>>> though since ORTE (today) will kill the job for OMPI, it is not a big
>>>> deal for end users: the job will terminate anyway. Once we have the
>>>> callback, then we can finish tightening up the OMPI layer code.
>>>>
>>>> It is not perfect, but I think it does address the transitive nature
>>>> of the connectivity of MPI processes by relying on the runtime to
>>>> provide uniform notification of failures. I figure that we will need
>>>> to look over this code again and verify that the implementation of
>>>> MPI_Comm_disconnect and associated underpinnings do the 'right thing'
>>>> with regard to updating the communicator structures. But I think that
>>>> is best addressed as a second set of patches.
>>>>
>>>> The goal of this patch is to put back in functionality that was
>>>> commented out during the last reorganization of the errmgr. What will
>>>> likely follow, once we have notification of failure/abort at the OMPI
>>>> level, is a cleanup of the connected-groups code paths.
>>>>
>>>> -- Josh
>>>>
>>>> On Jun 9, 2011, at 6:13 PM, George Bosilca wrote:
>>>>
>>>>> What I'm saying is that there is no reason to have any other type of
>>>>> MPI_Abort if we are not able to compute the set of connected
>>>>> processes.
>>>>>
>>>>> With this RFC the processes on the communicator passed to MPI_Abort
>>>>> will abort. Then the other processes in the same MPI_COMM_WORLD (in
>>>>> fact jobid) will be notified (if we suppose that ORTE will not
>>>>> distinguish between aborted and faulty). As a result the entire
>>>>> MPI_COMM_WORLD will be aborted, if we consider a sane application
>>>>> where everyone uses the same type of error handler. However, this is
>>>>> not enough. We have to distribute the abort signal to every other
>>>>> "connected" process, and I don't see how we can compute this list of
>>>>> connected processes in Open MPI today. It is not that I don't see it
>>>>> in your patch; it is that the definition of connectivity in the MPI
>>>>> standard is transitive and relies heavily on a correct
>>>>> implementation of MPI_Comm_disconnect.
>>>>>
>>>>> george.
>>>>>
>>>>> On Jun 9, 2011, at 16:59 , Josh Hursey wrote:
>>>>>
>>>>>> On Thu, Jun 9, 2011 at 4:47 PM, George Bosilca <bosi...@eecs.utk.edu> wrote:
>>>>>>> If this changes the behavior of MPI_Abort to only abort processes
>>>>>>> on the specified communicator, how does this not affect the
>>>>>>> default user experience (when today it aborts everything)?
>>>>>>
>>>>>> Open MPI does abort everything by default - decided by the runtime
>>>>>> at the moment (but addressed in your RFC). So it does not matter
>>>>>> whether one process aborts or many do, and the behavior of
>>>>>> MPI_Abort experienced by the user will not change. Effectively, the
>>>>>> only change is an extra message in the runtime before the process
>>>>>> actually calls errmgr.abort().
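Aside: the "locally known set" the patch works from is just the
communicator's own group(s). A rough application-level analogue is sketched
below (the helper names are made up; it maps members to MPI_COMM_WORLD
ranks, whereas the internal code works on runtime process names, and it
deliberately stops short of the transitive "connected" set George is asking
about):
--------------
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Sketch: enumerate the group of `comm` (local group, plus the remote
 * group for an intercommunicator) in terms of MPI_COMM_WORLD ranks.
 * Members outside MPI_COMM_WORLD (e.g. spawned jobs) translate to
 * MPI_UNDEFINED - exactly where this local view falls short of the
 * standard's transitive connectivity. */
static void list_members(MPI_Group grp, MPI_Group world_grp, const char *label)
{
    int n, i;
    MPI_Group_size(grp, &n);
    int *in  = malloc(n * sizeof(int));
    int *out = malloc(n * sizeof(int));
    for (i = 0; i < n; i++) in[i] = i;
    MPI_Group_translate_ranks(grp, n, in, world_grp, out);
    for (i = 0; i < n; i++)
        printf("%s group member %d -> world rank %d\n", label, i, out[i]);
    free(in);
    free(out);
}

void print_abort_targets(MPI_Comm comm)
{
    MPI_Group grp, world_grp;
    int is_inter = 0;

    MPI_Comm_group(MPI_COMM_WORLD, &world_grp);

    MPI_Comm_group(comm, &grp);          /* local group */
    list_members(grp, world_grp, "local");
    MPI_Group_free(&grp);

    MPI_Comm_test_inter(comm, &is_inter);
    if (is_inter) {                      /* intercommunicator: remote side too */
        MPI_Comm_remote_group(comm, &grp);
        list_members(grp, world_grp, "remote");
        MPI_Group_free(&grp);
    }

    MPI_Group_free(&world_grp);
}
--------------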
>>>>>>
>>>>>> This branch just makes the implementation complete by first telling
>>>>>> ORTE that a group of processes, defined by the communicator, should
>>>>>> be terminated along with the calling process. Currently ORTE
>>>>>> notices that there was an abort, and terminates the job. Once your
>>>>>> RFC goes through this may no longer be the case, and OMPI can
>>>>>> determine what to do when it receives a process failure
>>>>>> notification.
>>>>>>
>>>>>>> If we accept the fact that MPI_Abort will only abort the processes
>>>>>>> in the current communicator, what happens with the other processes
>>>>>>> in the same MPI_COMM_WORLD (but not in the communicator that has
>>>>>>> been used by MPI_Abort)?
>>>>>>
>>>>>> Currently, ORTE will abort them as well. When your RFC goes
>>>>>> through, the OMPI layer will be notified of the error and can take
>>>>>> the appropriate action, as determined by the MPI standard.
>>>>>>
>>>>>>> What about all the other connected processes (based on the
>>>>>>> connectivity as defined in the MPI standard in Section 10.5.4)? Do
>>>>>>> they see this as a fault?
>>>>>>
>>>>>> They are informed of the fault via the ORTE errmgr callback routine
>>>>>> (that we have an RFC for), and can then take the appropriate action
>>>>>> based on MPI semantics. So we are pushing the decision about the
>>>>>> implication of the fault to the OMPI layer - where it should be.
>>>>>>
>>>>>> The remainder of the OMPI layer logic for MPI_ERRORS_RETURN and
>>>>>> other connected error-management scenarios is not included in this
>>>>>> patch, since that depends on there being a callback to the OMPI
>>>>>> layer - which does not exist just yet. So this is a small patch to
>>>>>> wire in the ORTE piece and allow the OMPI layer to request that a
>>>>>> set of processes be terminated - to more accurately support
>>>>>> MPI_Abort semantics.
>>>>>>
>>>>>> Does that answer your questions?
>>>>>>
>>>>>> -- Josh
>>>>>>
>>>>>>> george.
>>>>>>>
>>>>>>> On Jun 9, 2011, at 16:32 , Josh Hursey wrote:
>>>>>>>
>>>>>>>> WHAT: Fix missing code in MPI_Abort
>>>>>>>>
>>>>>>>> WHY: MPI_Abort is missing logic to ask for termination of the
>>>>>>>> process group defined by the communicator
>>>>>>>>
>>>>>>>> WHERE: Mostly orte/mca/errmgr
>>>>>>>>
>>>>>>>> WHEN: Open MPI trunk
>>>>>>>>
>>>>>>>> TIMEOUT: Tuesday, June 14, 2011 (after teleconf)
>>>>>>>>
>>>>>>>> Details:
>>>>>>>> -------------------------------------------
>>>>>>>> A bitbucket branch is available here (last sync to r24757 of trunk):
>>>>>>>> https://bitbucket.org/jjhursey/ompi-abort/
>>>>>>>>
>>>>>>>> In the MPI Standard (v2.2), Section 8.7 states, after the
>>>>>>>> introduction of MPI_Abort:
>>>>>>>> "This routine makes a best attempt to abort all tasks in the
>>>>>>>> group of comm."
>>>>>>>>
>>>>>>>> Open MPI currently only calls orte_errmgr.abort() to abort the
>>>>>>>> calling process itself. The code to ask for the abort of the
>>>>>>>> other processes in the group defined by the communicator is
>>>>>>>> commented out. Since one process calling abort currently causes
>>>>>>>> all processes in the job to abort, it has not been a big deal.
>>>>>>>> However, as the group starts exploring better resilience in the
>>>>>>>> OMPI layer (with further support from the ORTE layer), this
>>>>>>>> aspect of MPI_Abort will become more necessary to get right.
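Aside: to make the standard's wording concrete, the "group of comm" can be a
strict subset of the job. A minimal test program is sketched below
(illustrative only, not part of the branch; in today's Open MPI the whole
job still terminates, which is exactly why the RFC only changes the set
being requested):
--------------
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Split the world into two halves; `half` is the communicator whose
     * group MPI_Abort is asked to terminate.  Rank 0's half holds the
     * even world ranks. */
    MPI_Comm half;
    MPI_Comm_split(MPI_COMM_WORLD, world_rank % 2, world_rank, &half);

    if (0 == world_rank) {
        /* Per MPI-2.2 Section 8.7, this is a best-effort abort of the
         * processes in `half`'s group, not necessarily of every process
         * in the job.  In the current codebase the whole job terminates
         * anyway. */
        MPI_Abort(half, 1);
    }

    MPI_Comm_free(&half);
    MPI_Finalize();
    return 0;
}
--------------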
>>>>>>>>
>>>>>>>> This branch adds back the logic necessary for a single process
>>>>>>>> calling MPI_Abort to request, from the ORTE errmgr, that a
>>>>>>>> defined subgroup of processes be aborted. Once the request is
>>>>>>>> sent to the HNP, the local process then calls abort on itself.
>>>>>>>> The HNP requests that the defined subgroup of processes be
>>>>>>>> terminated using the existing plm mechanisms for doing so.
>>>>>>>>
>>>>>>>> This change has no effect on the current default user-experienced
>>>>>>>> behavior of MPI_Abort.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Joshua Hursey
>>>>>>>> Postdoctoral Research Associate
>>>>>>>> Oak Ridge National Laboratory
>>>>>>>> http://users.nccs.gov/~jjhursey

--
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey
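Aside: the connectivity defined in MPI-2.2 Section 10.5.4 - and severed by
MPI_Comm_disconnect, whose correctness the thread flags as follow-up work -
is established by the dynamic process operations. A minimal sketch, with a
made-up child executable name ("worker"):
--------------
#include <mpi.h>

void spawn_then_disconnect(void)
{
    MPI_Comm children;

    /* Spawning connects the parent and child jobs: an abort (or fatal
     * error) on either side is permitted to take down the other. */
    MPI_Comm_spawn("worker", MPI_ARGV_NULL, 2, MPI_INFO_NULL,
                   0, MPI_COMM_SELF, &children, MPI_ERRCODES_IGNORE);

    /* ... communicate over the intercommunicator ... */

    /* Disconnect severs that connection (collectively, once pending
     * communication on `children` has completed), so a later MPI_Abort
     * in the child job need not reach the parent.  Whether the runtime
     * honors this is exactly the follow-up work discussed above. */
    MPI_Comm_disconnect(&children);
}
--------------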