I have no issue with uncommenting the code. However, I do see a future littered 
with lots of zombie processes and complaints about poor cleanup again....


On Jun 9, 2011, at 6:08 PM, Joshua Hursey wrote:

> Ah I see what you are getting at now.
> 
> The construction of the list of connected processes is something I 
> intentionally did not modify from the current Open MPI code. The list is 
> calculated from the locally known local and remote process groups attached 
> to the communicator. So this is the set of directly connected processes in 
> the specified communicator known to the calling process at the OMPI level.
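> 
> To make that concrete, here is a minimal sketch at the standard MPI API 
> level (not the OMPI internals this patch touches) of the group that counts 
> as "directly connected" through a communicator: the local group, plus the 
> remote group when the communicator is an intercommunicator.
> 
> #include <mpi.h>
> #include <stdio.h>
> 
> /* Count the processes reachable directly through 'comm': its local
>  * group, plus the remote group for an intercommunicator. */
> static void print_connected_count(MPI_Comm comm)
> {
>     MPI_Group local, all;
>     int is_inter = 0, size = 0;
> 
>     MPI_Comm_test_inter(comm, &is_inter);
>     MPI_Comm_group(comm, &local);
>     if (is_inter) {
>         MPI_Group remote;
>         MPI_Comm_remote_group(comm, &remote);
>         MPI_Group_union(local, remote, &all);
>         MPI_Group_free(&remote);
>     } else {
>         all = local;
>     }
>     MPI_Group_size(all, &size);
>     printf("directly connected via this communicator: %d processes\n", size);
>     if (all != local) MPI_Group_free(&all);
>     MPI_Group_free(&local);
> }
> 
> int main(int argc, char **argv)
> {
>     MPI_Init(&argc, &argv);
>     print_connected_count(MPI_COMM_WORLD);
>     MPI_Finalize();
>     return 0;
> }
> 
> The patch computes the same set from the OMPI-internal group structures 
> rather than through the MPI API, but the set is the same.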
> 
> ORTE is asked to abort this defined set of processes. Once those processes 
> are terminated, ORTE eventually needs to inform all of the processes (in 
> the jobid(s) specified - maybe other jobids too?) that these processes have 
> failed/aborted. Upon notification of the failed/aborted processes, the local 
> process (at the OMPI level) needs to determine whether that process loss is 
> critical based upon the error handlers attached to the communicators it 
> shares with the failed/aborted processes. That should be handled in the 
> callback from the errmgr at the OMPI level, since connectedness is an MPI 
> construct. If the process failure/abort is critical to the local process, 
> then upon notification the local process can call abort on the affected 
> communicator.
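> 
> As a rough illustration of that decision (these types and names are just a 
> toy model, not the real OMPI/ORTE structures), the callback essentially does 
> something like:
> 
> #include <stdbool.h>
> #include <stddef.h>
> #include <stdio.h>
> 
> typedef enum { ERRH_ARE_FATAL, ERRH_RETURN } errh_t;
> 
> /* Toy stand-in for a communicator: its error handler and members. */
> typedef struct {
>     const char *name;
>     errh_t      errhandler;
>     const int  *members;
>     size_t      nmembers;
> } toy_comm_t;
> 
> /* Invoked when the runtime reports that 'failed_rank' has aborted/failed. */
> static void on_proc_failure(toy_comm_t *comms, size_t ncomms, int failed_rank)
> {
>     for (size_t i = 0; i < ncomms; ++i) {
>         bool shares = false;
>         for (size_t j = 0; j < comms[i].nmembers; ++j)
>             if (comms[i].members[j] == failed_rank) { shares = true; break; }
>         if (!shares)
>             continue;                 /* loss is not visible on this comm */
>         if (comms[i].errhandler == ERRH_ARE_FATAL)
>             printf("abort group of %s\n", comms[i].name);   /* critical */
>         else
>             printf("flag failure on %s\n", comms[i].name);  /* let app decide */
>     }
> }
> 
> int main(void)
> {
>     const int a[] = { 0, 1, 2 }, b[] = { 2, 3 };
>     toy_comm_t comms[] = {
>         { "comm_a", ERRH_ARE_FATAL, a, 3 },
>         { "comm_b", ERRH_RETURN,    b, 2 },
>     };
>     on_proc_failure(comms, 2, 2);   /* rank 2 failed */
>     return 0;
> }
> 
> The rolling abort described next falls out of the first branch: aborting one 
> communicator's group is itself a failure notification for other processes' 
> communicators.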
> 
> So this has the possibility for a rolling abort effect [the abort of one 
> communicator triggers the abort of another due to MPI_ERRORS_ARE_FATAL], 
> from which (depending upon the error handlers at the user level) the system 
> will eventually converge to either some stable subset of processes or all 
> processes aborting, resulting in job termination.
> 
> The rolling abort effect relies heavily upon the ability of the runtime to 
> make sure that all process failures/aborts are eventually known to all alive 
> processes. Since every alive process will know of the failure/abort, each 
> can then determine whether it is transitively affected by the failure based 
> upon its local list of communicators and associated error handlers. To 
> complete this aspect of the abort procedure we do need the callback 
> mechanism from the runtime, but since ORTE (today) will kill the whole job 
> for OMPI, this is not a big deal for end users: the job will terminate 
> anyway. Once we have the callback, we can finish tightening up the OMPI 
> layer code.
> 
> It is not perfect, but I think it does address the transitive nature of the 
> connectivity of MPI processes by relying on the runtime to provide uniform 
> notification of failures. I figure that we will need to look over this code 
> again and verify that the implementation of MPI_Comm_disconnect and 
> associated underpinnings do the 'right thing' with regard to updating the 
> communicator structures. But I think that is best addressed as a second set 
> of patches.
> 
> 
> The goal of this patch is to put back in functionality that was commented out 
> during the last reorganization of the errmgr. What will likely follow, once 
> we have notification of failure/abort at the OMPI level, is a cleanup of the 
> connected groups code paths.
> 
> 
> -- Josh
> 
> 
> On Jun 9, 2011, at 6:13 PM, George Bosilca wrote:
> 
>> What I'm saying is that there is no reason to have any other type of 
>> MPI_Abort if we are not able to compute the set of connected processes. 
>> 
>> With this RFC the processes in the communicator passed to MPI_Abort will 
>> abort. Then the other processes in the same MPI_COMM_WORLD (in fact jobid) 
>> will be notified (if we suppose that ORTE will not distinguish between 
>> aborted and faulty processes). As a result the entire MPI_COMM_WORLD will 
>> be aborted, if we consider a sane application where everyone uses the same 
>> type of error handler. However, this is not enough. We have to distribute 
>> the abort signal to every other "connected" process, and I don't see how we 
>> can compute this list of connected processes in Open MPI today. It is not 
>> that I don't see it in your patch; it is that the definition of 
>> connectivity in the MPI standard is transitive and relies heavily on a 
>> correct implementation of MPI_Comm_disconnect.
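>> 
>> For context, a minimal user-level sketch of the connectivity rule (MPI-2.2 
>> Section 10.5.4) -- the spawned "child" executable name here is just a 
>> placeholder:
>> 
>> #include <mpi.h>
>> 
>> int main(int argc, char **argv)
>> {
>>     MPI_Comm inter;
>>     MPI_Init(&argc, &argv);
>> 
>>     /* After the spawn, the two MPI_COMM_WORLDs are connected, so an
>>      * MPI_Abort in either one may take down processes in both. */
>>     MPI_Comm_spawn("child", MPI_ARGV_NULL, 2, MPI_INFO_NULL,
>>                    0, MPI_COMM_SELF, &inter, MPI_ERRCODES_IGNORE);
>> 
>>     /* ... communicate over 'inter' ... */
>> 
>>     /* Only once *all* communicators between the two worlds have been
>>      * disconnected (and connectivity is transitive, so via any third
>>      * party as well) are the two sides no longer connected, and an
>>      * abort on one side is no longer required to reach the other. */
>>     MPI_Comm_disconnect(&inter);
>> 
>>     MPI_Finalize();
>>     return 0;
>> }
>> 
>> Computing the transitive closure of that relation at abort time is exactly 
>> the part that is hard to do locally.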
>> 
>> george.
>> 
>> On Jun 9, 2011, at 16:59 , Josh Hursey wrote:
>> 
>>> On Thu, Jun 9, 2011 at 4:47 PM, George Bosilca <bosi...@eecs.utk.edu> wrote:
>>>> If this changes the behavior of MPI_Abort to only abort processes on the 
>>>> specified communicator, how does this not affect the default user 
>>>> experience (when today it aborts everything)?
>>> 
>>> Open MPI does abort everything by default - decided by the runtime at
>>> the moment (but addressed in your RFC). So it does not matter whether one
>>> process aborts or many do; the behavior of MPI_Abort experienced by the
>>> user will not change. Effectively the only change is an extra message in
>>> the runtime before the process actually calls errmgr.abort().
>>> 
>>> This branch just makes the implementation complete by first telling
>>> ORTE that a group of processes, defined by the communicator, should be
>>> terminated along with the calling process. Currently ORTE notices that
>>> there was an abort and terminates the job. Once your RFC goes through,
>>> this may no longer be the case, and OMPI can determine what to do
>>> when it receives a process failure notification.
>>> 
>>>> 
>>>> If we accept the fact that MPI_Abort will only abort the processes in the 
>>>> current communicator, what happens to the other processes in the same 
>>>> MPI_COMM_WORLD (but not in the communicator that was used by 
>>>> MPI_Abort)?
>>> 
>>> Currently, ORTE will abort them as well. When your RFC goes through,
>>> the OMPI layer will be notified of the error and can take the
>>> appropriate action, as determined by the MPI standard.
>>> 
>>>> What about all the other connected processes (based on the connectivity as 
>>>> defined in the MPI standard in Section 10.5.4)? Do they see this as a 
>>>> fault?
>>> 
>>> They are informed of the fault via the ORTE errmgr callback routine
>>> (that we have an RFC for), and can then take the appropriate action
>>> based on MPI semantics. So we are pushing the decision about the
>>> implications of the fault up to the OMPI layer - where it should be.
>>> 
>>> 
>>> The remainder of the OMPI layer logic for MPI_ERRORS_RETURN and other
>>> connected error management scenarios is not included in this patch,
>>> since that depends on there being a callback to the OMPI layer - which
>>> does not exist just yet. So this is a small patch that wires in the ORTE
>>> piece and allows the OMPI layer to request that a set of processes be
>>> terminated, to more accurately support MPI_Abort semantics.
>>> 
>>> Does that answer your questions?
>>> 
>>> -- Josh
>>> 
>>> 
>>>> 
>>>> george.
>>>> 
>>>> On Jun 9, 2011, at 16:32 , Josh Hursey wrote:
>>>> 
>>>>> WHAT: Fix missing code in MPI_Abort
>>>>> 
>>>>> WHY: MPI_Abort is missing logic to ask for termination of the process
>>>>> group defined by the communicator
>>>>> 
>>>>> WHERE: Mostly orte/mca/errmgr
>>>>> 
>>>>> WHEN: Open MPI trunk
>>>>> 
>>>>> TIMEOUT: Tuesday, June 14, 2011 (after teleconf)
>>>>> 
>>>>> Details:
>>>>> -------------------------------------------
>>>>> A bitbucket branch is available here (last sync to r24757 of trunk)
>>>>> https://bitbucket.org/jjhursey/ompi-abort/
>>>>> 
>>>>> In the MPI Standard (v2.2) Section 8.7 after the introduction of
>>>>> MPI_Abort, it states:
>>>>> "This routine makes a best attempt to abort all tasks in the group of 
>>>>> comm."
>>>>> 
>>>>> Open MPI currently only calls orte_errmgr.abort() to abort the calling
>>>>> process itself. The code to ask for the abort of the other processes
>>>>> in the group defined by the communicator is commented out. Since one
>>>>> process calling abort currently causes all processes in the job to
>>>>> abort, this has not been a big deal. However, as the group starts
>>>>> exploring better resilience in the OMPI layer (with further support
>>>>> from the ORTE layer), getting this aspect of MPI_Abort right will
>>>>> become more important.
>>>>> 
>>>>> This branch adds back the logic necessary for a single process calling
>>>>> MPI_Abort to request, from ORTE errmgr, that a defined subgroup of
>>>>> processes be aborted. Once the request is sent to the HNP, the local
>>>>> process then calls abort on itself. The HNP requests that the defined
>>>>> subgroup of processes be terminated using the existing plm mechanisms
>>>>> for doing so.
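>>>>> 
>>>>> In rough pseudo-C, the flow is as follows. The request_group_termination()
>>>>> helper below is an illustrative stand-in for the errmgr request sent to
>>>>> the HNP, not the actual interface in the branch.
>>>>> 
>>>>> #include <mpi.h>
>>>>> #include <stdio.h>
>>>>> #include <stdlib.h>
>>>>> 
>>>>> /* Illustrative stand-in for the message asking the HNP to terminate
>>>>>  * the listed peers; in the branch this goes through the ORTE errmgr. */
>>>>> static void request_group_termination(const int *ranks, int n, int errcode)
>>>>> {
>>>>>     printf("ask HNP to terminate %d peers (errcode %d)\n", n, errcode);
>>>>> }
>>>>> 
>>>>> static void sketch_abort(MPI_Comm comm, int errcode)
>>>>> {
>>>>>     MPI_Group grp;
>>>>>     int size, *ranks;
>>>>> 
>>>>>     /* 1. Build the list of peers in the group of 'comm' (the real code
>>>>>      *    translates these to ORTE process names). */
>>>>>     MPI_Comm_group(comm, &grp);
>>>>>     MPI_Group_size(grp, &size);
>>>>>     ranks = malloc((size_t)size * sizeof(*ranks));
>>>>>     for (int i = 0; i < size; ++i) ranks[i] = i;
>>>>> 
>>>>>     /* 2. Ask the runtime (HNP) to terminate that subgroup ... */
>>>>>     request_group_termination(ranks, size, errcode);
>>>>> 
>>>>>     /* 3. ... then abort the calling process itself. */
>>>>>     free(ranks);
>>>>>     MPI_Group_free(&grp);
>>>>>     abort();
>>>>> }
>>>>> 
>>>>> int main(int argc, char **argv)
>>>>> {
>>>>>     MPI_Init(&argc, &argv);
>>>>>     sketch_abort(MPI_COMM_WORLD, 1);   /* does not return */
>>>>>     return 0;
>>>>> }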
>>>>> 
>>>>> This change has no effect on the default behavior of MPI_Abort as
>>>>> currently experienced by users.
>>>>> 
>>>>> --
>>>>> Joshua Hursey
>>>>> Postdoctoral Research Associate
>>>>> Oak Ridge National Laboratory
>>>>> http://users.nccs.gov/~jjhursey
>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> -- 
>>> Joshua Hursey
>>> Postdoctoral Research Associate
>>> Oak Ridge National Laboratory
>>> http://users.nccs.gov/~jjhursey
>>> 
>> 
>> 
> 
> 

