On Jun 10, 2011, at 6:48 AM, Josh Hursey wrote:

> Why would this patch result in zombied processes and poor cleanup?
> When ORTE receives notification of a process terminating/aborting, it
> triggers the termination of the job (without UTK's RFC), which should
> ensure a clean shutdown. This patch just tells ORTE that a few other
> processes should be the first to die, which will trigger the same
> response in ORTE.
> 
> I guess I'm unclear about this concern, since it would apply to the
> current ORTE as well. I agree that it will be a concern once we have
> the OMPI layer handling error management triggered off of a callback,
> but that is a different RFC.

My comment was about "the future" - i.e., looking ahead to the point where
we get layered, rolling aborts.

I agree that this specific RFC won't change the current behavior, and as I 
said, I have no issue with it.


> 
> 
> Something that might help those listening to this thread. The current
> behavior of MPI_Abort in OMPI results in the semantics of:
> --------------
> internal_MPI_Abort(MPI_COMM_SELF, exit_code)
> --------------
> regardless of the communicator actually passed to the MPI_Abort at the
> application level. It should be:
> --------------
> internal_MPI_Abort(comm_provided, exit_code)
> --------------
> 
> Semantically, this patch just makes the group actually being aborted
> match the communicator provided. In practice, the job will terminate
> when any process in the job calls abort - so the result (in today's
> codebase) will be the same.
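> 
> To make that concrete, here is a rough sketch of the intended call
> path (the helper names are illustrative only, not the actual Open MPI
> internals):
> --------------
> int MPI_Abort(MPI_Comm comm, int errorcode)
> {
>     /* With the patch: first ask the runtime to terminate the other
>        processes in comm's (local and remote) groups ... */
>     request_termination_of_group(comm, errorcode);  /* hypothetical */
> 
>     /* ... then abort the calling process itself, which is all the
>        current code does (hence the MPI_COMM_SELF semantics today). */
>     abort_self(errorcode);                          /* hypothetical */
> 
>     return MPI_SUCCESS;  /* not reached */
> }
> --------------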
> 
> -- Josh
> 
> 
> On Fri, Jun 10, 2011 at 7:30 AM, Ralph Castain <r...@open-mpi.org> wrote:
>> I have no issue with uncommenting the code. However, I do see a future 
>> littered with lots of zombied processes and complaints over poor cleanup 
>> again....
>> 
>> 
>> On Jun 9, 2011, at 6:08 PM, Joshua Hursey wrote:
>> 
>>> Ah, I see what you are getting at now.
>>> 
>>> The construction of the list of connected processes is something I
>>> intentionally did not modify from the current Open MPI code. The list is
>>> calculated based on the locally known set of local and remote process
>>> groups attached to the communicator. So this is the set of directly
>>> connected processes in the specified communicator, as known to the
>>> calling process at the OMPI level.
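>>> 
>>> A minimal sketch of how that list gets assembled (the symbol names
>>> below are paraphrased from memory and may not match the actual Open
>>> MPI code exactly):
>>> --------------
>>> /* Sketch: collect the procs in comm's local and remote groups. */
>>> int nlocal  = ompi_comm_size(comm);
>>> int nremote = ompi_comm_remote_size(comm);
>>> int nprocs  = nlocal + nremote;
>>> orte_process_name_t *procs = malloc(nprocs * sizeof(*procs));
>>> for (int i = 0; i < nlocal; ++i) {
>>>     procs[i] = ompi_group_peer_lookup(comm->c_local_group, i)->proc_name;
>>> }
>>> for (int i = 0; i < nremote; ++i) {
>>>     procs[nlocal + i] =
>>>         ompi_group_peer_lookup(comm->c_remote_group, i)->proc_name;
>>> }
>>> /* 'procs' is then handed to the ORTE errmgr as the group to abort. */
>>> --------------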
>>> 
>>> ORTE is asked to abort this defined set of processes. Once those processes
>>> are terminated, ORTE needs to eventually inform all of the processes
>>> (in the jobid(s) specified - maybe other jobids too?) that these processes
>>> have failed/aborted. Upon notification of the failed/aborted processes, the
>>> local process (at the OMPI level) needs to determine whether that process
>>> loss is critical, based upon the error handlers attached to the
>>> communicators it shares with the failed/aborted processes. That should be
>>> handled in the callback from the errmgr at the OMPI level, since
>>> connectedness is an MPI construct. If the process failure/abort is critical
>>> to the local process, then upon notification the local process can call
>>> abort on the affected communicator.
>>> 
>>> So this has the possibility of a rolling abort effect [the abort of one
>>> communicator triggers the abort of another due to MPI_ERRORS_ARE_FATAL].
>>> From this (depending upon the error handlers at the user level) the system
>>> will eventually converge to either some stable subset of processes, or all
>>> processes aborting, resulting in job termination.
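>>> 
>>> As a rough sketch of the OMPI-level callback logic described above
>>> (purely illustrative - the callback signature and the helper routines
>>> are assumptions, since that interface is the subject of the other RFC):
>>> --------------
>>> /* Sketch: invoked when the runtime reports a failed/aborted proc. */
>>> void ompi_errmgr_proc_failed_cb(orte_process_name_t failed)
>>> {
>>>     for (int i = 0; i < num_active_comms(); ++i) {      /* hypothetical */
>>>         ompi_communicator_t *comm = get_active_comm(i); /* hypothetical */
>>>         if (!comm_contains_proc(comm, failed)) {
>>>             continue;  /* this comm does not share the failed proc */
>>>         }
>>>         if (comm_errhandler_is_fatal(comm)) {
>>>             /* MPI_ERRORS_ARE_FATAL: abort this communicator's group
>>>                as well -> the "rolling abort" effect. */
>>>             internal_MPI_Abort(comm, MPI_ERR_OTHER);
>>>         } else {
>>>             /* MPI_ERRORS_RETURN / user handler: remember the failure
>>>                and surface it to the application on the next MPI call. */
>>>             mark_comm_failed_peer(comm, failed);        /* hypothetical */
>>>         }
>>>     }
>>> }
>>> --------------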
>>> 
>>> The rolling abort effect relies heavily upon the ability of the runtime to
>>> make sure that all process failures/aborts are eventually known to all
>>> alive processes. Since all alive processes will know of the failure/abort,
>>> each can then determine whether it is transitively affected by the failure,
>>> based upon its local list of communicators and associated error handlers.
>>> But to complete this aspect of the abort procedure, we do need the callback
>>> mechanism from the runtime - although since ORTE (today) will kill the job
>>> for OMPI, this is not a big deal for end users, since the job will
>>> terminate anyway. Once we have the callback, we can finish tightening up
>>> the OMPI layer code.
>>> 
>>> It is not perfect, but I think it does address the transitive nature of the 
>>> connectivity of MPI processes by relying on the runtime to provide uniform 
>>> notification of failures. I figure that we will need to look over this code 
>>> again and verify that the implementation of MPI_Comm_disconnect and 
>>> associated underpinnings do the 'right thing' with regard to updating the 
>>> communicator structures. But I think that is best addressed as a second set 
>>> of patches.
>>> 
>>> 
>>> The goal of this patch is to put back in functionality that was commented 
>>> out during the last reorganization of the errmgr. What will likely follow, 
>>> once we have notification of failure/abort at the OMPI level, is a cleanup 
>>> of the connected groups code paths.
>>> 
>>> 
>>> -- Josh
>>> 
>>> 
>>> On Jun 9, 2011, at 6:13 PM, George Bosilca wrote:
>>> 
>>>> What I'm saying is that there is no reason to have any other type of 
>>>> MPI_Abort if we are not able to compute the set of connected processes.
>>>> 
>>>> With this RFC the processes in the communicator passed to MPI_Abort will
>>>> abort. Then the other processes in the same MPI_COMM_WORLD (in fact jobid)
>>>> will be notified (if we suppose that ORTE will not make a difference
>>>> between aborted and faulty). As a result the entire MPI_COMM_WORLD will be
>>>> aborted, if we consider a sane application where everyone uses the same
>>>> type of error handler. However, this is not enough. We have to distribute
>>>> the abort signal to every other "connected" process, and I don't see how
>>>> we can compute this list of connected processes in Open MPI today. It is
>>>> not that I don't see it in your patch; it is that the definition of
>>>> connectivity in the MPI standard is transitive and relies heavily on a
>>>> correct implementation of MPI_Comm_disconnect.
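>>>> 
>>>> To illustrate why: the connected set is a transitive closure over
>>>> communicator membership, so something like the sketch below would be
>>>> needed - and it presumes a global view of every process's communicators
>>>> that no single process has (all names here are hypothetical):
>>>> --------------
>>>> /* Sketch: fixed-point computation of the connected set. */
>>>> proc_set_t *connected = proc_set_of(aborting_proc);
>>>> bool grew;
>>>> do {
>>>>     grew = false;
>>>>     /* Requires knowledge of every communicator in the universe,
>>>>        including ones created by other processes via connect/accept,
>>>>        spawn, etc.  Not available locally. */
>>>>     for (int i = 0; i < num_comms_in_universe(); ++i) {
>>>>         proc_set_t *grp = comm_group(comm_in_universe(i));
>>>>         if (proc_set_intersects(connected, grp) &&
>>>>             !proc_set_contains_all(connected, grp)) {
>>>>             proc_set_union_into(connected, grp);
>>>>             grew = true;
>>>>         }
>>>>     }
>>>> } while (grew);
>>>> --------------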
>>>> 
>>>> george.
>>>> 
>>>> On Jun 9, 2011, at 16:59 , Josh Hursey wrote:
>>>> 
>>>>> On Thu, Jun 9, 2011 at 4:47 PM, George Bosilca <bosi...@eecs.utk.edu> 
>>>>> wrote:
>>>>>> If this changes the behavior of MPI_Abort to only abort processes in
>>>>>> the specified communicator, how does this not affect the default user
>>>>>> experience (when today it aborts everything)?
>>>>> 
>>>>> Open MPI does abort everything by default - decided by the runtime at
>>>>> the moment (but addressed in your RFC). So it does not matter whether
>>>>> one process aborts or many do; the behavior of MPI_Abort experienced
>>>>> by the user will not change. Effectively the only change is an extra
>>>>> message in the runtime before the process actually calls
>>>>> errmgr.abort().
>>>>> 
>>>>> This branch just makes the implementation complete by first telling
>>>>> ORTE that a group of processes, defined by the communicator, should be
>>>>> terminated along with the calling process. Currently ORTE notices that
>>>>> there was an abort and terminates the job. Once your RFC goes through,
>>>>> this may no longer be the case, and OMPI can determine what to do
>>>>> when it receives a process failure notification.
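>>>>> 
>>>>> Roughly, the new flow looks like the sketch below (the exact errmgr
>>>>> entry point is paraphrased and may differ from what is actually in
>>>>> the bitbucket branch):
>>>>> --------------
>>>>> /* Sketch: caller side of MPI_Abort with this branch applied. */
>>>>> /* 1. Ask the runtime (via the HNP) to terminate the processes in
>>>>>       comm's group, i.e. the 'procs' list built from the
>>>>>       communicator.  Entry point name assumed. */
>>>>> orte_errmgr.abort_peers(procs, nprocs);
>>>>> 
>>>>> /* 2. Then abort the calling process itself - the only step the
>>>>>       current code performs. */
>>>>> orte_errmgr.abort(errorcode, NULL);
>>>>> --------------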
>>>>> 
>>>>>> 
>>>>>> If we accept the fact that MPI_Abort will only abort the processes in
>>>>>> the current communicator, what happens to the other processes in the
>>>>>> same MPI_COMM_WORLD (but not in the communicator that has been used by
>>>>>> MPI_Abort)?
>>>>> 
>>>>> Currently, ORTE will abort them as well. When your RFC goes through,
>>>>> the OMPI layer will be notified of the error and can take the
>>>>> appropriate action, as determined by the MPI standard.
>>>>> 
>>>>>> What about all the other connected processes (based on the connectivity
>>>>>> as defined in the MPI standard, Section 10.5.4)? Do they see this as a
>>>>>> fault?
>>>>> 
>>>>> They are informed of the fault via the ORTE errmgr callback routine
>>>>> (which we have an RFC for), and then can take the appropriate action
>>>>> based on MPI semantics. So we are pushing the decision about the
>>>>> implications of the fault to the OMPI layer - where it should be.
>>>>> 
>>>>> 
>>>>> The remainder of the OMPI layer logic for MPI_ERRORS_RETURN and other
>>>>> connected error management scenarios is not included in this patch,
>>>>> since that depends on there being a callback to the OMPI layer - which
>>>>> does not exist just yet. So this is a small patch to wire in the ORTE
>>>>> piece that allows the OMPI layer to request that a set of processes be
>>>>> terminated - to more accurately support MPI_Abort semantics.
>>>>> 
>>>>> Does that answer your questions?
>>>>> 
>>>>> -- Josh
>>>>> 
>>>>> 
>>>>>> 
>>>>>> george.
>>>>>> 
>>>>>> On Jun 9, 2011, at 16:32 , Josh Hursey wrote:
>>>>>> 
>>>>>>> WHAT: Fix missing code in MPI_Abort
>>>>>>> 
>>>>>>> WHY: MPI_Abort is missing logic to ask for termination of the process
>>>>>>> group defined by the communicator
>>>>>>> 
>>>>>>> WHERE: Mostly orte/mca/errmgr
>>>>>>> 
>>>>>>> WHEN: Open MPI trunk
>>>>>>> 
>>>>>>> TIMEOUT: Tuesday, June 14, 2011 (after teleconf)
>>>>>>> 
>>>>>>> Details:
>>>>>>> -------------------------------------------
>>>>>>> A bitbucket branch is available here (last sync to r24757 of trunk)
>>>>>>> https://bitbucket.org/jjhursey/ompi-abort/
>>>>>>> 
>>>>>>> In the MPI Standard (v2.2) Section 8.7 after the introduction of
>>>>>>> MPI_Abort, it states:
>>>>>>> "This routine makes a best attempt to abort all tasks in the group of 
>>>>>>> comm."
>>>>>>> 
>>>>>>> Open MPI currently only calls orte_errmgr.abort() to abort the calling
>>>>>>> process itself. The code to ask for the abort of the other processes
>>>>>>> in the group defined by the communicator is commented out. Since one
>>>>>>> process calling abort currently causes all processes in the job to
>>>>>>> abort, this has not been a big deal. However, as the group starts
>>>>>>> exploring better resilience in the OMPI layer (with further support
>>>>>>> from the ORTE layer), this aspect of MPI_Abort will become more
>>>>>>> important to get right.
>>>>>>> 
>>>>>>> This branch adds back the logic necessary for a single process calling
>>>>>>> MPI_Abort to request, from ORTE errmgr, that a defined subgroup of
>>>>>>> processes be aborted. Once the request is sent to the HNP, the local
>>>>>>> process then calls abort on itself. The HNP requests that the defined
>>>>>>> subgroup of processes be terminated using the existing plm mechanisms
>>>>>>> for doing so.
>>>>>>> 
>>>>>>> This change has no effect on the default MPI_Abort behavior that
>>>>>>> users currently experience.
>>>>>>> 
>>>>>>> --
>>>>>>> Joshua Hursey
>>>>>>> Postdoctoral Research Associate
>>>>>>> Oak Ridge National Laboratory
>>>>>>> http://users.nccs.gov/~jjhursey
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> Joshua Hursey
>>>>> Postdoctoral Research Associate
>>>>> Oak Ridge National Laboratory
>>>>> http://users.nccs.gov/~jjhursey
>>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
>> 
>> 
> 
> 
> 
> -- 
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey
> 

