Re: [OMPI devel] RFC: revised ORTE error handling

Ralph Castain Mon, 15 Jul 2013 10:05:56 -0400

On Jul 15, 2013, at 6:45 AM, George Bosilca <[email protected]> wrote:


> Ralph,
> 
> Sorry for the late answer, we have quite a few things on our todo list right 
> now. Here are a few concerns I have about the proposed approach.
> 
> 1. We would have preferred to have a list of processes for the 
> ompi_errhandler_runtime_callback function. We don't necessarily care about 
> the error code, but having a list will allow us to deliver the notifications 
> in bulk instead of one by one.

No problem - I can easily make that change

> 
> 2. You made the registration of the callbacks ordered, and added special 
> arguments to append or prepend callbacks to the list. Right now I can't 
> figure out a good use for this, especially since the order might be imposed 
> by the order in which the frameworks load their modules, which is not 
> something we can easily control.
> 
> 3. The callback list. The concept is useful, but I'm not sure about the 
> implementation. The current version doesn't support stopping the propagation 
> of the error signal, which might be an issue in some cases. I can picture a 
> case where one level knows about the issue and knows how to fix it, so the 
> error does not need to propagate to other levels. This can be implemented the 
> way interrupts used to be managed in DOS, with basically a simple _get / 
> _set type of interface. If a callback wants to propagate the error, it first 
> has to retrieve the ancestor at the moment it registers the callback 
> and then explicitly call it upon error.
> 
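A minimal sketch of the _get / _set chaining idea from point 3, in C. Every name here (errh_get, errh_set, errh_raise, the handlers, and the error codes) is hypothetical and made up for illustration; none of it is taken from ORTE, OMPI, or the attached diff. Each layer saves the previously installed handler (its "ancestor") when it registers, and forwards to it only when it cannot resolve the error itself.

```c
#include <assert.h>
#include <stddef.h>

/* All names below are hypothetical; they are not ORTE or OMPI symbols. */
typedef int (*err_handler_fn)(int proc, int error_code);

enum { ERR_RESOLVED = 0, ERR_UNRESOLVED = 1, ERR_ABORTED = 2 };
enum { ERR_COMM_LOST = 10, ERR_OTHER = 11 };

/* The single "vector" slot, in the spirit of a DOS interrupt vector. */
err_handler_fn current_handler = NULL;

/* _get: return whatever handler is installed right now. */
err_handler_fn errh_get(void) { return current_handler; }

/* _set: install a new handler, replacing the current one. */
void errh_set(err_handler_fn fn) { current_handler = fn; }

int abort_calls = 0;

/* Stand-in for a default "abort" handler; it only counts invocations. */
int abort_handler(int proc, int error_code)
{
    (void)proc;
    (void)error_code;
    abort_calls++;
    return ERR_ABORTED;
}

/* Ancestor captured by the recovery layer at registration time. */
err_handler_fn recovery_ancestor = NULL;

/* Resolves ERR_COMM_LOST itself; forwards everything else to its ancestor. */
int recovery_handler(int proc, int error_code)
{
    if (error_code == ERR_COMM_LOST) {
        return ERR_RESOLVED;            /* propagation stops here */
    }
    if (recovery_ancestor != NULL) {
        return recovery_ancestor(proc, error_code);
    }
    return ERR_UNRESOLVED;
}

/* Registration: remember the ancestor, then take over the slot. */
void recovery_register(void)
{
    recovery_ancestor = errh_get();
    errh_set(recovery_handler);
}

/* Entry point a runtime would call when it detects an error. */
int errh_raise(int proc, int error_code)
{
    err_handler_fn fn = errh_get();
    return (fn != NULL) ? fn(proc, error_code) : ERR_UNRESOLVED;
}
```

With abort_handler installed first and recovery_register() called afterwards, raising ERR_COMM_LOST never reaches the abort handler, while any other code is forwarded to it. In this scheme the most recently registered handler always runs first, so the effective order depends entirely on registration order.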

Yeah, these things bothered me too. I did it for only one reason. The current 
implementation does as you describe in terms of the caller maintaining 
ancestry. However, what if the first thing registered is the "abort" callback? 
Then how do you avoid having "abort" called early in the process, not giving 
other callbacks a chance to attempt to continue?

So I started with two registration calls - one for a default, and the other for 
anything else. Then it occurred to me that someone might want a "prologue" 
handler - e.g., start the error handling by blocking the injection of any more 
messages until we know what the problem is. So I added a registration for a 
prologue.

I now had registrations for a prologue, an epilogue, and a regular callback. So 
I just generalized it, figuring that someone could ignore the ordering and just 
add callbacks if they wanted to, but leaving the ability to specify "go first" 
and "go last".

I don't honestly have anything specific in mind for it, but that was the 
reasoning. I added the ability to stop processing callbacks (a return of 
OMPI_SUCCESS will stop it), so that is there.
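
To make the ordering discussion concrete, here is a rough sketch in C of a registry with "go first", "go last", prepend, and append slots, where a callback returning success stops further processing. It is an illustration under assumptions, not the code in the diff: register_cb, dispatch_error, the ORDER_* values, and DEMO_SUCCESS (standing in for OMPI_SUCCESS) are all hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names; DEMO_SUCCESS stands in for OMPI_SUCCESS. */
#define DEMO_SUCCESS   0
#define DEMO_ERROR     (-1)
#define MAX_CALLBACKS  16

typedef enum { ORDER_FIRST, ORDER_PREPEND, ORDER_APPEND, ORDER_LAST } cb_order_t;
typedef int (*err_cb_fn)(int proc, int error_code);

err_cb_fn first_cb = NULL;          /* prologue slot */
err_cb_fn last_cb = NULL;           /* epilogue slot, e.g. default abort */
err_cb_fn middle[MAX_CALLBACKS];    /* ordinary callbacks, in call order */
size_t n_middle = 0;

/* Register a callback; the first/last slots can each be claimed once. */
int register_cb(err_cb_fn fn, cb_order_t order)
{
    switch (order) {
    case ORDER_FIRST:
        if (first_cb != NULL) return DEMO_ERROR;
        first_cb = fn;
        return DEMO_SUCCESS;
    case ORDER_LAST:
        if (last_cb != NULL) return DEMO_ERROR;
        last_cb = fn;
        return DEMO_SUCCESS;
    case ORDER_PREPEND:
    case ORDER_APPEND:
        if (n_middle == MAX_CALLBACKS) return DEMO_ERROR;
        if (order == ORDER_PREPEND) {
            /* Shift existing entries up by one to free slot 0. */
            memmove(&middle[1], &middle[0], n_middle * sizeof(middle[0]));
            middle[0] = fn;
        } else {
            middle[n_middle] = fn;
        }
        n_middle++;
        return DEMO_SUCCESS;
    }
    return DEMO_ERROR;
}

/* Call callbacks in order; stop as soon as one returns DEMO_SUCCESS. */
int dispatch_error(int proc, int error_code)
{
    if (first_cb != NULL && first_cb(proc, error_code) == DEMO_SUCCESS) {
        return DEMO_SUCCESS;
    }
    for (size_t i = 0; i < n_middle; i++) {
        if (middle[i](proc, error_code) == DEMO_SUCCESS) {
            return DEMO_SUCCESS;
        }
    }
    return (last_cb != NULL) ? last_cb(proc, error_code) : DEMO_ERROR;
}

/* Demo callbacks that record the order in which they ran. */
char trace[8];
size_t trace_len = 0;

void mark(char c)
{
    if (trace_len < sizeof(trace) - 1) {
        trace[trace_len++] = c;
        trace[trace_len] = '\0';
    }
}

/* Prologue: does its work but never claims the error is resolved. */
int prologue_cb(int proc, int code) { (void)proc; (void)code; mark('P'); return DEMO_ERROR; }

/* Recovery: resolves only error code 42 (an arbitrary demo value). */
int recover_cb(int proc, int code) { (void)proc; mark('R'); return code == 42 ? DEMO_SUCCESS : DEMO_ERROR; }

/* Stand-in for the default abort handler; records the call only. */
int abort_cb(int proc, int code) { (void)proc; (void)code; mark('A'); return DEMO_ERROR; }
```

Even if abort_cb is registered before anything else, asking for ORDER_LAST keeps it at the end, which is the case the caller-maintained ancestry approach has trouble with. Registering prologue_cb as ORDER_FIRST and recover_cb as ORDER_APPEND then gives the call order prologue, recover, abort, with the abort stand-in skipped whenever recover_cb returns DEMO_SUCCESS.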

Any preferences?

> Again, nothing major in the short term, as it will take a significant amount 
> of work to move the only user of such error handling capability (the FT 
> prototype) onto the current version of ORTE.
> 
> Regards,
>   George.
> 
> 
> 
> On Jul 3, 2013, at 06:45 , Ralph Castain <[email protected]> wrote:
> 
>> **** NOTICE: This RFC modifies the MPI-RTE interface ****
>> 
>> WHAT: revise the RTE error handling to allow registration of callbacks upon 
>> RTE-detected errors
>> 
>> WHY: currently, the RTE aborts the process if an RTE-detected error occurs. 
>> This gives the upper layers (e.g., MPI) no chance to implement their own 
>> error response strategy, and it precludes user-defined error handling.
>> 
>> TIMEOUT:  let's go for July 19th, pending further discussion
>> 
>> George and I were talking about ORTE's error handling the other day in 
>> regards to the right way to deal with errors in the updated OOB. 
>> Specifically, it seemed a bad idea for a library such as ORTE to be aborting 
>> the job on its own prerogative. If we lose a connection or cannot send a 
>> message, then we really should just report it upwards and let the 
>> application and/or upper layers decide what to do about it.
>> 
>> The current code base only allows a single error callback to exist, which 
>> seemed unduly limiting. So, based on the conversation, I've modified the 
>> errmgr interface to provide a mechanism for registering any number of error 
>> handlers (this replaces the current "set_fault_callback" API). When an error 
>> occurs, these handlers will be called in order until one responds that the 
>> error has been "resolved" - i.e., no further action is required. The default 
>> MPI layer error handler is specified to go "last" and calls mpi_abort, so 
>> the current "abort" behavior is preserved unless other error handlers are 
>> registered.
>> 
>> In the register_callback function, I provide an "order" param so you can 
>> specify "this callback must come first" or "this callback must come last". 
>> Seemed to me that we will probably have different code areas registering 
>> callbacks, and one might require it go first (the default "abort" will 
>> always require it go last). So you can append and prepend, or go first/last.
>> 
>> The errhandler callback function passes the name of the proc involved (which 
>> can be yourself for internal errors) and the error code. This is a change 
>> from the current fault callback which returned an opal_pointer_array of 
>> process names.
>> 
>> The work is available for review in my bitbucket:
>> 
>> https://bitbucket.org/rhc/ompi-errmgr
>> 
>> I've attached the svn diff as well.
>> 
>> Appreciate your comments - nothing is set in concrete.
>> Ralph
>> 
>> <err.diff>
>> _______________________________________________
>> devel mailing list
>> [email protected]
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
