Thanks for adding the capability to stop processing the callbacks. For the rest I have no preferences; let's move forward with what's in there and adapt if new needs appear.
Thanks,
George.

On Jul 15, 2013, at 16:05, Ralph Castain <r...@open-mpi.org> wrote:

>
> On Jul 15, 2013, at 6:45 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>
>> Ralph,
>>
>> Sorry for the late answer, we have quite a few things on our todo list right
>> now. Here are a few concerns I'm having about the proposed approach.
>>
>> 1. We would have preferred to have a list of processes for the
>> ompi_errhandler_runtime_callback function. We don't necessarily care about the
>> error code, but having a list will allow us to move the notifications in
>> bulk instead of one by one.
>
> No problem - I can easily make that change
>
>>
>> 2. You made the registration of the callbacks ordered, and added special
>> arguments to append or prepend callbacks to the list. Right now I can't
>> figure out a good reason to use it, especially since the order might be
>> imposed by the order the modules are loaded by the frameworks, thus not
>> something we can easily control.
>>
>> 3. The callback list. The concept is useful, I don't know about the
>> implementation. The current version doesn't support stopping the propagation
>> of the error signal, which might be an issue in some cases. I can picture
>> the case where one level knows about the issue, and knows how to fix it, so the
>> error does not need to propagate to other levels. This can be implemented in
>> the old way interrupts were managed in DOS, with basically a simple _get /
>> _set type of interface. If a callback wants to propagate the error it has
>> first to retrieve the ancestor at the moment when it registers the callback
>> and then explicitly call it upon error.
>>
>
> Yeah, these things bothered me too. I did it for only one reason. The current
> implementation does as you describe in terms of the caller maintaining
> ancestry. However, what if the first thing registered is the "abort"
> callback?
> Then how do you avoid having "abort" called early in the process,
> not giving other callbacks a chance to attempt to continue?
>
> So I started with two registration calls - one for a default, and the other
> for anything else. Then it occurred to me that someone might want a
> "prologue" handler - e.g., start the error handling by blocking the injection
> of any more messages until we know what the problem is. So I added a
> registration for a prologue.
>
> I now had registrations for a prologue, an epilogue, and a regular callback.
> So I just generalized it, figuring that someone could ignore the ordering and
> just add callbacks if they wanted to, but leaving the ability to specify "go
> first" and "go last".
>
> I don't honestly have anything specific in mind for it, but that was the
> reasoning. I added the ability to stop processing callbacks (a return of
> OMPI_SUCCESS will stop it), so that is there.
>
> Any preferences?
>
>> Again, nothing major in the short term as it will take a significant amount
>> of work to move the only user of such error handling capability (the FT
>> prototype) back over the current version of the ORTE.
>>
>> Regards,
>> George.
>>
>>
>> On Jul 3, 2013, at 06:45, Ralph Castain <r...@open-mpi.org> wrote:
>>
>>> **** NOTICE: This RFC modifies the MPI-RTE interface ****
>>>
>>> WHAT: revise the RTE error handling to allow registration of callbacks upon
>>> RTE-detected errors
>>>
>>> WHY: currently, the RTE aborts the process if an RTE-detected error occurs.
>>> This allows the upper layers (e.g., MPI) no chance to implement their own
>>> error response strategy, and it precludes allowing user-defined error
>>> handling.
>>>
>>> TIMEOUT: let's go for July 19th, pending further discussion
>>>
>>> George and I were talking about ORTE's error handling the other day in
>>> regards to the right way to deal with errors in the updated OOB.
>>> Specifically, it seemed a bad idea for a library such as ORTE to be
>>> aborting the job on its own prerogative. If we lose a connection or cannot
>>> send a message, then we really should just report it upwards and let the
>>> application and/or upper layers decide what to do about it.
>>>
>>> The current code base only allows a single error callback to exist, which
>>> seemed unduly limiting. So, based on the conversation, I've modified the
>>> errmgr interface to provide a mechanism for registering any number of error
>>> handlers (this replaces the current "set_fault_callback" API). When an
>>> error occurs, these handlers will be called in order until one responds
>>> that the error has been "resolved" - i.e., no further action is required.
>>> The default MPI layer error handler is specified to go "last" and calls
>>> mpi_abort, so the current "abort" behavior is preserved unless other error
>>> handlers are registered.
>>>
>>> In the register_callback function, I provide an "order" param so you can
>>> specify "this callback must come first" or "this callback must come last".
>>> Seemed to me that we will probably have different code areas registering
>>> callbacks, and one might require it go first (the default "abort" will
>>> always require it go last). So you can append and prepend, or go first/last.
>>>
>>> The errhandler callback function passes the name of the proc involved
>>> (which can be yourself for internal errors) and the error code. This is a
>>> change from the current fault callback which returned an opal_pointer_array
>>> of process names.
>>>
>>> The work is available for review in my bitbucket:
>>>
>>> https://bitbucket.org/rhc/ompi-errmgr
>>>
>>> I've attached the svn diff as well.
>>>
>>> Appreciate your comments - nothing in concrete.
>>> Ralph
>>>
>>> <err.diff>
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
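
For readers following the thread, the ordered registration and "stop when resolved" behavior discussed above could be sketched roughly as follows. This is only an illustration under assumptions, not the code in the attached diff: every name here (register_callback's signature, the ORDER_* values, SKETCH_SUCCESS standing in for OMPI_SUCCESS, the fixed-size table, the three demo handlers) is hypothetical, and the rule that only one callback may claim "first" or "last" is my own guess. The only parts taken from the thread are that callbacks can go first, last, be prepended or appended, and that a handler returning success stops further processing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch; none of these names are the real ORTE errmgr API. */
#define SKETCH_SUCCESS  0   /* stands in for OMPI_SUCCESS: error resolved, stop */
#define SKETCH_CONTINUE 1   /* not resolved: try the next callback */
#define SKETCH_MAX_CBS  16
#define ERR_RECOVERABLE 100 /* made-up error codes for the demo handlers */
#define ERR_FATAL       200

typedef enum { ORDER_FIRST, ORDER_PREPEND, ORDER_APPEND, ORDER_LAST } cb_order;
typedef int (*err_cb)(int proc, int errcode);
typedef struct { err_cb fn; cb_order order; } cb_entry;

static cb_entry cbs[SKETCH_MAX_CBS];
static size_t ncbs = 0;

/* Insert fn so that a FIRST entry stays at the head and a LAST entry at the tail. */
int register_callback(err_cb fn, cb_order order)
{
    size_t pos, i;
    int has_first, has_last;

    assert(fn != NULL);
    if (ncbs == SKETCH_MAX_CBS) return -1;
    has_first = (ncbs > 0 && cbs[0].order == ORDER_FIRST);
    has_last  = (ncbs > 0 && cbs[ncbs - 1].order == ORDER_LAST);

    switch (order) {
    case ORDER_FIRST:   if (has_first) return -1; pos = 0;    break;
    case ORDER_PREPEND: pos = has_first ? 1 : 0;              break;
    case ORDER_APPEND:  pos = has_last ? ncbs - 1 : ncbs;     break;
    case ORDER_LAST:    if (has_last) return -1;  pos = ncbs; break;
    default:            return -1;
    }
    for (i = ncbs; i > pos; i--) cbs[i] = cbs[i - 1];  /* shift to open a slot */
    cbs[pos].fn = fn;
    cbs[pos].order = order;
    ncbs++;
    return 0;
}

/* Call handlers in order until one reports the error as resolved. */
int dispatch_error(int proc, int errcode)
{
    size_t i;
    for (i = 0; i < ncbs; i++) {
        if (cbs[i].fn(proc, errcode) == SKETCH_SUCCESS) return SKETCH_SUCCESS;
    }
    return SKETCH_CONTINUE;
}

/* Demo handlers record one letter each so the call order can be inspected. */
static char trace[16];
static size_t trace_len = 0;
static void record(char c) { if (trace_len < sizeof trace) trace[trace_len++] = c; }

/* "Prologue": e.g. block message injection, then let others handle the error. */
int block_injection(int proc, int errcode)
{ (void)proc; (void)errcode; record('P'); return SKETCH_CONTINUE; }

/* Regular handler: resolves only the recoverable error code. */
int try_recover(int proc, int errcode)
{ (void)proc; record('R'); return errcode == ERR_RECOVERABLE ? SKETCH_SUCCESS : SKETCH_CONTINUE; }

/* Default "abort" handler: a real one would abort; here it only records the call. */
int default_abort(int proc, int errcode)
{ (void)proc; (void)errcode; record('A'); return SKETCH_SUCCESS; }
```

With default_abort registered as ORDER_LAST, try_recover as ORDER_APPEND and block_injection as ORDER_FIRST (in that registration order), a dispatch of ERR_RECOVERABLE runs the prologue and the recovery handler and never reaches the abort handler, while ERR_FATAL falls through to it. That is the behavior Ralph describes for keeping "abort" from being called early regardless of when it was registered.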