So why not have the callback return an int, and your callback returns "go no further"?
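To make the suggestion concrete, here is a rough sketch of a callback stack where each callback returns an int and the newest callback can tell the dispatcher to go no further. All names here (fault_cb_t, register_fault_callback, dispatch_fault, the FAULT_CB_* codes) are made up for illustration and are not the ORTE errmgr API; the callbacks also take a plain int rank rather than an orte_process_name_t, just to keep the example self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical return codes, not part of any real ORTE interface. */
enum { FAULT_CB_CONTINUE = 0, FAULT_CB_STOP = 1 };

/* Hypothetical callback type: returns one of the codes above. */
typedef int (*fault_cb_t)(int failed_rank);

#define MAX_CALLBACKS 8
static fault_cb_t cb_stack[MAX_CALLBACKS];
static size_t cb_count = 0;

/* Push a callback on top of the stack. Returns 0 on success, -1 on error. */
static int register_fault_callback(fault_cb_t cb)
{
    if (cb == NULL || cb_count == MAX_CALLBACKS) {
        return -1;
    }
    cb_stack[cb_count++] = cb;
    return 0;
}

/* Call callbacks newest-first and stop at the first one that returns
 * FAULT_CB_STOP. Returns the number of callbacks that were invoked. */
static int dispatch_fault(int failed_rank)
{
    int invoked = 0;
    for (size_t i = cb_count; i > 0; i--) {
        assert(cb_stack[i - 1] != NULL);
        invoked++;
        if (cb_stack[i - 1](failed_rank) == FAULT_CB_STOP) {
            break;
        }
    }
    return invoked;
}

static int abort_calls = 0;

/* Stand-in for the default callback; a real one would call ompi_mpi_abort(). */
static int default_abort_callback(int failed_rank)
{
    (void)failed_rank;
    abort_calls++;
    return FAULT_CB_CONTINUE;
}

/* Stand-in for a fault-tolerant callback that suppresses the default. */
static int resilient_callback(int failed_rank)
{
    printf("rank %d failed; continuing\n", failed_rank);
    return FAULT_CB_STOP;
}

/* Default registered first (bottom of the stack), resilient one on top. */
static int example_usage(void)
{
    int rc = register_fault_callback(default_abort_callback);
    assert(rc == 0);
    rc = register_fault_callback(resilient_callback);
    assert(rc == 0);
    (void)rc;
    return dispatch_fault(3);
}
```

With this arrangement the default abort callback stays registered at the bottom of the stack but is never reached as long as the callback above it returns FAULT_CB_STOP.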
On Jun 10, 2011, at 2:06 PM, Josh Hursey wrote:
> Yeah I do not want the default fatal callback in OMPI. I want to
> replace it with something that allows OMPI to continue running when
> there are process failures (if the error handlers associated with the
> communicators permit such an action). So having the default fatal
> callback called after mine would not be useful, since I do not want
> the fatal action.
>
> As long as I can replace that callback, or selectively get rid of it
> then I'm ok.
>
>
> On Fri, Jun 10, 2011 at 3:55 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> On Jun 10, 2011, at 6:32 AM, Josh Hursey wrote:
>>
>>> On Fri, Jun 10, 2011 at 7:37 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>
>>>> On Jun 9, 2011, at 6:12 PM, Joshua Hursey wrote:
>>>>
>>>>>
>>>>> On Jun 9, 2011, at 6:47 PM, George Bosilca wrote:
>>>>>
>>>>>> Well, you're way too trusty. ;)
>>>>>
>>>>> It's the midwestern boy in me :)
>>>>
>>>> Still need to shake that corn out of your head... :-)
>>>>
>>>>>
>>>>>>
>>>>>> This only works if all components play the game, and even then it
>>>>>> is difficult if you want to allow components to deregister themselves in
>>>>>> the middle of the execution. The problem is that a callback will be
>>>>>> previous for some component, and that when you want to remove a callback
>>>>>> you have to inform the "next" component on the callback chain to change
>>>>>> its previous.
>>>>>
>>>>> This is a fair point. I think hiding the ordering of callbacks in the
>>>>> errmgr could be dangerous since it takes control from the upper layers,
>>>>> but, conversely, trusting the upper layers to 'do the right thing' with
>>>>> the previous callback is probably too optimistic, esp. for layers that
>>>>> are not designed together.
>>>>>
>>>>> To that I would suggest that you leave the code as is - registering a
>>>>> callback overwrites the existing callback.
>>>>> That will allow me to replace
>>>>> the default OMPI callback when I am able to in MPI_Init, and, if I need
>>>>> to, swap back in the default version at MPI_Finalize.
>>>>>
>>>>> Does that sound like a reasonable way forward on this design point?
>>>>
>>>> It doesn't solve the problem that George alluded to - just because you
>>>> overwrite the callback, it doesn't mean that someone else won't overwrite
>>>> you when their component initializes. Only the last one wins - the rest of
>>>> you lose.
>>>>
>>>> I'm not sure how you guarantee that you win, which is why I'm unclear how
>>>> this callback can really work unless everyone agrees that only one place
>>>> gets it. Put that callback in a base function of a new error handling
>>>> framework, and then let everyone create components within that for
>>>> handling desired error responses?
>>>
>>> Yep, that is a problem, but one that we can deal with in the immediate
>>> case. Since OMPI is the only layer registering the callback, when I
>>> replace it in OMPI I will have to make sure that no other place in
>>> OMPI replaces the callback.
>>>
>>> If at some point we need more than one callback above ORTE then we may
>>> want to revisit this point. But since we only have one layer on top of
>>> ORTE, it is the responsibility of that layer to be internally
>>> consistent with regard to which callback it wants to be triggered.
>>>
>>> If the layers above ORTE want more than one callback I would suggest
>>> that that layer design some mechanism for coordinating these multiple
>>> - possibly conflicting - callbacks (by the way this is policy
>>> management, which can get complex fast as you add more interested
>>> parties). Meaning that if OMPI wanted multiple callbacks to be active
>>> at the same time, then OMPI would create a mechanism for managing
>>> these callbacks, not ORTE. ORTE should just have one callback provided
>>> to the upper layer, and keep it -simple-.
>>> If the upper layer wants to
>>> toy around with something more complex it must manage the complexity
>>> instead of artificially pushing it down to the ORTE layer.
>>
>> I was thinking some more about this, and wonder if we aren't
>> over-complicating the question.
>>
>> Do you need to actually control the sequence of callbacks, or just ensure
>> that your callback gets called prior to the default one that calls abort?
>>
>> Meeting the latter requirement is trivial - subsequent calls to
>> register_callback get pushed onto the top of the callback list. Since the
>> default one always gets registered first (which we can ensure since it
>> occurs in MPI_Init), it will always be at the bottom of the callback list
>> and hence called last.
>>
>> Keeping that list in ORTE is simple and probably the right place to do it.
>>
>> However, if you truly want to control the callback order in detail - then
>> yeah, that should go up in OMPI. I sure don't want to write all that code
>> :-)
>>
>>
>>>
>>> -- Josh
>>>
>>>>>
>>>>> -- Josh
>>>>>
>>>>>>
>>>>>> george.
>>>>>>
>>>>>> On Jun 9, 2011, at 13:21 , Josh Hursey wrote:
>>>>>>
>>>>>>> So the "Resilient ORTE" patch has a registration in ompi_mpi_init.c:
>>>>>>> -------------
>>>>>>> orte_errmgr.set_fault_callback(&ompi_errhandler_runtime_callback);
>>>>>>> -------------
>>>>>>>
>>>>>>> Which is a callback that just calls abort (which is what we want to do
>>>>>>> by default):
>>>>>>> -------------
>>>>>>> void ompi_errhandler_runtime_callback(orte_process_name_t *proc) {
>>>>>>>     ompi_mpi_abort(MPI_COMM_WORLD, 1, false);
>>>>>>> }
>>>>>>> -------------
>>>>>>>
>>>>>>> This is what I want to replace. I do -not- want ompi to abort just
>>>>>>> because a process failed. So I need a way to replace or remove this
>>>>>>> callback, and put in my own callback that 'does the right thing'.
>>>>>>>
>>>>>>> The current patch allows me to overwrite the callback when I call:
>>>>>>> -------------
>>>>>>> orte_errmgr.set_fault_callback(&my_callback);
>>>>>>> -------------
>>>>>>> Which is fine with me.
>>>>>>>
>>>>>>> At the point I do not want my_callback to be active any more (say in
>>>>>>> MPI_Finalize) I would like to replace it with the old callback. To do
>>>>>>> so, with the patch's interface, I would have to know what the previous
>>>>>>> callback was and do:
>>>>>>> -------------
>>>>>>> orte_errmgr.set_fault_callback(&ompi_errhandler_runtime_callback);
>>>>>>> -------------
>>>>>>>
>>>>>>> This comes at a slight maintenance burden since now there will be two
>>>>>>> places in the code that must explicitly reference
>>>>>>> 'ompi_errhandler_runtime_callback' - if it ever changed then both
>>>>>>> sites would have to be updated.
>>>>>>>
>>>>>>>
>>>>>>> If you use the 'sigaction-like' interface then upon registration I
>>>>>>> would get the previous handler back (which would point to
>>>>>>> 'ompi_errhandler_runtime_callback'), and I can store it for later:
>>>>>>> -------------
>>>>>>> orte_errmgr.set_fault_callback(&my_callback, prev_callback);
>>>>>>> -------------
>>>>>>>
>>>>>>> And when it comes time to deregister my callback all I need to do is
>>>>>>> replace it with the previous callback - which I have a reference to,
>>>>>>> but do not need the explicit name of (passing NULL as the second
>>>>>>> argument tells the registration function that I don't care about the
>>>>>>> current callback):
>>>>>>> -------------
>>>>>>> orte_errmgr.set_fault_callback(&prev_callback, NULL);
>>>>>>> -------------
>>>>>>>
>>>>>>>
>>>>>>> So the API in the patch is fine, and I can work with it. I just
>>>>>>> suggested that it might be slightly better to return the previous
>>>>>>> callback (as is done in other standard interfaces - e.g., sigaction)
>>>>>>> in case we wanted to do something with it later.
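For readers following the 'sigaction-like' proposal, a minimal sketch of such a setter might look like the following. It is only an illustration under stated assumptions: the type fault_cb_t, the single-slot variable current_cb, and the int-rank callback signature are all hypothetical, and the previous callback is handed back through an out-parameter (a pointer to a function pointer), which is one possible reading of the calls quoted above rather than the interface in the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback type; the callback in the thread takes an
 * orte_process_name_t pointer instead of an int rank. */
typedef void (*fault_cb_t)(int failed_rank);

/* Single slot: registering a callback overwrites the existing one. */
static fault_cb_t current_cb = NULL;

/* Install new_cb. If prev is non-NULL, the callback being replaced is
 * stored there, in the style of sigaction's oldact argument. */
static void set_fault_callback(fault_cb_t new_cb, fault_cb_t *prev)
{
    if (prev != NULL) {
        *prev = current_cb;
    }
    current_cb = new_cb;
}

/* Invoke whichever callback is currently installed, if any. */
static void notify_fault(int failed_rank)
{
    if (current_cb != NULL) {
        current_cb(failed_rank);
    }
}

static int aborts = 0;
static int recoveries = 0;

/* Stand-ins for the default (abort) callback and a replacement. */
static void default_cb(int failed_rank) { (void)failed_rank; aborts++; }
static void my_cb(int failed_rank) { (void)failed_rank; recoveries++; }

/* Replace the default at "init", restore it at "finalize" without
 * naming default_cb a second time. */
static void example_usage(void)
{
    fault_cb_t saved = NULL;

    set_fault_callback(default_cb, NULL);   /* default registration */
    set_fault_callback(my_cb, &saved);      /* swap in, remember previous */
    assert(saved == default_cb);
    notify_fault(1);                        /* handled by my_cb */

    set_fault_callback(saved, NULL);        /* restore previous */
    notify_fault(2);                        /* handled by default_cb */
}
```

The point of the out-parameter is the one made in the message: the restoring code only needs the saved pointer, so the name of the default callback appears in one place.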
>>>>>>>
>>>>>>>
>>>>>>> What seems to be proposed now is making the errmgr keep a list of all
>>>>>>> registered callbacks and call them in some order. This seems odd, and
>>>>>>> definitely more complex. Maybe it was just not well explained.
>>>>>>>
>>>>>>> Maybe that is just the "computer scientist" in me :)
>>>>>>>
>>>>>>> -- Josh
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jun 9, 2011 at 1:05 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>> You mean you want the abort API to point somewhere else, without using
>>>>>>>> a new component?
>>>>>>>> Perhaps a telecon would help resolve this quicker? I'm available
>>>>>>>> tomorrow or anytime next week, if that helps.
>>>>>>>>
>>>>>>>> On Thu, Jun 9, 2011 at 11:02 AM, Josh Hursey <jjhur...@open-mpi.org>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> As long as there is the ability to remove and replace a callback I'm
>>>>>>>>> fine. I personally think that forcing the errmgr to track ordering of
>>>>>>>>> callback registration makes it a more complex solution, but as long as
>>>>>>>>> it works.
>>>>>>>>>
>>>>>>>>> In particular I need to replace the default 'abort' errmgr call in
>>>>>>>>> OMPI with something else. If both are called, then this does not help
>>>>>>>>> me at all - since the abort behavior will be activated either before
>>>>>>>>> or after my callback. So can you explain how I would do that with the
>>>>>>>>> current or the proposed interface?
>>>>>>>>>
>>>>>>>>> -- Josh
>>>>>>>>>
>>>>>>>>> On Thu, Jun 9, 2011 at 12:54 PM, Ralph Castain <r...@open-mpi.org>
>>>>>>>>> wrote:
>>>>>>>>>> I agree - let's not get overly complex unless we can clearly
>>>>>>>>>> articulate a requirement to do so.
>>>>>>>>>>
>>>>>>>>>> On Thu, Jun 9, 2011 at 10:45 AM, George Bosilca
>>>>>>>>>> <bosi...@eecs.utk.edu> wrote:
>>>>>>>>>>>
>>>>>>>>>>> This will require exactly opposite registration and de-registration
>>>>>>>>>>> order, or no de-registration at all (aka no way to unload a
>>>>>>>>>>> component). Or some even more complex code to deal with internally.
>>>>>>>>>>>
>>>>>>>>>>> If the error manager handles the callbacks it can use the
>>>>>>>>>>> registration ordering (which will be what the approach can do),
>>>>>>>>>>> and can enforce that all callbacks will be called. I would rather
>>>>>>>>>>> prefer this approach.
>>>>>>>>>>>
>>>>>>>>>>> george.
>>>>>>>>>>>
>>>>>>>>>>> On Jun 9, 2011, at 08:36 , Josh Hursey wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I would prefer returning the previous callback instead of relying
>>>>>>>>>>>> on the errmgr to get the ordering right. Additionally, when I want
>>>>>>>>>>>> to unregister (or replace) a callback it is easier to do that with
>>>>>>>>>>>> a single interface than by introducing a new one to remove a
>>>>>>>>>>>> particular callback.
>>>>>>>>>>>> Register:
>>>>>>>>>>>> ompi_errmgr.set_fault_callback(my_callback, prev_callback);
>>>>>>>>>>>> Deregister:
>>>>>>>>>>>> ompi_errmgr.set_fault_callback(prev_callback, old_callback);
>>>>>>>>>>>> or to eliminate all callbacks (if you needed that for some
>>>>>>>>>>>> reason):
>>>>>>>>>>>> ompi_errmgr.set_fault_callback(NULL, old_callback);
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> devel mailing list
>>>>>>>>>>> de...@open-mpi.org
>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Joshua Hursey
>>>>>>>>> Postdoctoral Research Associate
>>>>>>>>> Oak Ridge National Laboratory
>>>>>>>>> http://users.nccs.gov/~jjhursey