So why not have the callback return an int, and have your callback return
"go no further"?
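
To make that concrete, here is a minimal sketch in C, assuming the callbacks
are kept in a LIFO list below OMPI. Every name in it (fault_cb_t,
register_fault_cb, dispatch_fault, the FAULT_CB_* codes, and a plain int rank
in place of orte_process_name_t) is hypothetical and not part of the current
patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical names throughout; this is not the ORTE errmgr API. */
#define FAULT_CB_CONTINUE 0   /* let the next (older) callback run */
#define FAULT_CB_STOP     1   /* "go no further" */
#define MAX_FAULT_CBS     8

typedef int (*fault_cb_t)(int failed_rank);

static fault_cb_t cb_stack[MAX_FAULT_CBS];
static size_t cb_count = 0;

/* Push a callback on top of the list; the newest one is called first. */
static int register_fault_cb(fault_cb_t cb)
{
    assert(cb != NULL);
    if (cb_count == MAX_FAULT_CBS) {
        return -1;                      /* list is full */
    }
    cb_stack[cb_count++] = cb;
    return 0;
}

/* Walk from newest to oldest and stop as soon as a callback returns
 * FAULT_CB_STOP. Returns how many callbacks were invoked. */
static size_t dispatch_fault(int failed_rank)
{
    size_t called = 0;
    for (size_t i = cb_count; i > 0; --i) {
        called++;
        if (cb_stack[i - 1](failed_rank) == FAULT_CB_STOP) {
            break;
        }
    }
    return called;
}

static int default_cb_calls = 0;

/* Stand-in for the default handler; the real one would call abort. */
static int default_abort_cb(int failed_rank)
{
    default_cb_calls++;
    printf("default: would abort after failure of rank %d\n", failed_rank);
    return FAULT_CB_CONTINUE;
}

/* Stand-in for a fault-tolerant handler that suppresses the default. */
static int resilient_cb(int failed_rank)
{
    printf("resilient: handling failure of rank %d\n", failed_rank);
    return FAULT_CB_STOP;
}
```

If default_abort_cb is registered first (as in MPI_Init) and resilient_cb
later, a dispatch reaches resilient_cb first, and its FAULT_CB_STOP return
keeps the default abort from ever being called.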


On Jun 10, 2011, at 2:06 PM, Josh Hursey wrote:

> Yeah I do not want the default fatal callback in OMPI. I want to
> replace it with something that allows OMPI to continue running when
> there are process failures (if the error handlers associated with the
> communicators permit such an action). So having the default fatal
> callback called after mine would not be useful, since I do not want
> the fatal action.
> 
> As long as I can replace that callback, or selectively get rid of it
> then I'm ok.
> 
> 
> On Fri, Jun 10, 2011 at 3:55 PM, Ralph Castain <r...@open-mpi.org> wrote:
>> 
>> On Jun 10, 2011, at 6:32 AM, Josh Hursey wrote:
>> 
>>> On Fri, Jun 10, 2011 at 7:37 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>> 
>>>> On Jun 9, 2011, at 6:12 PM, Joshua Hursey wrote:
>>>> 
>>>>> 
>>>>> On Jun 9, 2011, at 6:47 PM, George Bosilca wrote:
>>>>> 
>>>>>> Well, you're way too trusting. ;)
>>>>> 
>>>>> It's the midwestern boy in me :)
>>>> 
>>>> Still need to shake that corn out of your head... :-)
>>>> 
>>>>> 
>>>>>> 
>>>>>> This only works if all components play the game, and even then it is 
>>>>>> difficult if you want to allow components to deregister themselves in 
>>>>>> the middle of the execution. The problem is that each callback is the 
>>>>>> "previous" callback for some other component, so when you want to remove 
>>>>>> a callback you have to inform the "next" component on the callback chain 
>>>>>> to change its previous.
>>>>> 
>>>>> This is a fair point. I think hiding the ordering of callbacks in the 
>>>>> errmgr could be dangerous since it takes control from the upper layers, 
>>>>> but, conversely, trusting the upper layers to 'do the right thing' with 
>>>>> the previous callback is probably too optimistic, esp. for layers that 
>>>>> are not designed together.
>>>>> 
>>>>> To that I would suggest that you leave the code as is - registering a 
>>>>> callback overwrites the existing callback. That will allow me to replace 
>>>>> the default OMPI callback when I am able to in MPI_Init, and, if I need 
>>>>> to, swap back in the default version at MPI_Finalize.
>>>>> 
>>>>> Does that sound like a reasonable way forward on this design point?
>>>> 
>>>> It doesn't solve the problem that George alluded to - just because you 
>>>> overwrite the callback, it doesn't mean that someone else won't overwrite 
>>>> you when their component initializes. Only the last one wins - the rest of 
>>>> you lose.
>>>> 
>>>> I'm not sure how you guarantee that you win, which is why I'm unclear how 
>>>> this callback can really work unless everyone agrees that only one place 
>>>> gets it. Put that callback in a base function of a new error handling 
>>>> framework, and then let everyone create components within that for 
>>>> handling desired error responses?
>>> 
>>> Yep, that is a problem, but one that we can deal with in the immediate
>>> case. Since OMPI is the only layer registering the callback, when I
>>> replace it in OMPI I will have to make sure that no other place in
>>> OMPI replaces the callback.
>>> 
>>> If at some point we need more than one callback above ORTE then we may
>>> want to revisit this point. But since we only have one layer on top of
>>> ORTE, it is the responsibility of that layer to be internally
>>> consistent with regard to which callback it wants to be triggered.
>>> 
>>> If the layers above ORTE want more than one callback I would suggest
>>> that that layer design some mechanism for coordinating these multiple
>>> - possibly conflicting - callbacks (by the way this is policy
>>> management, which can get complex fast as you add more interested
>>> parties). Meaning that if OMPI wanted multiple callbacks to be active
>>> at the same time, then OMPI would create a mechanism for managing
>>> these callbacks, not ORTE. ORTE should just have one callback provided
>>> to the upper layer, and keep it -simple-. If the upper layer wants to
>>> toy around with something more complex it must manage the complexity
>>> instead of artificially pushing it down to the ORTE layer.
>> 
>> I was thinking some more about this, and wonder if we aren't 
>> over-complicating the question.
>> 
>> Do you need to actually control the sequence of callbacks, or just ensure 
>> that your callback gets called prior to the default one that calls abort?
>> 
>> Meeting the latter requirement is trivial - subsequent calls to 
>> register_callback get pushed onto the top of the callback list. Since the 
>> default one always gets registered first (which we can ensure since it 
>> occurs in MPI_Init), it will always be at the bottom of the callback list 
>> and hence called last.
>> 
>> Keeping that list in ORTE is simple and probably the right place to do it.
>> 
>> However, if you truly want to control the callback order in detail - then 
>> yeah, that should go up in OMPI. I sure don't want to write all that code 
>> :-)
>> 
>> 
>>> 
>>> -- Josh
>>> 
>>>>> 
>>>>> -- Josh
>>>>> 
>>>>>> 
>>>>>> george.
>>>>>> 
>>>>>> On Jun 9, 2011, at 13:21 , Josh Hursey wrote:
>>>>>> 
>>>>>>> So the "Resilient ORTE" patch has a registration in ompi_mpi_init.c:
>>>>>>> -------------
>>>>>>> orte_errmgr.set_fault_callback(&ompi_errhandler_runtime_callback);
>>>>>>> -------------
>>>>>>> 
>>>>>>> Which is a callback that just calls abort (which is what we want to do
>>>>>>> by default):
>>>>>>> -------------
>>>>>>> void ompi_errhandler_runtime_callback(orte_process_name_t *proc) {
>>>>>>>  ompi_mpi_abort(MPI_COMM_WORLD, 1, false);
>>>>>>> }
>>>>>>> -------------
>>>>>>> 
>>>>>>> This is what I want to replace. I do -not- want ompi to abort just
>>>>>>> because a process failed. So I need a way to replace or remove this
>>>>>>> callback, and put in my own callback that 'does the right thing'.
>>>>>>> 
>>>>>>> The current patch allows me to overwrite the callback when I call:
>>>>>>> -------------
>>>>>>> orte_errmgr.set_fault_callback(&my_callback);
>>>>>>> -------------
>>>>>>> Which is fine with me.
>>>>>>> 
>>>>>>> At the point I do not want my_callback to be active any more (say in
>>>>>>> MPI_Finalize) I would like to replace it with the old callback. To do
>>>>>>> so, with the patch's interface, I would have to know what the previous
>>>>>>> callback was and do:
>>>>>>> -------------
>>>>>>> orte_errmgr.set_fault_callback(&ompi_errhandler_runtime_callback);
>>>>>>> -------------
>>>>>>> 
>>>>>>> This comes at a slight maintenance burden since now there will be two
>>>>>>> places in the code that must explicitly reference
>>>>>>> 'ompi_errhandler_runtime_callback' - if it ever changed then both
>>>>>>> sites would have to be updated.
>>>>>>> 
>>>>>>> 
>>>>>>> If you use the 'sigaction-like' interface then upon registration I
>>>>>>> would get the previous handler back (which would point to
>>>>>>> 'ompi_errhandler_runtime_callback'), and I can store it for later:
>>>>>>> -------------
>>>>>>> orte_errmgr.set_fault_callback(&my_callback, &prev_callback);
>>>>>>> -------------
>>>>>>> 
>>>>>>> And when it comes time to deregister my callback all I need to do is
>>>>>>> replace it with the previous callback - which I have a reference to,
>>>>>>> but do not need the explicit name of (passing NULL as the second
>>>>>>> argument tells the registration function that I don't care about the
>>>>>>> current callback):
>>>>>>> -------------
>>>>>>> orte_errmgr.set_fault_callback(prev_callback, NULL);
>>>>>>> -------------
>>>>>>> 
>>>>>>> 
>>>>>>> So the API in the patch is fine, and I can work with it. I just
>>>>>>> suggested that it might be slightly better to return the previous
>>>>>>> callback (as is done in other standard interfaces - e.g., sigaction)
>>>>>>> in case we wanted to do something with it later.
>>>>>>> 
>>>>>>> 
>>>>>>> What seems to be proposed now is making the errmgr keep a list of all
>>>>>>> registered callbacks and call them in some order. This seems odd, and
>>>>>>> definitely more complex. Maybe it was just not well explained.
>>>>>>> 
>>>>>>> Maybe that is just the "computer scientist" in me :)
>>>>>>> 
>>>>>>> -- Josh
>>>>>>> 
>>>>>>> 
>>>>>>> On Thu, Jun 9, 2011 at 1:05 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>> You mean you want the abort API to point somewhere else, without using
>>>>>>>> a new component?
>>>>>>>> Perhaps a telecon would help resolve this quicker? I'm available
>>>>>>>> tomorrow or anytime next week, if that helps.
>>>>>>>> 
>>>>>>>> On Thu, Jun 9, 2011 at 11:02 AM, Josh Hursey <jjhur...@open-mpi.org> wrote:
>>>>>>>>> 
>>>>>>>>> As long as there is the ability to remove and replace a callback I'm
>>>>>>>>> fine. I personally think that forcing the errmgr to track ordering of
>>>>>>>>> callback registration makes it a more complex solution, but as long as
>>>>>>>>> it works.
>>>>>>>>> 
>>>>>>>>> In particular I need to replace the default 'abort' errmgr call in
>>>>>>>>> OMPI with something else. If both are called, then this does not help
>>>>>>>>> me at all - since the abort behavior will be activated either before
>>>>>>>>> or after my callback. So can you explain how I would do that with the
>>>>>>>>> current or the proposed interface?
>>>>>>>>> 
>>>>>>>>> -- Josh
>>>>>>>>> 
>>>>>>>>> On Thu, Jun 9, 2011 at 12:54 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>>> I agree - let's not get overly complex unless we can clearly
>>>>>>>>>> articulate a requirement to do so.
>>>>>>>>>> 
>>>>>>>>>> On Thu, Jun 9, 2011 at 10:45 AM, George Bosilca <bosi...@eecs.utk.edu> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> This will require exactly opposite registration and de-registration
>>>>>>>>>>> order, or no de-registration at all (aka no way to unload a
>>>>>>>>>>> component). Or some even more complex code to deal with internally.
>>>>>>>>>>> 
>>>>>>>>>>> If the error manager handles the callbacks, it can use the
>>>>>>>>>>> registration ordering (which will be what the approach can do), and
>>>>>>>>>>> can enforce that all callbacks will be called. I would prefer this
>>>>>>>>>>> approach.
>>>>>>>>>>> 
>>>>>>>>>>> george.
>>>>>>>>>>> 
>>>>>>>>>>> On Jun 9, 2011, at 08:36 , Josh Hursey wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> I would prefer returning the previous callback instead of relying
>>>>>>>>>>>> on the errmgr to get the ordering right. Additionally, when I want
>>>>>>>>>>>> to unregister (or replace) a callback, it is easier to do that with
>>>>>>>>>>>> a single interface than to introduce a new one to remove a
>>>>>>>>>>>> particular callback.
>>>>>>>>>>>> Register:
>>>>>>>>>>>> ompi_errmgr.set_fault_callback(my_callback, prev_callback);
>>>>>>>>>>>> Deregister:
>>>>>>>>>>>> ompi_errmgr.set_fault_callback(prev_callback, old_callback);
>>>>>>>>>>>> or to eliminate all callbacks (if you needed that for some reason):
>>>>>>>>>>>> ompi_errmgr.set_fault_callback(NULL, old_callback);
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> devel mailing list
>>>>>>>>>>> de...@open-mpi.org
>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> Joshua Hursey
>>>>>>>>> Postdoctoral Research Associate
>>>>>>>>> Oak Ridge National Laboratory
>>>>>>>>> http://users.nccs.gov/~jjhursey

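For comparison, the 'sigaction-like' registration discussed earlier in the
thread could look roughly like the sketch below. It is an illustration only:
set_fault_callback here is a free function with a made-up signature, not the
orte_errmgr module entry from the patch, and fault_handler_t, report_fault,
and the int rank argument are likewise hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical handler type; the patch passes an orte_process_name_t *. */
typedef void (*fault_handler_t)(int failed_rank);

/* The single active callback; registering a new one overwrites it. */
static fault_handler_t current_cb = NULL;

/* Install new_cb. If prev is non-NULL, the callback being replaced is
 * stored there so the caller can restore it later, as with sigaction. */
static void set_fault_callback(fault_handler_t new_cb, fault_handler_t *prev)
{
    if (prev != NULL) {
        *prev = current_cb;
    }
    current_cb = new_cb;
}

/* Invoke whichever callback is currently installed, if any. */
static void report_fault(int failed_rank)
{
    if (current_cb != NULL) {
        current_cb(failed_rank);
    }
}

static int last_handler = 0;    /* 1 = default ran, 2 = replacement ran */

/* Stand-in for the default handler; the real one would call abort. */
static void default_callback(int failed_rank)
{
    (void)failed_rank;
    last_handler = 1;
}

/* Stand-in for a handler that lets the job keep running. */
static void my_callback(int failed_rank)
{
    (void)failed_rank;
    last_handler = 2;
}
```

The caller saves the old handler when it registers (passing &prev) and hands
it back when it deregisters (passing prev and NULL), so it never needs to
know the default callback by name.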
