Re: [OMPI devel] RFC: Resilient ORTE

Ralph Castain Fri, 10 Jun 2011 10:18:42 -0400

On Jun 10, 2011, at 7:01 AM, Josh Hursey wrote:

> On Fri, Jun 10, 2011 at 8:51 AM, Ralph Castain <[email protected]> wrote:
>> 
>> On Jun 10, 2011, at 6:38 AM, Josh Hursey wrote:
>> 
>>> Another problem with this patch, that I mentioned to Wesley and George
>>> off list, is that it does not handle the case when mpirun/HNP is also
>>> hosting processes that might fail. In my testing of the patch it
>>> worked fine if mpirun/HNP was -not- hosting any processes, but once it
>>> had to host processes then unexpected behavior occurred when a process
>>> failed. So for those just listening to this thread, Wesley is working
>>> on a revised patch to address this problem that he will post when it
>>> is ready.
>> 
>> See my other response to the patch - I think we need to understand why we 
>> are storing state in multiple places as it can create unexpected behavior 
>> when things are out-of-sync.
>> 
>> 
>>> 
>>> 
>>> As far as the RML issue, doesn't the ORTE state machine branch handle
>>> that case? If it does, then let's push the solution to that problem
>>> until that branch comes around instead of solving it twice.
>> 
>> No, it doesn't - in fact, it's what breaks the current method. Because we no 
>> longer allow event recursion, the RML message never gets out of the app. 
>> Hence my question.
>> 
>> I honestly don't think we need to have orte be aware of the distinction 
>> between "aborted by cmd" and "aborted by signal" as the only diff is in the 
>> error message. There ought to be some other way of resolving this?
> 
> MPI_Abort will need to tell ORTE which processes should be 'aborted by
> signal' along with the calling process. So there needs to be a
> mechanism for that as well. Not sure if I have a good solution to
> this in mind just yet.


Ah yes - that would require a communication anyway.

> 
> A thought though, in the state machine version, the process calling
> MPI_Abort could post a message to the processing thread and return
> from the callback. The callback would have a check at the bottom to
> determine if MPI_Abort was triggered within the callback, and just
> sleep. The processing thread would progress the RML message and once
> finished call exit(). This implies that the application process has a
> separate processing thread. But I think we might be able to post the
> RML message in the callback, then wait for it to complete outside of
> the callback before returning control to the user. :/ Interesting.

Could work, though it does require a thread. You would have to be tricky about 
it, though, as it is possible the call to "abort" could occur in an event 
handler. If you block in that handler waiting for the message to have been 
sent, it never will leave as the RML uses the event lib to trigger the actual 
send.

I may have a solution to the latter problem. For similar reasons, I've had to 
change the errmgr so it doesn't immediately process errors - otherwise, its 
actions become constrained by the question of "am I in an event handler or 
not". To remove the uncertainty, I'm rigging it so that all errmgr processing 
is done in an event - basically, reporting an error causes the errmgr to push 
the error into a pipe, that triggers an event which actually processes it.

Only way I could deal with the uncertainty. So if that mechanism is in place, 
the only thing you would have to do is (a) call abort, and then (b) cycle 
opal_progress until the errmgr.abort function callback occurred. Of course, we 
would then have to modify the errmgr so that abort took a callback function 
that it called when the app is free to exit.

<shrug> no perfect solution, I fear.



> 
> -- Josh
> 
>> 
>> 
>>> 
>>> -- Josh
>>> 
>>> 
>>> On Fri, Jun 10, 2011 at 8:22 AM, Ralph Castain <[email protected]> wrote:
>>>> Something else you might want to address in here: the current code sends 
>>>> an RML message from the proc calling abort to its local daemon telling the 
>>>> daemon that we are exiting due to the app calling "abort". We needed to do 
>>>> this because we wanted to flag the proc termination as one induced by the 
>>>> app itself as opposed to something like a segfault or termination by 
>>>> signal.
>>>> 
>>>> However, the problem is that the app may be calling abort from within an 
>>>> event handler. Hence, the RML send (which is currently blocking) will 
>>>> never complete once we no longer allow event lib recursion (coming soon). 
>>>> If we use a non-blocking send, then we can't know for sure that the 
>>>> message has been sent before we terminate.
>>>> 
>>>> What we need is a non-messaging way of communicating that this was an 
>>>> ordered abort as opposed to a segfault or other failure. Prior to the 
>>>> current method, we had the app drop a file that the daemon looked for as 
>>>> an "abort marker", but that was ugly as it sometimes caused us to not 
>>>> properly clean up the session directory tree.
>>>> 
>>>> I'm open to suggestion - perhaps it isn't actually all that critical for 
>>>> us to distinguish "aborted by call to abort" from "aborted by signal", and 
>>>> we can just have the app commit suicide via self-imposed SIGKILL? It is 
>>>> only the message output to the user at the end of the job that differs - 
>>>> and since MPI_Abort already provides a message indicating "we called 
>>>> abort", is it really necessary that we have orte aware of that distinction?
>>>> 
>>>> 
>>>> On Jun 9, 2011, at 6:12 PM, Joshua Hursey wrote:
>>>> 
>>>>> 
>>>>> On Jun 9, 2011, at 6:47 PM, George Bosilca wrote:
>>>>> 
>>>>>> Well, you're way too trusting. ;)
>>>>> 
>>>>> It's the midwestern boy in me :)
>>>>> 
>>>>>> 
>>>>>> This only works if all components play the game, and even then it 
>>>>>> is difficult if you want to allow components to deregister themselves in 
>>>>>> the middle of the execution. The problem is that a callback will be the 
>>>>>> "previous" for some component, and when you want to remove a callback 
>>>>>> you have to inform the "next" component on the callback chain to change 
>>>>>> its previous.
>>>>> 
>>>>> This is a fair point. I think hiding the ordering of callbacks in the 
>>>>> errmgr could be dangerous since it takes control from the upper layers, 
>>>>> but, conversely, trusting the upper layers to 'do the right thing' with 
>>>>> the previous callback is probably too optimistic, esp. for layers that 
>>>>> are not designed together.
>>>>> 
>>>>> To that I would suggest that you leave the code as is - registering a 
>>>>> callback overwrites the existing callback. That will allow me to replace 
>>>>> the default OMPI callback when I am able to in MPI_Init, and, if I need 
>>>>> to, swap back in the default version at MPI_Finalize.
>>>>> 
>>>>> Does that sound like a reasonable way forward on this design point?
>>>>> 
>>>>> -- Josh
>>>>> 
>>>>>> 
>>>>>> george.
>>>>>> 
>>>>>> On Jun 9, 2011, at 13:21 , Josh Hursey wrote:
>>>>>> 
>>>>>>> So the "Resilient ORTE" patch has a registration in ompi_mpi_init.c:
>>>>>>> -------------
>>>>>>> orte_errmgr.set_fault_callback(&ompi_errhandler_runtime_callback);
>>>>>>> -------------
>>>>>>> 
>>>>>>> Which is a callback that just calls abort (which is what we want to do
>>>>>>> by default):
>>>>>>> -------------
>>>>>>> void ompi_errhandler_runtime_callback(orte_process_name_t *proc) {
>>>>>>>  ompi_mpi_abort(MPI_COMM_WORLD, 1, false);
>>>>>>> }
>>>>>>> -------------
>>>>>>> 
>>>>>>> This is what I want to replace. I do -not- want ompi to abort just
>>>>>>> because a process failed. So I need a way to replace or remove this
>>>>>>> callback, and put in my own callback that 'does the right thing'.
>>>>>>> 
>>>>>>> The current patch allows me to overwrite the callback when I call:
>>>>>>> -------------
>>>>>>> orte_errmgr.set_fault_callback(&my_callback);
>>>>>>> -------------
>>>>>>> Which is fine with me.
>>>>>>> 
>>>>>>> At the point I do not want my_callback to be active any more (say in
>>>>>>> MPI_Finalize) I would like to replace it with the old callback. To do
>>>>>>> so, with the patch's interface, I would have to know what the previous
>>>>>>> callback was and do:
>>>>>>> -------------
>>>>>>> orte_errmgr.set_fault_callback(&ompi_errhandler_runtime_callback);
>>>>>>> -------------
>>>>>>> 
>>>>>>> This comes at a slight maintenance burden since now there will be two
>>>>>>> places in the code that must explicitly reference
>>>>>>> 'ompi_errhandler_runtime_callback' - if it ever changed then both
>>>>>>> sites would have to be updated.
>>>>>>> 
>>>>>>> 
>>>>>>> If you use the 'sigaction-like' interface then upon registration I
>>>>>>> would get the previous handler back (which would point to
>>>>>>> 'ompi_errhandler_runtime_callback'), and I can store it for later:
>>>>>>> -------------
>>>>>>> orte_errmgr.set_fault_callback(&my_callback, &prev_callback);
>>>>>>> -------------
>>>>>>> 
>>>>>>> And when it comes time to deregister my callback all I need to do is
>>>>>>> replace it with the previous callback - which I have a reference to,
>>>>>>> but do not need the explicit name of (passing NULL as the second
>>>>>>> argument tells the registration function that I don't care about the
>>>>>>> current callback):
>>>>>>> -------------
>>>>>>> orte_errmgr.set_fault_callback(prev_callback, NULL);
>>>>>>> -------------
>>>>>>> 
>>>>>>> 
>>>>>>> So the API in the patch is fine, and I can work with it. I just
>>>>>>> suggested that it might be slightly better to return the previous
>>>>>>> callback (as is done in other standard interfaces - e.g., sigaction)
>>>>>>> in case we wanted to do something with it later.
>>>>>>> 
>>>>>>> 
>>>>>>> What seems to be proposed now is making the errmgr keep a list of all
>>>>>>> registered callbacks and call them in some order. This seems odd, and
>>>>>>> definitely more complex. Maybe it was just not well explained.
>>>>>>> 
>>>>>>> Maybe that is just the "computer scientist" in me :)
>>>>>>> 
>>>>>>> -- Josh
>>>>>>> 
>>>>>>> 
>>>>>>> On Thu, Jun 9, 2011 at 1:05 PM, Ralph Castain <[email protected]> wrote:
>>>>>>>> You mean you want the abort API to point somewhere else, without using
>>>>>>>> a new component?
>>>>>>>> Perhaps a telecon would help resolve this quicker? I'm available
>>>>>>>> tomorrow or anytime next week, if that helps.
>>>>>>>> 
>>>>>>>> On Thu, Jun 9, 2011 at 11:02 AM, Josh Hursey <[email protected]> wrote:
>>>>>>>>> 
>>>>>>>>> As long as there is the ability to remove and replace a callback I'm
>>>>>>>>> fine. I personally think that forcing the errmgr to track ordering of
>>>>>>>>> callback registration makes it a more complex solution, but as long as
>>>>>>>>> it works.
>>>>>>>>> 
>>>>>>>>> In particular I need to replace the default 'abort' errmgr call in
>>>>>>>>> OMPI with something else. If both are called, then this does not help
>>>>>>>>> me at all - since the abort behavior will be activated either before
>>>>>>>>> or after my callback. So can you explain how I would do that with the
>>>>>>>>> current or the proposed interface?
>>>>>>>>> 
>>>>>>>>> -- Josh
>>>>>>>>> 
>>>>>>>>> On Thu, Jun 9, 2011 at 12:54 PM, Ralph Castain <[email protected]> wrote:
>>>>>>>>>> I agree - let's not get overly complex unless we can clearly
>>>>>>>>>> articulate a requirement to do so.
>>>>>>>>>> 
>>>>>>>>>> On Thu, Jun 9, 2011 at 10:45 AM, George Bosilca <[email protected]> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> This will require exactly opposite registration and de-registration
>>>>>>>>>>> order, or no de-registration at all (aka no way to unload a
>>>>>>>>>>> component). Or some even more complex code to deal with it internally.
>>>>>>>>>>> 
>>>>>>>>>>> If the error manager handles the callbacks, it can use the
>>>>>>>>>>> registration ordering (which will be what the approach can do), and
>>>>>>>>>>> can enforce that all callbacks will be called. I would prefer this
>>>>>>>>>>> approach.
>>>>>>>>>>> 
>>>>>>>>>>> george.
>>>>>>>>>>> 
>>>>>>>>>>> On Jun 9, 2011, at 08:36 , Josh Hursey wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> I would prefer returning the previous callback instead of relying on
>>>>>>>>>>>> the errmgr to get the ordering right. Additionally, when I want to
>>>>>>>>>>>> unregister (or replace) a callback, it is easier to do that with a
>>>>>>>>>>>> single interface than to introduce a new one to remove a particular
>>>>>>>>>>>> callback.
>>>>>>>>>>>> Register:
>>>>>>>>>>>> ompi_errmgr.set_fault_callback(my_callback, prev_callback);
>>>>>>>>>>>> Deregister:
>>>>>>>>>>>> ompi_errmgr.set_fault_callback(prev_callback, old_callback);
>>>>>>>>>>>> or to eliminate all callbacks (if you needed that for some reason):
>>>>>>>>>>>> ompi_errmgr.set_fault_callback(NULL, old_callback);
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> devel mailing list
>>>>>>>>>>> [email protected]
>>>>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> Joshua Hursey
>>>>>>>>> Postdoctoral Research Associate
>>>>>>>>> Oak Ridge National Laboratory
>>>>>>>>> http://users.nccs.gov/~jjhursey
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> 
>> 
>> 
> 
> 
> 
