Re: [OMPI devel] opal_event_loop exiting

Greg Watson Thu, 20 Apr 2006 13:25:21 -0400

Ok, thanks.

For clarification, the model we're using at the moment looks roughly like this:


orte_init();

forever () {
        if (do_our_stuff() == GAME_OVER)
                break;
        opal_event_loop(OPAL_EVLOOP_ONCE);
}

orte_finalize();

The simplest change for us would be something like:

orte_init();

forever () {
        if (do_our_stuff() == GAME_OVER)
                break;
        if (opal_event_loop(OPAL_EVLOOP_ONCE) != ORTE_SUCCESS) {
                clean_up_our_stuff();
                break;
        }
}

orte_finalize();

Greg
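
The proposed loop can be sketched in self-contained form as follows. This is only an illustration under stated assumptions: every name below (stub_event_loop_once, stub_do_our_stuff, SKETCH_SUCCESS, and so on) is a hypothetical stand-in rather than a real OPAL/ORTE symbol, and the "daemon death" is simulated with a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these are real OPAL/ORTE symbols. */
enum { SKETCH_SUCCESS = 0, SKETCH_ERROR = -1 };
enum app_status { APP_CONTINUE, APP_GAME_OVER };

struct sketch_state {
    int polls_until_daemon_dies; /* event-loop passes before the simulated failure */
    int work_until_game_over;    /* work steps before the application finishes */
    bool cleaned_up;             /* set when the host ran its own cleanup */
};

/* Simulated event loop pass: reports an error instead of calling exit(). */
static int stub_event_loop_once(struct sketch_state *s)
{
    if (s->polls_until_daemon_dies <= 0)
        return SKETCH_ERROR;
    s->polls_until_daemon_dies--;
    return SKETCH_SUCCESS;
}

/* Simulated application work. */
static enum app_status stub_do_our_stuff(struct sketch_state *s)
{
    if (s->work_until_game_over <= 0)
        return APP_GAME_OVER;
    s->work_until_game_over--;
    return APP_CONTINUE;
}

/* Returns true if the loop ended because the event loop reported an error. */
static bool run_host_loop(struct sketch_state *s)
{
    assert(s != NULL);
    for (;;) {
        if (stub_do_our_stuff(s) == APP_GAME_OVER)
            return false;
        if (stub_event_loop_once(s) != SKETCH_SUCCESS) {
            s->cleaned_up = true; /* stands in for clean_up_our_stuff() */
            return true;
        }
    }
}
```

In both exits the host regains control and can go on to finalize, which is the property the proposal is after.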


On Apr 20, 2006, at 10:21 AM, Ralph Castain wrote:

You make a good point about the library not calling exit(). I'll have to recruit some help to look at the notion of opal_event_loop returning an error value - it isn't entirely clear who it would return it to in our system. Even though I understand how someone in your situation would handle it, I have to ensure that it doesn't cause the base system problems, or force a major code revision that would need to be scheduled into the project.
We'll have to get back to you on this - most of the folks are at a workshop this week, so it will probably be next week before we can discuss it.
Ralph


Greg Watson wrote:
The simplest thing for us would be for opal_event_loop() to return an error value. That way we can detect the situation and clean up our system. At the moment we're not trying to restart orted, so clean recovery of orte is not that important, though ultimately I would think it is desirable. Other alternatives are to pass you an error handler that you call, or you could send a signal that we can trap.
From our perspective, we're simply calling a library that does stuff. Having the library call exit() at any point is a major problem for applications trying to do more than run a single job.
Greg
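
The error-handler alternative mentioned above might look roughly like the sketch below. It assumes a hypothetical registration call (sketch_set_error_handler) and a hypothetical internal reporting function; neither exists in OPAL/ORTE as far as this thread shows. A counter stands in for the library calling exit().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback type: the library calls this instead of exiting. */
typedef void (*sketch_error_cb)(int error_code, void *user_data);

struct sketch_lib {
    sketch_error_cb on_error; /* NULL means no handler registered */
    void *user_data;          /* passed back to the handler unchanged */
    int fatal_exits;          /* counts the times exit() would have been called */
};

/* Hypothetical registration call made by the host application. */
static void sketch_set_error_handler(struct sketch_lib *lib,
                                     sketch_error_cb cb, void *user_data)
{
    assert(lib != NULL);
    lib->on_error = cb;
    lib->user_data = user_data;
}

/* What the library would do on losing contact with the daemon. */
static void sketch_report_lost_daemon(struct sketch_lib *lib, int error_code)
{
    assert(lib != NULL);
    if (lib->on_error != NULL)
        lib->on_error(error_code, lib->user_data);
    else
        lib->fatal_exits++; /* fallback: today's behavior, simulated */
}

/* Example host handler: records the error so the main loop can clean up. */
static void host_on_error(int error_code, void *user_data)
{
    int *last_error = user_data;
    *last_error = error_code;
}
```

Keeping the exit path as the fallback when no handler is registered would leave existing callers unaffected.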

On Apr 20, 2006, at 9:40 AM, Ralph Castain wrote:
Well, I actually don't know much about opal_event_loop and/or how it is intended to work. My guess is that:
(a) your remote orted is acting as the seed and your local process (the one in Eclipse) is running as a client to that seed - at least, that was the case last I talked to Nathan
(b) when the seed orted dies, it is the oob in your local client that actually detects socket closure and decides that - since it is the seed that has lost contact - the local application must abort.
(c) the errmgr.abort function does exactly what it was supposed to do - it provides an immediate way of killing the local process.
I'd be a little hesitant to recommend overloading the errmgr.abort function as you really do want the local processes to die when losing connection to the seed (at least, until we develop a recovery capability for the seed orted - which is some ways off), and (given the way you are running) I'm not sure you can have a different errmgr for your process while leaving the other one for everyone else.
Probably the best solution for now would be for us to insert a (yet another) MCA parameter into the errmgr that would (if set) have errmgr.abort do something other than exit. The question then is: what would you want it to do? We need to have it tell the rest of the system to stop trying to send messages etc - right now, I don't think the infrastructure exists to do that short of killing orte.
We could try to have errmgr.abort do an orte_finalize - that would kill the orte system without impacting your host program, I suspect. You would then have to re-initialize, so we'd have to find some way to let you know that we had finalized. I can't swear this will work, though - we might well generate a segfault since this is happening deep down inside the system. We could try it, though.
Would any of that be of help? Do you have any suggestions on how we might let you know that we had finalized?
Ralph
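
One way to picture the parameter-controlled abort described above is the sketch below. It is an assumption-laden illustration, not the ORTE errmgr: the policy enum stands in for the proposed MCA parameter, the struct stands in for runtime state, and a boolean flag is just one possible way of letting the host learn that the runtime was finalized.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names throughout; this is not the ORTE errmgr API. */
enum sketch_abort_policy {
    SKETCH_ABORT_EXIT,     /* current behavior: terminate the process */
    SKETCH_ABORT_FINALIZE  /* proposed: shut the runtime down, keep the host alive */
};

struct sketch_runtime {
    enum sketch_abort_policy policy; /* would come from a configuration parameter */
    bool initialized;
    bool finalized_by_abort; /* flag the host can poll to learn what happened */
    bool exit_requested;     /* stands in for calling exit() */
};

/* Abort entry point whose effect depends on the configured policy. */
static void sketch_errmgr_abort(struct sketch_runtime *rt)
{
    assert(rt != NULL);
    switch (rt->policy) {
    case SKETCH_ABORT_FINALIZE:
        rt->initialized = false;       /* stands in for finalizing the runtime */
        rt->finalized_by_abort = true;
        break;
    case SKETCH_ABORT_EXIT:
    default:
        rt->exit_requested = true;
        break;
    }
}

/* Host-side check: true when the runtime must be re-initialized before reuse. */
static bool sketch_needs_reinit(const struct sketch_runtime *rt)
{
    return rt->finalized_by_abort && !rt->initialized;
}
```

The sketch does not address the concern raised above about finalizing from deep inside the system; it only shows the control flow.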


Brian Barrett wrote:
On Apr 19, 2006, at 4:15 PM, Greg Watson wrote:
We've just run across a rather tricky issue. We're calling opal_event_loop() to dispatch orte events to an orted that has been launched separately. However if the orted dies for some reason (gets a signal or whatever) then opal_event_loop() is calling exit(). Needless to say, this is not good behavior for us. Any suggestions on how to get around this problem?
Is the orted you are connecting to the "seed" daemon? I think the only time we should be exiting like that is if the orted was the seed daemon. I'm not sure what we want to do if that's the case -- it looks like we're calling errmgr.abort() when badness happens. I wonder if your application can provide its own errmgr component that provides an abort that doesn't actually abort? Just some off the cuff ideas -- Ralph could probably give a better idea of exactly what is happening...
Brian
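
The idea of an application-supplied errmgr with a non-aborting abort can be pictured as a table of function pointers, as in the minimal sketch below. The struct layout and all names are hypothetical; the real errmgr component interface may look quite different.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical module table; the real errmgr component interface may differ. */
struct sketch_errmgr_module {
    void (*abort_fn)(void);
};

static bool process_would_exit = false; /* stands in for exit() */
static bool host_saw_abort = false;     /* set by the host's replacement */

/* Default behavior: terminate (simulated). */
static void default_abort(void) { process_would_exit = true; }

/* Host-provided replacement: record the event and return to the caller. */
static void host_abort(void) { host_saw_abort = true; }

/* What the runtime would do on losing the seed: call through the module. */
static void sketch_on_seed_lost(const struct sketch_errmgr_module *m)
{
    assert(m != NULL && m->abort_fn != NULL);
    m->abort_fn();
}
```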
_______________________________________________
devel mailing list
[email protected]
http://www.open-mpi.org/mailman/listinfo.cgi/devel
