Re: [OMPI devel] SIGSTOP and SIGCONT on orted

Josh Hursey Sun, 4 Jun 2006 15:05:37 -0400

Sorry for coming late to the conversation. Let me throw in my two cents.

As Jeff mentioned, I'm working on the checkpoint/restart (C/R) infrastructure in Open MPI at the moment. I'm currently working inside the Open Run Time Environment (ORTE) layer, identifying and attempting to handle issues surrounding C/R.

Jeff is correct in saying that suspend/resume (via SIGSTOP/SIGCONT) can be viewed as a (maybe special) case of C/R. One caveat is that we will want to support suspend/resume and a low-level checkpointer at the same time, so this will be an interesting design point when we get to it. I have kept this situation in the back of my mind as I design the frameworks, and so far we are not doing anything that limits our ability to integrate it.

Ralph has highlighted some key issues that need to be solved in order to properly suspend/resume an application process. Most of these should be handled as part of the C/R infrastructure. There are some other difficult items specific to suspend/resume that don't crop up with traditional C/R, such as an active runtime environment while the application is suspended.

The suspend/resume scenario that I believe you want (correct me if I am wrong) is one in which all of the application processes suspend/resume operation at the same time. This is, in some ways, the easier case of suspend/resume, compared with the case where only a select subset of application processes is suspended/resumed while the others continue execution.

So in this scenario (where everyone is going through suspend/resume at roughly the same time) we can adjust the runtime system to accommodate this situation, say by queueing up subscription notifications and messages on the wire to the applications upon suspend, then running the queues upon resume.
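
A minimal sketch of the queue-on-suspend, drain-on-resume idea, written in Python for brevity. The class name `DeferredNotifier` and its methods are hypothetical and are not part of ORTE; the sketch only illustrates holding notifications while an application is suspended and replaying them in order on resume.

```python
from collections import deque


class DeferredNotifier:
    """Delivers notifications immediately, or holds them while suspended.

    Hypothetical illustration only; not an ORTE interface.
    """

    def __init__(self, deliver):
        self._deliver = deliver      # callable that hands a message to the app
        self._pending = deque()      # messages held back during suspension
        self._suspended = False

    def notify(self, message):
        if self._suspended:
            self._pending.append(message)   # queue instead of delivering
        else:
            self._deliver(message)

    def suspend(self):
        self._suspended = True

    def resume(self):
        self._suspended = False
        # Replay queued messages in arrival order.
        while self._pending:
            self._deliver(self._pending.popleft())
```

Anything that arrives between `suspend()` and `resume()` is delivered only after `resume()`, and in the order it arrived. A real runtime would also have to bound the queue and decide what to do with messages that go stale while the application is stopped.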

That takes into account just the runtime system; the MPI layer has some additional complexities (e.g., what if we are in the middle of a collective operation?). The C/R infrastructure will account for coordinating these complexities. So I believe that using SIGSTOP/SIGCONT to support a suspend/resume type of operation will be possible once the C/R infrastructure is in place. It should be noted that the suspend/resume actions won't be 'quick', as there will be considerable overhead in quieting things before we can suspend and, conversely, in waking things up upon resume.

Moral of the story: yes, we should be able to support suspend/resume (via SIGSTOP/SIGCONT) of an application (not the runtime) once the C/R infrastructure is in place. However, it may not be supported in the very first release of the infrastructure due to time constraints.


Cheers,
Josh

On Jun 2, 2006, at 10:55 AM, Ralph Castain wrote:

Jeff Squyres (jsquyres) wrote:
I guess I had in my head that Josh was already working on most of these issues anyway for the checkpoint/restart work (i.e., all the quiescing stuff). Indeed, if you think about it -- pause/resume is one form of a checkpoint/restart. Hence, if the checkpoint/restart frameworks are laid out right -- and I think they are -- pause/resume may just be a component in the checkpoint/restart frameworks (there's a little hand-waving going on here, of course :-), but I'm trusting that Josh will jump in if I have any heinously incorrect assumptions).
Good point - but Josh is only beginning to scratch the surface on the issues I mentioned. Quite a ways from having something for general use.
This also brings up another [minor] point -- we don't currently propagate signals out from mpirun to remote processes (e.g., SIGUSR1). There hasn't really been a need for this yet, so it's been a pretty low priority.
Sorry for all the confusion, though -- I keyed off the phrase "there were some implementation issues that might prevent this from working" in your original e-mail, which I interpreted as "our implementation prohibits this." :-)
My fault - should have been clearer.
From: devel-boun...@open-mpi.org [mailto:devel-bounces@open-mpi.org] On Behalf Of Ralph Castain
Sent: Friday, June 02, 2006 9:12 AM
To: Open MPI Developers
Subject: Re: [OMPI devel] SIGSTOP and SIGCONT on orted



Jeff Squyres (jsquyres) wrote:
Just curious -- what's difficult about this? SIGTSTP and SIGCONT can be caught; is there something preventing us from sending "stop" and "continue" messages (just like we send "die" messages)?
Nothing preventing it at all. The problem lies in what you do when you receive it. Take the example of a launch that used orted daemons. We could pass the "stop" or "continue" message to the orted, which could signal its child processes (i.e., the application processes on that node) with the appropriate signal. That would stop/continue the child process just fine - but what about communications that are still in-progress?? Bad news.
So instead you could pass the application process a "stop" message. The process could then "quiet" the MPI-based messaging system, reply back to the orted that all is now quiet, and then the orted could send the appropriate OS-level signal so the process would truly "stop". "Continue" is much easier, of course - there is no "quieting" to be done, so the orted could just issue a "continue" signal to its children.
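
The quiet-then-stop handshake described above can be sketched roughly as follows, in Python and on a POSIX system. This is only an illustration under stated assumptions: the parent plays the role of the orted, the child is a hypothetical stand-in for an application process, and the "stop"/"quiet" strings over a pipe are invented for this example rather than taken from Open MPI.

```python
import os
import signal
import subprocess
import sys

# Hypothetical stand-in for an application process: on a "stop" request it
# would drain its messaging layer, then acknowledge with "quiet".
CHILD_SOURCE = r"""
import sys
for line in sys.stdin:
    if line.strip() == "stop":
        # A real process would quiesce in-flight communication here.
        print("quiet", flush=True)
"""


def stop_and_continue():
    """Run one quiesce/stop/continue cycle and return the observed events."""
    events = []
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD_SOURCE],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
    )
    try:
        # 1. Ask the process to quiet itself and wait for its reply.
        child.stdin.write("stop\n")
        child.stdin.flush()
        events.append(child.stdout.readline().strip())

        # 2. Only now deliver the OS-level stop signal and confirm it took.
        os.kill(child.pid, signal.SIGSTOP)
        _, status = os.waitpid(child.pid, os.WUNTRACED)
        events.append("stopped" if os.WIFSTOPPED(status) else "not stopped")

        # 3. "Continue" needs no handshake.
        os.kill(child.pid, signal.SIGCONT)
        events.append("continued")
    finally:
        child.kill()
        child.wait()
        child.stdin.close()
        child.stdout.close()
    return events
```

The ordering is the point: the signal is sent only after the acknowledgement arrives, so nothing is in flight when the process stops. The sketch does nothing about the runtime itself, which is the harder problem discussed next.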
Great - except we still haven't "stopped" the run-time! What happens if the registry is in the middle of a notification process (e.g., we hit a stage gate and all the notification messages are being sent, or someone is in the middle of a put that causes a set of subscriptions to fire and send out messages - that may in turn cause additional action on the remote host)? What about messages being routed through the orteds (once we get the routing system in place)?
Well, we now could go through a similar process to first "quiet" the run-time itself. We would have to ensure that every subsystem completed its on-going operation and then "stopped". We would of course have to tell all the remote processes to "stop" first so that new requests would quit coming in, or else this process would never complete. Note that this means the remote processes would have to receive and "log" any notifications that come in from the registry after we tell the process to "stop", but could not take action on those notices until we "continue" the process.
So now we have the MPI and run-time layers "quiet". We send a message to the remote orteds indicating they should go ahead and send their local application processes an OS-level signal to "stop" so that the OS knows not to spend cycles on them. Unfortunately, we cannot do the same for the orteds themselves, so that means that the orteds remain "awake" and operating, but they can just "spin".
All sounds fine. Now all we have to deal with are: all the race conditions inherent in what I just described; how to deal with receipt of asynchronous notifications when we've already been told to stop; the scenarios where we don't have orted daemons on every node; how to stop/restart major MPI collectives in mid operation; etc. etc.
Not saying it cannot be done - just indicating that there were reasons why it wasn't initially done other than "we just didn't get around to it". :-)
(If I had to guess, I think the user is asking because some other MPI implementations implement this kind of behavior)
Thanks!
From: devel-boun...@open-mpi.org [mailto:devel-bounces@open-mpi.org] On Behalf Of Ralph Castain
Sent: Thursday, June 01, 2006 10:50 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] SIGSTOP and SIGCONT on orted
Actually, there were some implementation issues that might prevent this from working and were the reason we didn't implement it right away. We don't actually transmit the SIGTERM - we capture it in mpirun and then propagate our own "die" command to the remote processes and daemons. Fortunately, "die" is very easy to implement.
Unfortunately, "stop" and "continue" are much harder to implement from inside of a process. We'll have to look at it, but this may not really be feasible.
Ralph
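
As a rough illustration of the "capture the signal, then propagate our own command" pattern described in this message, here is a small Python sketch. The `Launcher` class, its `send_command` method, and the node names are hypothetical; a real launcher would send the command over its own out-of-band channel rather than append to a list.

```python
import os
import signal
import time


class Launcher:
    """Hypothetical stand-in for mpirun: turns SIGTERM into "die" commands."""

    def __init__(self, remotes):
        self.remotes = list(remotes)
        self.sent = []  # (remote, command) pairs "sent" so far

    def send_command(self, remote, command):
        # Placeholder for the launcher's real messaging channel.
        self.sent.append((remote, command))

    def on_sigterm(self, signum, frame):
        # Do not forward the raw signal; issue an explicit command instead.
        for remote in self.remotes:
            self.send_command(remote, "die")


def demo():
    launcher = Launcher(["node1", "node2"])
    previous = signal.signal(signal.SIGTERM, launcher.on_sigterm)
    try:
        os.kill(os.getpid(), signal.SIGTERM)  # simulate an external SIGTERM
        # The Python-level handler runs at the interpreter's next signal check.
        deadline = time.monotonic() + 1.0
        while not launcher.sent and time.monotonic() < deadline:
            time.sleep(0.01)
    finally:
        # Restore whatever handler was installed before the demo.
        signal.signal(signal.SIGTERM,
                      previous if previous is not None else signal.SIG_DFL)
    return launcher.sent
```

Calling `demo()` from the main thread records one "die" command per remote. A "stop" command could be carried the same way, but as noted above the difficulty lies in what the receiver does with it, not in the transport.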



Jeff Squyres (jsquyres) wrote:
The main reason that it doesn't work is because we didn't do anything to make it work. :-) Specifically, mpirun is not intercepting SIGSTOP and passing it on to the remote nodes. There is nothing in the design or architecture that would prevent this, but we just don't do it [yet].
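
One detail that bears on any interception scheme: on POSIX systems SIGSTOP, like SIGKILL, cannot be caught, blocked, or ignored, whereas SIGTSTP, SIGCONT, and SIGTERM can have handlers. The short Python check below illustrates this; the helper name `can_install_handler` is made up for the example, and it must run in the main thread because Python only allows handlers to be installed there.

```python
import signal


def can_install_handler(signum):
    """Return True if this process may install a handler for signum."""
    try:
        previous = signal.signal(signum, lambda s, f: None)
    except (OSError, ValueError):
        # OSError: the OS refuses (e.g. SIGSTOP, SIGKILL).
        # ValueError: not called from the main thread.
        return False
    # Put the original disposition back so the check has no side effects.
    if previous is not None:
        signal.signal(signum, previous)
    return True


catchable = {
    name: can_install_handler(getattr(signal, name))
    for name in ("SIGTSTP", "SIGCONT", "SIGTERM", "SIGSTOP")
}
```

So a launcher that wants to relay a suspend request could react to SIGTSTP (or to an explicit command), but it could not observe a SIGSTOP sent to itself before being stopped by it.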
-----Original Message-----
From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On Behalf Of Pak Lui
Sent: Thursday, June 01, 2006 5:02 PM
To: de...@open-mpi.org
Subject: [OMPI devel] SIGSTOP and SIGCONT on orted

Hi,

I have a question on signals. Normally when I do a SIGTERM (control-C) on mpirun, the signal seems to get handled in a way that it broadcasts to the orted and processes on the execution hosts. However, when I send a SIGSTOP to mpirun, mpirun seems to have stopped, but the processes of the user executable continue to run. I guess I could hook up the debugger to mpirun and orted to see why they are handled differently, but I guess I am anxious to hear about it here.

I am trying to see the behavior of SIGSTOP and SIGCONT for the suspension/resumption feature in N1GE. It'll try to use these signals to stop and continue both mpirun and orted (and its processes), but the signals (SIGSTOP and SIGCONT) don't seem to get propagated to the remote orted. I can see there are some issues for implementing this feature on N1GE because the 'qrsh' interface does not send the signal to orted on the remote node, but only to 'mpirun'. I am trying to see how to work around this.

--
Thanks,
- Pak Lui
pak....@sun.com
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


----
Josh Hursey
jjhur...@open-mpi.org
http://www.open-mpi.org/
