Sorry for coming late to the conversation. Let me throw in my two cents.
As Jeff mentioned, I'm working on the checkpoint/restart (C/R)
infrastructure in Open MPI at the moment. I'm currently working
inside the Open Run Time Environment (ORTE) layer, identifying and
attempting to handle issues surrounding C/R.
Jeff is correct in saying that suspend/resume (via SIGSTOP/SIGCONT)
can be viewed as a (maybe special) case of C/R. One caveat is that we
will want to support suspend/resume and a low-level checkpointer at
the same time, so this will be an interesting design point when we
get to it. I have kept this situation in the back of my mind as I
design the frameworks, and so far we are not doing anything that
limits the ability to integrate it.
Ralph has highlighted some key issues that need to be solved in order
to properly suspend/resume an application process. Most of these
should be handled as part of the C/R infrastructure. There are some
other difficult items specific to suspend/resume that don't crop up
with traditional C/R, such as an active runtime environment while the
application is suspended.
The suspend/resume scenario that I believe you want (correct me if I
am wrong) is that all of the application processes suspend/resume
operation at the same time. This is, in some ways, the easier case of
suspend/resume, compared with the case where only a select subset of
application processes is suspended/resumed while the others continue
execution.
So in this scenario (where everyone is going through suspend/resume
at roughly the same time) we can adjust the runtime system to
accommodate this situation, say by queueing up subscription
notifications and messages on the wire to the applications upon
suspend then running the queues upon resume.
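The queue-on-suspend, drain-on-resume idea above can be sketched roughly as follows. This is only an illustration in Python, not ORTE code: the class name DeferredDelivery, its methods, and the message strings are all hypothetical.

```python
from collections import deque
from typing import Callable, Deque, List


class DeferredDelivery:
    """Hypothetical helper: holds messages for a suspended application."""

    def __init__(self, deliver: Callable[[str], None]) -> None:
        self._deliver = deliver
        self._pending: Deque[str] = deque()
        self._suspended = False

    def suspend(self) -> None:
        # From now on, incoming messages are queued instead of delivered.
        self._suspended = True

    def post(self, message: str) -> None:
        if self._suspended:
            self._pending.append(message)
        else:
            self._deliver(message)

    def resume(self) -> None:
        # Drain in arrival order first; anything posted during the drain
        # is still queued, so ordering is preserved.
        while self._pending:
            self._deliver(self._pending.popleft())
        self._suspended = False


received: List[str] = []
channel = DeferredDelivery(received.append)
channel.post("stage-gate-1")        # delivered immediately
channel.suspend()
channel.post("subscription-fired")  # queued
channel.post("stage-gate-2")        # queued
held_while_suspended = list(received)
channel.resume()                    # queued messages delivered in order
```

Draining before clearing the suspended flag is a deliberate choice in this sketch, so that a message arriving mid-drain cannot overtake older queued messages.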
This takes into account just the runtime system. The MPI layer has
some additional complexities (e.g., what if we are in the middle of a
collective operation?). The C/R infrastructure will account for
coordinating these complexities, so I believe that using
SIGSTOP/SIGCONT to support a suspend/resume type of operation will be
possible once the C/R infrastructure is in place. It should be noted
that the suspend/resume actions won't be 'quick', as there will be
considerable overhead in quieting things before we can suspend and,
conversely, in waking things up upon resume.
Moral of the story: yes, we should be able to support suspend/resume
(via SIGSTOP/SIGCONT) of an application (not the runtime) once the
C/R infrastructure is in place. However, it may not be supported in
the very first release of the infrastructure due to time constraints.
Cheers,
Josh
On Jun 2, 2006, at 10:55 AM, Ralph Castain wrote:
Jeff Squyres (jsquyres) wrote:
I guess I had in my head that Josh was already working on most of
these issues anyway for the checkpoint / restart work (i.e., all
the quiescing stuff). Indeed, if you think about it -- pause/
resume is one form of a checkpoint/restart. Hence, if the
checkpoint/restart frameworks are laid out right -- and I think
they are -- pause/resume may just be a component in the checkpoint/
restart frameworks (there's a little hand-waving going on here, of
course :-), but I'm trusting that Josh will jump in if I have any
heinously incorrect assumptions).
Good point - but Josh is only beginning to scratch the surface on
the issues I mentioned. Quite a ways from having something for
general use.
This also brings up another [minor] point -- we don't currently
propagate signals out from mpirun to remote processes (e.g.,
SIGUSR1). There hasn't really been a need for this yet, so it's
been a pretty low priority.
Sorry for all the confusion, though -- I keyed off the phrase
"there were some implementation issues that might prevent this
from working" in your original e-mail, which I interpreted as "our
implementation prohibits this." :-)
My fault - should have been clearer.
From: devel-boun...@open-mpi.org [mailto:devel-bounces@open-mpi.org]
On Behalf Of Ralph Castain
Sent: Friday, June 02, 2006 9:12 AM
To: Open MPI Developers
Subject: Re: [OMPI devel] SIGSTOP and SIGCONT on orted
Jeff Squyres (jsquyres) wrote:
Just curious -- what's difficult about this? SIGTSTP and SIGCONT
can be caught; is there something preventing us from sending
"stop" and "continue" messages (just like we send "die" messages)?
Nothing preventing it at all. The problem lies in what you do when
you receive it. Take the example of a launch that used orted
daemons. We could pass the "stop" or "continue" message to the
orted, which could signal its child processes (i.e., the
application processes on that node) with the appropriate signal.
That would stop/continue the child process just fine - but what
about communications that are still in-progress?? Bad news.
So instead you could pass the application process a "stop"
message. The process could then "quiet" the MPI-based messaging
system, reply back to the orted that all is now quiet, and then
the orted could send the appropriate OS-level signal so the
process would truly "stop". "Continue" is much easier, of course -
there is no "quieting" to be done, so the orted could just issue a
"continue" signal to its children.
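The stop handshake described above (stop message, quiet, acknowledge, then OS-level signal) might be modeled as in the following sketch. It is a simulation only: AppProcess, Daemon, the "quiet" acknowledgement string, and the in_flight counter are hypothetical stand-ins, and the points where a real daemon would call os.kill() are marked in comments rather than executed. The sketch also assumes that every process is quieted before any of them is stopped, which is one possible ordering and not something the thread specifies.

```python
from enum import Enum
from typing import List


class State(Enum):
    RUNNING = "running"
    QUIET = "quiet"
    STOPPED = "stopped"


class AppProcess:
    """Hypothetical stand-in for one application process."""

    def __init__(self, in_flight: int) -> None:
        self.in_flight = in_flight  # outstanding messages (simulated)
        self.state = State.RUNNING

    def handle_stop_request(self) -> str:
        # "Quiet" the messaging layer: finish everything still in progress.
        while self.in_flight > 0:
            self.in_flight -= 1
        self.state = State.QUIET
        return "quiet"


class Daemon:
    """Hypothetical stand-in for an orted managing local children."""

    def __init__(self, children: List[AppProcess]) -> None:
        self.children = children
        self.log: List[str] = []

    def stop_all(self) -> None:
        # Phase 1: ask every child to quiet itself and wait for each ack.
        for rank, child in enumerate(self.children):
            self.log.append(f"stop -> rank {rank}")
            if child.handle_stop_request() != "quiet":
                raise RuntimeError(f"rank {rank} did not quiet")
        # Phase 2: only now is an OS-level stop safe.
        for rank, child in enumerate(self.children):
            # A real daemon would call os.kill(pid, signal.SIGSTOP) here.
            child.state = State.STOPPED
            self.log.append(f"SIGSTOP -> rank {rank}")

    def continue_all(self) -> None:
        # No quieting needed on the way back.
        for rank, child in enumerate(self.children):
            # A real daemon would call os.kill(pid, signal.SIGCONT) here.
            child.state = State.RUNNING
            self.log.append(f"SIGCONT -> rank {rank}")


apps = [AppProcess(in_flight=3), AppProcess(in_flight=0)]
daemon = Daemon(apps)
daemon.stop_all()
states_after_stop = [app.state for app in apps]
daemon.continue_all()
```

This covers only the application side of the handshake; it does not quiet the run-time itself.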
Great - except we still haven't "stopped" the run-time! What
happens if the registry is in the middle of a notification process
(e.g., we hit a stage gate and all the notification messages are
being sent, or someone is in the middle of a put that causes a set
of subscriptions to fire and send out messages - that may in turn
cause additional action on the remote host)? What about messages
being routed through the orteds (once we get the routing system in
place)?
Well, we now could go through a similar process to first "quiet"
the run-time itself. We would have to ensure that every subsystem
completed its on-going operation and then "stopped". We would of
course have to tell all the remote processes to "stop" first so
that new requests would quit coming in, or else this process would
never complete. Note that this means the remote processes would
have to receive and "log" any notifications that come in from the
registry after we tell the process to "stop", but could not take
action on those notices until we "continue" the process.
So now we have the MPI and run-time layers "quiet". We send a
message to the remote orteds indicating they should go ahead and
send their local application processes an OS-level signal to
"stop" so that the OS knows not to spend cycles on them.
Unfortunately, we cannot do the same for the orteds themselves, so
that means that the orteds remain "awake" and operating, but they
can just "spin".
All sounds fine. Now all we have to deal with are: all the race
conditions inherent in what I just described; how to deal with
receipt of asynchronous notifications when we've already been told
to stop; the scenarios where we don't have orted daemons on every
node; how to stop/restart major MPI collectives in mid operation;
etc. etc.
Not saying it cannot be done - just indicating that there were
reasons why it wasn't initially done other than "we just didn't
get around to it". :-)
(If I had to guess, I think the user is asking because some other
MPI implementations implement this kind of behavior)
Thanks!
From: devel-boun...@open-mpi.org [mailto:devel-bounces@open-mpi.org]
On Behalf Of Ralph Castain
Sent: Thursday, June 01, 2006 10:50 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] SIGSTOP and SIGCONT on orted
Actually, there were some implementation issues that might
prevent this from working and were the reason we didn't implement
it right away. We don't actually transmit the SIGTERM - we
capture it in mpirun and then propagate our own "die" command to
the remote processes and daemons. Fortunately, "die" is very easy
to implement.
Unfortunately, "stop" and "continue" are much harder to implement
from inside of a process. We'll have to look at it, but this may
not really be feasible.
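The capture-and-translate approach described above for SIGTERM could look something like the following sketch on the mpirun side. It is written in Python for brevity and is not the actual mpirun code. The command names are the words used in this thread, and make_forwarder, install, and the outbox list are hypothetical. SIGSTOP itself cannot be caught by a process, so the sketch maps the catchable SIGTSTP instead. It deliberately leaves out the hard part, which is what the remote side does on receiving "stop".

```python
import signal
from typing import Callable, Dict, List

# Hypothetical mapping from a caught signal to the command sent to remote
# daemons and processes. Assumes a POSIX platform (SIGTSTP, SIGCONT).
COMMAND_FOR_SIGNAL: Dict[int, str] = {
    signal.SIGTERM: "die",
    signal.SIGTSTP: "stop",
    signal.SIGCONT: "continue",
}


def make_forwarder(send: Callable[[str], None]):
    """Build a signal handler that sends our own command, not the signal."""

    def handler(signum, frame):
        send(COMMAND_FOR_SIGNAL[signum])

    return handler


def install(send: Callable[[str], None]) -> None:
    """Register the forwarder for every mapped signal (main thread only)."""
    handler = make_forwarder(send)
    for signum in COMMAND_FOR_SIGNAL:
        signal.signal(signum, handler)


# Exercise the handler directly, without delivering real signals.
outbox: List[str] = []
forward = make_forwarder(outbox.append)
forward(signal.SIGTERM, None)
forward(signal.SIGTSTP, None)
forward(signal.SIGCONT, None)
```

In a real launcher, install() would be called once at startup with a send function that pushes the command over the existing out-of-band channel.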
Ralph
Jeff Squyres (jsquyres) wrote:
The main reason that it doesn't work is because we didn't do
anything to make it work. :-) Specifically, mpirun is not
intercepting SIGSTOP and passing it on to the remote nodes.
There is nothing in the design or architecture that would
prevent this, but we just don't do it [yet].
-----Original Message-----
From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org]
On Behalf Of Pak Lui
Sent: Thursday, June 01, 2006 5:02 PM
To: de...@open-mpi.org
Subject: [OMPI devel] SIGSTOP and SIGCONT on orted
Hi,
I have a question on signals. Normally when I do a SIGTERM
(control-C) on mpirun, the signal seems to get handled in a way
that it broadcasts to the orted and processes on the execution
hosts. However, when I send a SIGSTOP to mpirun, mpirun seems to
have stopped, but the processes of the user executable continue to
run. I guess I could hook up the debugger to mpirun and orted to
see why they are handled differently, but I am anxious to hear
about it here.
I am trying to see the behavior of SIGSTOP and SIGCONT for the
suspension/resumption feature in N1GE. It'll try to use these
signals to stop and continue both mpirun and orted (and its
processes), but the signals (SIGSTOP and SIGCONT) don't seem to get
propagated to the remote orted. I can see there are some issues for
implementing this feature on N1GE because the 'qrsh' interface does
not send the signal to orted on the remote node, but only to
'mpirun'. I am trying to see how to work around this.
--
Thanks,
- Pak Lui
pak....@sun.com
_______________________________________________ devel mailing
list de...@open-mpi.org http://www.open-mpi.org/mailman/
listinfo.cgi/devel
----
Josh Hursey
jjhur...@open-mpi.org
http://www.open-mpi.org/