Re: [OMPI devel] SIGSTOP and SIGCONT on orted

Pak Lui Fri, 2 Jun 2006 11:37:26 -0400

Ralph Castain wrote:

Jeff Squyres (jsquyres) wrote:
Just curious -- what's difficult about this? SIGTSTP and SIGCONT can be caught; is there something preventing us from sending "stop" and "continue" messages (just like we send "die" messages)?
Nothing preventing it at all. The problem lies in what you do when you receive it. Take the example of a launch that used orted daemons. We could pass the "stop" or "continue" message to the orted, which could signal its child processes (i.e., the application processes on that node) with the appropriate signal. That would stop/continue the child process just fine - but what about communications that are still in-progress?? Bad news.
So instead you could pass the application process a "stop" message. The process could then "quiet" the MPI-based messaging system, reply back to the orted that all is now quiet, and then the orted could send the appropriate OS-level signal so the process would truly "stop". "Continue" is much easier, of course - there is no "quieting" to be done, so the orted could just issue a "continue" signal to its children.
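
[Editorial sketch] The "quiet first, then stop" handshake described above can be illustrated with a toy parent/child pair. This is only a sketch under stated assumptions, not Open MPI code: the parent stands in for orted, the child for an application process, and the "stop"/"quiet" message strings, CHILD_SOURCE and stop_then_continue() are hypothetical names invented for the illustration. It assumes a POSIX system.

```python
import os
import signal
import subprocess
import sys

# Hypothetical stand-in for an application process. On a "stop" message it
# acknowledges with "quiet"; a real process would first drain its in-flight
# MPI traffic before replying.
CHILD_SOURCE = r'''
import sys
for line in sys.stdin:
    if line.strip() == "stop":
        print("quiet", flush=True)
'''


def stop_then_continue():
    """Play the daemon's role: ask, wait for the ack, then signal."""
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD_SOURCE],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
    )

    # 1. Send the application-level "stop" message and wait for the reply.
    child.stdin.write("stop\n")
    child.stdin.flush()
    ack = child.stdout.readline().strip()

    # 2. Only now send the OS-level signal, and confirm the child stopped.
    os.kill(child.pid, signal.SIGSTOP)
    _, status = os.waitpid(child.pid, os.WUNTRACED)
    stopped = os.WIFSTOPPED(status)

    # 3. "Continue" needs no handshake: just signal the child.
    os.kill(child.pid, signal.SIGCONT)

    # Closing stdin gives the child EOF so it exits cleanly.
    child.stdin.close()
    exit_code = child.wait()
    child.stdout.close()
    return ack, stopped, exit_code
```

stop_then_continue() returns the acknowledgement text, whether the child was observed in the stopped state, and the child's exit code. The sketch deliberately ignores everything that makes the real problem hard: multiple nodes, in-flight collectives and the run-time's own traffic.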

I agree that stopping orted may not be the behavior that we are looking for. Instead, we can send the signals to the application processes, since stopping them is what we are interested in.

The idea is to stop the resource consumption by the user processes once the stop signal is sent from N1GE. Since orted is an administrative daemon rather than a process that is doing work, it probably does not need to be accounted for in the resource usage.

And since 'qrsh' does not issue a 'stop' to orted but only gives a stop signal to mpirun, it's really up to mpirun to decide where to send the stop signal.

Great - except we still haven't "stopped" the run-time! What happens if the registry is in the middle of a notification process (e.g., we hit a stage gate and all the notification messages are being sent, or someone is in the middle of a put that causes a set of subscriptions to fire and send out messages - that may in turn cause additional action on the remote host)? What about messages being routed through the orteds (once we get the routing system in-place)?
Well, we now could go through a similar process to first "quiet" the run-time itself. We would have to ensure that every subsystem completed its on-going operation and then "stopped". We would of course have to tell all the remote processes to "stop" first so that new requests would quit coming in, or else this process would never complete. Note that this means the remote processes would have to receive and "log" any notifications that come in from the registry after we tell the process to "stop", but could not take action on those notices until we "continue" the process.
So now we have the MPI and run-time layers "quiet". We send a message to the remote orteds indicating they should go ahead and send their local application processes an OS-level signal to "stop" so that the OS knows not to spend cycles on them. Unfortunately, we cannot do the same for the orteds themselves, so that means that the orteds remain "awake" and operating, but they can just "spin".
All sounds fine. Now all we have to deal with are: all the race conditions inherent in what I just described; how to deal with receipt of asynchronous notifications when we've already been told to stop; the scenarios where we don't have orted daemons on every node; how to stop/restart major MPI collectives in mid operation; etc. etc.
Not saying it cannot be done - just indicating that there were reasons why it wasn't initially done other than "we just didn't get around to it". :-)
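
[Editorial sketch] The requirement that a "stopped" process still receives and logs registry notifications, but acts on them only after "continue", amounts to a deferred-delivery queue. The class below is a minimal single-process illustration; DeferredNotifier and its method names are hypothetical and do not correspond to any Open MPI interface.

```python
class DeferredNotifier:
    """Deliver notifications immediately, or log them while stopped."""

    def __init__(self, handler):
        self._handler = handler   # callable that acts on a notification
        self._stopped = False
        self._pending = []        # notifications logged while stopped

    def stop(self):
        # From now on, notifications are recorded but not acted upon.
        self._stopped = True

    def notify(self, event):
        if self._stopped:
            self._pending.append(event)
        else:
            self._handler(event)

    def resume(self):
        # Act on everything that arrived while stopped, in arrival order.
        self._stopped = False
        pending, self._pending = self._pending, []
        for event in pending:
            self._handler(event)
```

For example, with handled = [] and notifier = DeferredNotifier(handled.append), events passed to notify() after stop() reach handled only once resume() is called. The race conditions mentioned above (a notification arriving while stop() is in progress, or from another thread) are not addressed here.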

Excellent explanations. These issues seem to be non-trivial and I don't see that we can resolve them at this point, not even when we make sure the run-time communications are in a state of quiescence. It may be wise to keep this feature out for now.

(If I had to guess, I think the user is asking because some other MPI implementations implement this kind of behavior)

I am not sure if we hear high demand from users for this feature or not, but while reading some of the posts on sunsource.net on job suspension, I actually don't think other MPI implementations have done this, except for ClusterTools, our previous MPI implementation. There are some issues involving the communication timeouts that you already mentioned, file I/O, plus others. So it could be messy to implement this feature for parallel jobs in general.

http://gridengine.sunsource.net/servlets/ReadMsg?list=users&msgNo=1418

There are also some workarounds mentioned. One is for the user to put the parallel job in a subordinate queue, or to modify the existing queue to give it lower priority, instead of using the stop signal to freeze the application processes.

Thanks!
    ------------------------------------------------------------------------
    *From:* devel-boun...@open-mpi.org
    [mailto:devel-boun...@open-mpi.org] *On Behalf Of *Ralph Castain
    *Sent:* Thursday, June 01, 2006 10:50 PM
    *To:* Open MPI Developers
    *Subject:* Re: [OMPI devel] SIGSTOP and SIGCONT on orted

    Actually, there were some implementation issues that might prevent
    this from working and were the reason we didn't implement it right
    away. We don't actually transmit the SIGTERM - we capture it in
    mpirun and then propagate our own "die" command to the remote
    processes and daemons. Fortunately, "die" is very easy to implement.

    Unfortunately, "stop" and "continue" are much harder to implement
    from inside of a process. We'll have to look at it, but this may
    not really be feasible.

    Ralph



    Jeff Squyres (jsquyres) wrote:
The main reason that it doesn't work is because we didn't do anything
to make it work.  :-)

Specifically, mpirun is not intercepting SIGSTOP and passing it on to
the remote nodes.  There is nothing in the design or architecture that
would prevent this, but we just don't do it [yet].
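
[Editorial sketch] A note on the signals themselves: POSIX does not allow SIGSTOP (or SIGKILL) to be caught, blocked or ignored, so a launcher can only install handlers for the catchable SIGTSTP and SIGCONT. The check below simply shows that distinction. can_install_handler() is a hypothetical helper name, and the sketch assumes it is run from the main thread of a Python program on a POSIX system.

```python
import signal


def can_install_handler(signum):
    """Return True if this process is allowed to install a handler."""
    try:
        previous = signal.signal(signum, lambda signo, frame: None)
    except OSError:
        # The kernel rejects handlers for SIGSTOP and SIGKILL.
        return False
    if previous is not None:
        # Put the original disposition back; this is only a probe.
        signal.signal(signum, previous)
    return True
```

On a POSIX system, can_install_handler(signal.SIGTSTP) and can_install_handler(signal.SIGCONT) succeed, while can_install_handler(signal.SIGSTOP) does not.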
-----Original Message-----
From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On Behalf Of Pak Lui
Sent: Thursday, June 01, 2006 5:02 PM
To: de...@open-mpi.org
Subject: [OMPI devel] SIGSTOP and SIGCONT on orted

Hi,
I have a question on signals. Normally when I do a SIGTERM (control-C) on mpirun, the signal seems to get handled in a way that it broadcasts to the orted and processes on the execution hosts. However, when I send a SIGSTOP to mpirun, mpirun seems to have stopped, but the processes of the user executable continue to run. I guess I could hook up the debugger to mpirun and orted to see why they are handled differently, but I guess I am anxious to hear about it here.
I am trying to see the behavior of SIGSTOP and SIGCONT for the suspension/resumption feature in N1GE. It'll try to use these signals to stop and continue both mpirun and orted (and its processes), but the signals (SIGSTOP and SIGCONT) don't seem to get propagated to the remote orted.
I can see there are some issues for implementing this feature on N1GE because the 'qrsh' interface does not send the signal to orted on the remote node, but only to 'mpirun'. I am trying to see how to work around this.
--

Thanks,

- Pak Lui
pak....@sun.com

_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
------------------------------------------------------------------------


--

Thanks,

- Pak Lui
pak....@sun.com
