I guess I had in my head that Josh already working on most of these
issues anyway for the checkpoint / restart work (i.e., all the quiescing
stuff).  Indeed, if you think about it -- pause/resume is one form of a
checkpoint/restart.  Hence, if the checkpoint/restart frameworks are
laid out right -- and I think they are -- pause/resume may just be a
component in the checkpoint/restart frameworks (there's a little
hand-waving going on here, of course :-), but I'm trusting that Josh
will jump in if I have any heinously incorrect assumptions).
 
This also brings up another [minor] point -- we don't currently
propagate signals out from mpirun to remote processes (e.g., SIGUSR1).
There hasn't really been a need for this yet, so it's been a pretty low
priority.
 
Sorry for all the confusion, though -- I keyed off the phrase "there
were some implementation issues that might prevent this from working" in
your original e-mail, which I interpreted as "our implementation
prohibits this."  :-)
 

________________________________

        From: devel-boun...@open-mpi.org
[mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
        Sent: Friday, June 02, 2006 9:12 AM
        To: Open MPI Developers
        Subject: Re: [OMPI devel] SIGSTOP and SIGCONT on orted
        
        


        Jeff Squyres (jsquyres) wrote: 

                Just curious -- what's difficult about this?  SIGTSTP
and SIGCONT can be caught; is there something preventing us from sending
"stop" and "continue" messages (just like we send "die" messages)?

        Nothing preventing it at all. The problem lies in what you do
when you receive it. Take the example of a launch that used orted
daemons. We could pass the "stop" or "continue" message to the orted,
which could signal its child processes (i.e., the application processes
on that node) with the appropriate signal. That would stop/continue the
child process just fine - but what about communications that are still
in-progress?? Bad news.
        
        So instead you could pass the application process a "stop"
message. The process could then "quiet" the MPI-based messaging system,
reply back to the orted that all is now quiet, and then the orted could
send the appropriate OS-level signal so the process would truly "stop".
"Continue" is much easier, of course - there is no "quieting" to be
done, so the orted could just issue a "continue" signal to its children.
        
        Great - except we still haven't "stopped" the run-time! What
happens if the registry is in the middle of a notification process
(e.g., we hit a stage gate and all the notification messages are being
sent, or someone is in the middle of a put that causes a set of
subscriptions to fire and send out messages - that may in turn cause
additional action on the remote host)? What about messages being routed
through the orteds (once we get the routing system in-place)?
        
        Well, we now could go through a similar process to first "quiet"
the run-time itself. We would have to ensure that every subsystem
completed its on-going operation and then "stopped". We would of course
have to tell all the remote processes to "stop" first so that new
requests would quit coming in, or else this process would never
complete. Note that this means the remote processes would have to
receive and "log" any notifications that come in from the registry after
we tell the process to "stop", but could not take action on those
notices until we "continue" the process.
        
        So now we have the MPI and run-time layers "quiet". We send a
message to the remote orteds indicating they should go ahead and send
their local application processes an OS-level signal to "stop" so that
the OS knows not to spend cycles on them. Unfortunately, we cannot do
the same for the orteds themselves, so that means that the orteds remain
"awake" and operating, but they can just "spin".
        
        All sounds fine. Now all we have to deal with are: all the race
conditions inherent in what I just described; how to deal with receipt
of asynchronous notifications when we've already been told to stop; the
scenarios where we don't have orted daemons on every node; how to
stop/restart major MPI collectives in mid operation; etc. etc.
        
        Not saying it cannot be done - just indicating that there were
reasons why it wasn't initially done other than "we just didn't get
around to it". :-) 
        
        
        

                 
                (If I had to guess, I think the user is asking because
some other MPI implementations implement this kind of behavior)
                 
                Thanks!


________________________________

                        From: devel-boun...@open-mpi.org
[mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
                        Sent: Thursday, June 01, 2006 10:50 PM
                        To: Open MPI Developers
                        Subject: Re: [OMPI devel] SIGSTOP and SIGCONT on
orted
                        
                        
                        Actually, there were some implementation issues
that might prevent this from working and were the reason we didn't
implement it right away. We don't actually transmit the SIGTERM - we
capture it in mpirun and then propagate our own "die" command to the
remote processes and daemons. Fortunately, "die" is very easy to
implement.
                        
                        Unfortunately, "stop" and "continue" are much
harder to implement from inside of a process. We'll have to look at it,
but this may not really be feasible.
                        
                        Ralph
                        
                        
                        
                        Jeff Squyres (jsquyres) wrote: 

                                The main reason that it doesn't work is
because we didn't do any thing
                                to make it work.  :-)
                                
                                Specifically, mpirun is not intercepting
SIGSTOP and passing it on to
                                the remote nodes.  There is nothing in
the design or architecture that
                                would prevent this, but we just don't do
it [yet].
                                 
                                
                                  

                                -----Original Message-----
                                From: devel-boun...@open-mpi.org 
                                [mailto:devel-boun...@open-mpi.org] On
Behalf Of Pak Lui
                                Sent: Thursday, June 01, 2006 5:02 PM
                                To: de...@open-mpi.org
                                Subject: [OMPI devel] SIGSTOP and
SIGCONT on orted
                                
                                Hi,
                                
                                I have a question on signals. Normally
when I do a SIGTERM 
                                (control-C) 
                                on mpirun, the signal seems to get
handled in a way that it 
                                broadcasts 
                                to the orted and processes on the
execution hosts. However, 
                                when I send 
                                a SIGSTOP to mpirun, mpirun seems to
have stopped, but the 
                                processes of 
                                the user executable continue to run. I
guess I could hook up the 
                                debugger to mpirun and orted to see why
they are handled differently, 
                                but I guess I anxious to hear about it
here.
                                
                                I am trying to see the behavior of
SIGSTOP and SIGCONT for the 
                                suspension/resumption feature in N1GE.
It'll try to use these 
                                signals to 
                                stop and continue both mpirun and orted
(and its processes), but the 
                                signals (SIGSTOP and SIGCONT) don't seem
to get propagated to 
                                the remote 
                                orted.
                                
                                I can see there are some issues for
implementing this feature on N1GE 
                                because the 'qrsh' interface does not
send the signal to orted on the 
                                remote node, but only to 'mpirun'. I am
trying to see how to 
                                work around 
                                this.
                                
                                -- 
                                
                                Thanks,
                                
                                - Pak Lui
                                pak....@sun.com
                                
        
_______________________________________________
                                devel mailing list
                                de...@open-mpi.org
        
http://www.open-mpi.org/mailman/listinfo.cgi/devel
                                
                                    

                                
        
_______________________________________________
                                devel mailing list
                                de...@open-mpi.org
        
http://www.open-mpi.org/mailman/listinfo.cgi/devel
                                
                                  

Reply via email to