Re: [OMPI devel] Signals

Pak Lui Tue, 8 Apr 2008 13:36:52 -0400

First, can your user executable install a signal handler that catches SIGUSR2 so that it does not exit? On Solaris the default action for SIGUSR2 is to exit, so the process will terminate unless it catches the signal and does nothing.


from signal(3HEAD)
     Name             Value   Default    Event
     SIGUSR1          16      Exit       User Signal 1
     SIGUSR2          17      Exit       User Signal 2

The other thing is, I suspect orte_plm_rsh_wait_daemon() in the rsh PLM might cause the processes to exit if the orted (or mpirun, if it's on the HNP) receives a signal like SIGUSR2; it would start killing all the user processes on that node once it receives the signal.

I worked around this in the gridengine PLM. Once gridengine_wait_daemon() receives a SIGUSR1/SIGUSR2 signal, it just acknowledges the signal and returns, without declaring launch_failed, which would kill off the user processes. The signals also get passed to the user processes, which can then decide for themselves what to do with them.
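
A rough sketch of that idea, not the actual Open MPI code: it assumes the wait callback is handed a waitpid()-style status for the launcher child, and the helper name launch_failed_from_status is made up for illustration. It treats termination by SIGUSR1/SIGUSR2 as "not a launch failure" and everything else abnormal as a failure.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Hypothetical helper, NOT part of Open MPI: decide from a waitpid()
 * status whether the launch should be declared failed.
 * Returns 1 to declare failure, 0 to leave the job alone. */
int launch_failed_from_status(int status)
{
    /* Clean exit: nothing to report. */
    if (WIFEXITED(status) && WEXITSTATUS(status) == 0) {
        return 0;
    }

    /* Terminated by SIGUSR1/SIGUSR2: assumed here to be a notification
     * (e.g. job kill or suspend warning), so just acknowledge it and
     * do not tear down the user processes. */
    if (WIFSIGNALED(status)) {
        int sig = WTERMSIG(status);
        if (sig == SIGUSR1 || sig == SIGUSR2) {
            return 0;
        }
    }

    /* Any other non-zero exit or signal counts as a failed launch. */
    return 1;
}
```

A wait callback in the rsh PLM could apply the same kind of check before it marks the launch as failed; where exactly that belongs in orte_plm_rsh_wait_daemon() is not something this sketch tries to answer.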

SGE needed this so that job kill and job suspension notifications work properly, since they send a SIGUSR1/2 to mpirun. I believe this is probably what you need in the rsh PLM.


Richard Graham wrote:

I am running into a situation where I am trying to deliver a signal to the
MPI procs (SIGUSR2).  I deliver this to mpirun, which propagates it to the
MPI procs, but then proceeds to kill the children.  Is there an easy way
that I can get around this?  I am using this mechanism in a situation where
I don't have a debugger, and am trying to use it to turn on debugging when I
hit a hang, so killing the MPI procs is really not what I want to have
happen.

Thanks,
Rich

_______________________________________________
devel mailing list
[email protected]
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--

- Pak Lui
[email protected]
