On 4/8/08 2:19 PM, "Ralph H Castain" <r...@lanl.gov> wrote:
>
>
>
> On 4/8/08 12:10 PM, "Pak Lui" <pak....@sun.com> wrote:
>
>> Richard Graham wrote:
>>> What happens if I deliver SIGUSR2 to mpirun? What I observe (for both
>>> ssh/rsh and Torque) is that if I deliver SIGUSR2 to mpirun, the signal
>>> does get propagated to the MPI procs, which do invoke the signal handler
>>> I registered, but the job is terminated right after that. However, if I
>>> deliver the signal directly to the MPI procs, the signal handler is
>>> invoked and the job continues to run.
>>
>> This is exactly what I observed previously when I made the gridengine
>> change. It happens because orterun (aka mpirun) is the process that forks
>> and execs the executables on the HNP; on the remote nodes you don't have
>> this problem. So the wait_daemon function picks up the signal from mpirun
>> on the HNP and then kills off the children.
>
> I'll look into this, but I don't believe this is true UNLESS something
> exits. The wait_daemon function only gets called when a proc terminates - it
> doesn't "pick up" a signal on its own. Perhaps we are just having a language
> problem here...
>
> In the rsh situation, the daemon "daemonizes" and closes the ssh session
> during launch. If the ssh session closed on a signal, then that would return
> and indicate that a daemon had failed to start, causing the abort. But that
> session is successfully closed PRIOR to the launch of any MPI procs. I note
> that we don't "deregister" the waitpid, though, so there may be some issue
> there.
>
> However, we most certainly do NOT look for such things in Torque. My guess
> is that something is causing a proc/daemon to abort, which then causes the
> system to abort the job.
>
> I have tried this on my Mac (got other things going on at the moment on the
> distributed machines), and all works as expected. However, that doesn't mean
> there isn't a problem in general.
Interesting - I do most of my development work on the Mac, and this is where
I also see the problem. I have not updated in a couple of days, so maybe
things have been fixed since.
Rich
>
> Will investigate when I have time shortly.
>
>>
>>>
>>> So, I think that what was intended to happen is the correct thing, but for
>>> some reason it is not happening.
>>>
>>> Rich
>>>
>>>
>>> On 4/8/08 1:47 PM, "Ralph H Castain" <r...@lanl.gov> wrote:
>>>
>>>> I found what Pak said a little confusing as the wait_daemon function
>>>> doesn't actually receive a signal itself - it only detects that a proc
>>>> has exited and checks to see if that happened due to a signal. If so, it
>>>> flags that situation and will order the job aborted.
>>>>
>>>> So if the proc continues alive, the fact that it was hit with SIGUSR2 will
>>>> not be detected by ORTE nor will anything happen - however, if the OS uses
>>>> SIGUSR2 to terminate the proc, or if the proc terminates when it gets that
>>>> signal, we will see that proc terminate due to signal and abort the rest of
>>>> the job.
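To make the distinction above concrete, here is a minimal sketch in plain POSIX C. It is not taken from ORTE: the helper name child_fate_after_sigusr2 is made up for illustration, and it only shows what a parent can learn from waitpid() after a child is hit with SIGUSR2. A child that handles the signal and later exits normally is reported as a normal exit, while a child left with the default action is reported as killed by the signal.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static void noop_handler(int sig) { (void)sig; }

/* Hypothetical helper, not ORTE code.
 * Forks a child, sends it SIGUSR2, and returns the signal that
 * terminated the child, 0 if it exited normally, or -1 on error. */
int child_fate_after_sigusr2(int install_handler)
{
    sigset_t block, old, wait_mask;
    int status;
    pid_t pid;

    /* Block SIGUSR2 before forking so the signal stays pending in the
     * child until the child is ready for it (avoids a startup race). */
    sigemptyset(&block);
    sigaddset(&block, SIGUSR2);
    if (sigprocmask(SIG_BLOCK, &block, &old) != 0)
        return -1;

    pid = fork();
    if (pid < 0) {
        sigprocmask(SIG_SETMASK, &old, NULL);
        return -1;
    }
    if (pid == 0) {
        struct sigaction sa;
        memset(&sa, 0, sizeof(sa));
        sigemptyset(&sa.sa_mask);
        /* Either catch SIGUSR2 or force the default (terminate) action. */
        sa.sa_handler = install_handler ? noop_handler : SIG_DFL;
        sigaction(SIGUSR2, &sa, NULL);

        wait_mask = old;
        sigdelset(&wait_mask, SIGUSR2);
        sigsuspend(&wait_mask);  /* unblock SIGUSR2 and wait for it */
        _exit(0);                /* reached only if the handler ran */
    }

    if (kill(pid, SIGUSR2) != 0) {
        sigprocmask(SIG_SETMASK, &old, NULL);
        return -1;
    }
    sigprocmask(SIG_SETMASK, &old, NULL);

    if (waitpid(pid, &status, 0) != pid)
        return -1;
    if (WIFSIGNALED(status))
        return WTERMSIG(status);  /* what a launcher would treat as abnormal */
    return 0;
}
```

Calling it with install_handler set to 1 models a proc that registered a handler; calling it with 0 models a proc that dies from the default action.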
>>>>
>>>> We could change it if that is what people want - it is trivial to insert
>>>> code to say "kill everything except if it died due to a certain signal".
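A sketch of what such an exception could look like, assuming the decision is made from the terminating signal number. The function name signal_death_aborts_job is hypothetical, and treating SIGUSR1/SIGUSR2 as the tolerated signals is an assumption borrowed from the gridengine discussion in this thread; this is not the actual ORTE code.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <signal.h>
#include <stdbool.h>

/* Hypothetical policy check, not ORTE code.
 * termsig is the signal that terminated a proc (e.g. from WTERMSIG).
 * Returns true if the rest of the job should be aborted. */
bool signal_death_aborts_job(int termsig)
{
    switch (termsig) {
    case SIGUSR1:
    case SIGUSR2:
        return false;  /* assumed "tolerated" signals: do not abort */
    default:
        return true;   /* any other signal death aborts the job */
    }
}
```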
>>>>
>>>> <shrug> up to you folks. Current behavior is what you said you wanted a
>>>> long time ago - nothing has changed in this regard for several years.
>>>>
>>>>
>>>> On 4/8/08 11:36 AM, "Pak Lui" <pak....@sun.com> wrote:
>>>>
>>>>> First, can your user executable install a signal handler that catches
>>>>> SIGUSR2 so that it does not exit? By default on Solaris the process is
>>>>> going to exit, unless you catch the signal and have the handler do
>>>>> nothing.
>>>>>
>>>>> from signal(3HEAD):
>>>>>
>>>>> Name      Value   Default   Event
>>>>> SIGUSR1   16      Exit      User Signal 1
>>>>> SIGUSR2   17      Exit      User Signal 2
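For reference, a minimal sketch of such a handler using POSIX sigaction(). The names debug_requested, install_sigusr2_handler and debug_was_requested are made up for illustration. The handler only sets a flag, which the application could poll to turn on debugging output; that use is an assumption based on the use case described in this thread.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <signal.h>
#include <string.h>

/* Flag set from the handler; sig_atomic_t is safe to write there. */
static volatile sig_atomic_t debug_requested = 0;

static void on_sigusr2(int sig)
{
    (void)sig;
    debug_requested = 1;  /* do nothing else: the process keeps running */
}

/* Hypothetical helper: replace the default SIGUSR2 action (terminate)
 * with the handler above. Returns 0 on success, -1 on failure. */
int install_sigusr2_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigusr2;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;  /* restart interrupted system calls */
    return sigaction(SIGUSR2, &sa, NULL);
}

/* Hypothetical accessor the application could poll. */
int debug_was_requested(void)
{
    return debug_requested != 0;
}
```

With this installed, a SIGUSR2 delivered to the process no longer triggers the default exit; the process continues and can check the flag at a convenient point.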
>>>>>
>>>>> The other thing is, I suspect orte_plm_rsh_wait_daemon() in the rsh plm
>>>>> might cause the processes to exit if the orted (or mpirun if it's on
>>>>> HNP) receives a signal like SIGUSR2; it'd work on killing all the user
>>>>> processes on that node once it receives a signal.
>>>>>
>>>>> I worked around this for the gridengine PLM. When
>>>>> gridengine_wait_daemon() sees a SIGUSR1/SIGUSR2 signal, it just
>>>>> acknowledges the signal and returns, without declaring launch_failed,
>>>>> which would kill off the user processes. The signals also get passed
>>>>> to the user processes, which can decide for themselves what to do
>>>>> with them.
>>>>>
>>>>> SGE needed this so that the job kill and job suspension notifications
>>>>> work properly, since SGE sends a SIGUSR1/2 to mpirun. I believe this
>>>>> is probably what you need in the rsh plm.
>>>>>
>>>>> Richard Graham wrote:
>>>>>> I am running into a situation where I am trying to deliver a signal
>>>>>> to the mpi procs (sigusr2). I deliver this to mpirun, which propagates
>>>>>> it to the mpi procs, but then proceeds to kill the children. Is there
>>>>>> an easy way that I can get around this? I am using this mechanism in a
>>>>>> situation where I don't have a debugger, and trying to use this to
>>>>>> turn on debugging when I hit a hang, so killing the mpi procs is
>>>>>> really not what I want to have happen.
>>>>>>
>>>>>> Thanks,
>>>>>> Rich
>>>>>>
>>>>>> _______________________________________________
>>>>>> devel mailing list
>>>>>> de...@open-mpi.org
>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>
>>>
>>
>
>