That is strange. If your procs are trapping the signal, then it should be
okay - at least, my signal traps are operating cleanly in Mac, TM, and SLURM
environments.

Let me know if you see anything further and maybe we can figure out why it
is behaving that way.

Ralph



On 4/17/08 2:03 PM, "Richard Graham" <rlgra...@ornl.gov> wrote:

> Ralph,
>   Thanks for looking into this.  I do not think that the behaviour needs to
> change - it is correct.  However, for some reason this is not how things
> were running for me - I wonder what the difference is.  I worked around
> this by getting the PIDs of the MPI processes and delivering the signals
> directly to them, so I was able to avoid the kill, and this was sufficient
> for me.
> 
> Thanks again,
> Rich
> 
> 
> On 4/17/08 3:23 PM, "Ralph Castain" <r...@lanl.gov> wrote:
> 
>> The question was raised on this list a short while ago about potentially
>> incorrect behavior by ORTE/OMPI in response to SIGUSR2 being sent to
>> application procs. I have spent some time chasing this down, and it does
>> -not- appear to be an issue within our systems.
>> 
>> What I have found is that if you send a SIGUSR1/2 to mpirun, mpirun and the
>> daemons correctly transmit the provided signal to the application processes.
>> Neither mpirun nor the daemons directly respond to it themselves.
>> 
>> 
>> If the application process has defined its own signal handler to trap
>> USR1/2, then the application process will successfully do so. Everything
>> seems to work fine - the daemon does -not- get a callback, nor does it in
>> any way act on the fact that the proc received this signal - unless the
>> process's signal handler orders the process to exit! In this case, the
>> environment reports to the orted that the process exited during a signal
>> handler, which results in a terminated-by-signal status.
>> 
>> You can, of course, get around this by simply not exiting from within the
>> signal handler. Instead, set a flag and return from the handler, then have
>> an appropriate routine check the flag and exit. I have done that in several
>> codes and would be happy to advise you on how to do it. With this technique,
>> you clear the signal and the environment will not report you as
>> terminated-by-signal.
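
A minimal sketch of the flag-and-return technique described in the paragraph above, assuming a POSIX system. The names pending_signal, on_usr, install_usr_handlers, and usr_signal_seen are hypothetical and do not come from ORTE/OMPI or from any code mentioned in this thread.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Hypothetical flag. volatile sig_atomic_t is the type the C standard
 * guarantees can be written safely from inside a signal handler. */
static volatile sig_atomic_t pending_signal = 0;

/* The handler only records which signal arrived and returns.
 * It never calls exit(), so the process is not seen as dying in a handler. */
static void on_usr(int signo)
{
    pending_signal = signo;
}

/* Install the same handler for SIGUSR1 and SIGUSR2.
 * Returns 0 on success, -1 if sigaction() fails. */
int install_usr_handlers(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_usr;
    sigemptyset(&sa.sa_mask);

    if (sigaction(SIGUSR1, &sa, NULL) != 0) {
        return -1;
    }
    if (sigaction(SIGUSR2, &sa, NULL) != 0) {
        return -1;
    }
    return 0;
}

/* Called from ordinary code (for example, once per iteration of the
 * application's main loop). Returns 0 if no signal has been seen,
 * otherwise the number of the most recent SIGUSR1/2. */
int usr_signal_seen(void)
{
    return (int)pending_signal;
}
```

The application's main loop would then poll usr_signal_seen() and, when it returns nonzero, clean up and leave through its normal exit path (for an MPI code, presumably after MPI_Finalize) rather than exiting from inside the handler.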
>> 
>> 
>> However, if the application process has -not- defined its own signal
>> handler, some native environments terminate the process when it receives
>> SIGUSR1/2! This occurred for me under SLURM on the odin cluster, and under
>> TM on our RRZ cluster. I cannot say it is a universal situation and would
>> welcome more feedback from people with access to other environments.
>> 
>> This termination is dutifully reported to the orted, which notes that the
>> proc was terminated-by-signal. The orted does not check to see -which-
>> signal was used to terminate the proc.
>> 
>> 
>> By our own design requirements, the response to a termination-by-signal of a
>> process is to abort the job. If we want to modify that, it would be simple
>> to say "except if it was a SIGUSR1/2 signal". I have no issue with making
>> that change, but please note that it -is- a change in our defined behavior,
>> and a change from what has been our behavior since the beginning of the
>> project.
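
To illustrate what a terminated-by-signal status looks like to a parent process, and what an "except SIGUSR1/2" rule might amount to, here is a rough POSIX sketch. It is not ORTE code: should_abort_job and status_after_default_signal are hypothetical names, and the child raises the signal on itself purely to keep the example deterministic. Note that POSIX specifies termination as the default action for SIGUSR1 and SIGUSR2, which may explain the behaviour seen when no handler is installed.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical policy: given a wait status, return 1 if the job should be
 * aborted. Termination by SIGUSR1/2 is treated as the exception discussed
 * above; termination by any other signal still aborts. */
int should_abort_job(int status)
{
    int sig;

    if (!WIFSIGNALED(status)) {
        return 0;               /* not terminated by a signal */
    }
    sig = WTERMSIG(status);     /* which signal terminated the child */
    return !(sig == SIGUSR1 || sig == SIGUSR2);
}

/* Fork a child that resets signo to its default action and raises it on
 * itself, then collect the child's wait status in *status.
 * Returns 0 on success, -1 on fork/waitpid failure. */
int status_after_default_signal(int signo, int *status)
{
    pid_t pid = fork();

    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        signal(signo, SIG_DFL); /* no handler: default action applies */
        raise(signo);
        _exit(0);               /* reached only if the signal did not kill us */
    }
    if (waitpid(pid, status, 0) != pid) {
        return -1;
    }
    return 0;
}
```

With these helpers, a status obtained for SIGUSR1 satisfies WIFSIGNALED and is exempted by should_abort_job, while a status obtained for SIGTERM still triggers an abort.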
>> 
>> Let me know if you want to change the design requirement and we can take
>> care of it.
>> 
>> Thanks
>> Ralph
>> 
>> 
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 

