[OMPI devel] ORTED process group

2016-02-23 Thread Ralph Castain
Hello all

The question was raised at today's developer workshop about our current
practice of putting the application processes in a separate process group
from their parent ORTE daemon. This has the unfortunate side effect of
making the processes "invisible" to any host resource manager when they are
launched via mpirun - i.e., the RM launches the orted's, but never sees the
local application procs that the orted fork/exec's. Since those processes
are then moved into a separate process group, the host RM has no way of
killing them should the orted fail and the procs not suicide.

The request was made that we modify the orted so it no longer changes the
application proc's process group. This will leave the orted and the
application procs in the same process group, and so any signals delivered
by the host RM to the orted will be received by all processes.

However, in reviewing the code, I (re)discovered why this was originally
done. The issue stems from when Sun joined the OMPI project - their MPI
implementation allowed the user to pause their job by hitting mpirun with a
SIGTSTP, and then start again by hitting mpirun with a SIGCONT. These
signals needed to be seen not just by the initial child processes started
by the orted, but also by any subsequent child processes those processes
might have started.

It is this latter point that led to the process group change. Since the
"grandchild" processes were not started by the orted, the orted itself has
no knowledge of their pid. Thus, the orted cannot send the SIGTSTP to the
individual target pid's. However, if the orted hits the "leader of the
process group that contains its children", then that signal would also hit
the orted - thus causing the orted to "pause". There would be no way for
mpirun to "wake up" the orted after that point so it could subsequently
"unpause" the application.

Hence the decision was made to move the application procs into their own
process group. The orted can then signal the process group, thus ensuring
that all procs (grandchildren etc.) receive the signal - without disabling
the orted itself.
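
A minimal sketch of the scheme described above, not the actual orted code. It is written in Python for brevity, and the helper names (spawn_in_own_group, wait_for_state, pause_and_resume, demo) are made up for illustration. A stand-in "application proc" is moved into its own process group, and the parent (playing the orted) signals that group with SIGTSTP and SIGCONT without being stopped itself. Any descendants the child forked would inherit the same group unless they change it themselves.

```python
import os
import signal
import time


def spawn_in_own_group():
    """Fork a stand-in application proc and place it in its own process group."""
    pid = os.fork()
    if pid == 0:
        try:
            os.setpgid(0, 0)  # child becomes leader of a new group (pgid == pid)
            while True:
                time.sleep(1)  # idle; any descendants would inherit this group
        finally:
            os._exit(0)  # never fall back into the parent's code path
    os.setpgid(pid, pid)  # parent sets it too, so the group exists before we signal
    return pid


def wait_for_state(pid, flag, check, timeout=5.0):
    """Poll waitpid() until `check(status)` can be evaluated, or give up."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        done, status = os.waitpid(pid, flag | os.WNOHANG)
        if done == pid:
            return check(status)
        time.sleep(0.01)
    return False


def pause_and_resume(pgid):
    """Signal the whole group; the caller is not a member, so it keeps running."""
    os.killpg(pgid, signal.SIGTSTP)
    # pgid equals the group leader's pid, so we can wait on the leader directly.
    stopped = wait_for_state(pgid, os.WUNTRACED, os.WIFSTOPPED)
    os.killpg(pgid, signal.SIGCONT)
    resumed = wait_for_state(pgid, os.WCONTINUED, os.WIFCONTINUED)
    return stopped, resumed


def demo():
    pid = spawn_in_own_group()
    try:
        separate = os.getpgid(pid) != os.getpgrp()
        stopped, resumed = pause_and_resume(pid)
    finally:
        try:
            os.killpg(pid, signal.SIGKILL)
            os.waitpid(pid, 0)
        except (ProcessLookupError, ChildProcessError):
            pass
    return separate, stopped, resumed


if __name__ == "__main__":
    print(demo())
```

The same property that protects the orted here is what hides the procs from the host RM: a signal aimed at the orted's process group never reaches the group created by setpgid().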

If we want to retain this pause/restart behavior, then I see no way to
change the current method of putting the application procs into their own
process group. So I guess this issue becomes a choice:

* either we disable pause/restart by signal, or
* someone comes up with an alternative way of "pausing" the processes,
including any descendants, without disturbing the orted...or devises a
scheme for waking the orted up after it has been "paused". PMIx didn't
exist back then, but perhaps we might be able to use it to help us here
(e.g., a PMIx API to tell it to hit our orteds with a SIGCONT)?

Suggestions?
Ralph


Re: [OMPI devel] ORTED process group

2016-02-23 Thread Gilles Gouaillardet

Ralph,

my 0.02 US$

if i understand correctly, we put non-ORTE processes into a different
process group because ORTE *might* have grandchildren and further
descendants that it does not / cannot know about.
/* note we assume here these processes are all well raised and do not
create yet another process group */

my first impression is that in the case of Open MPI (i know ORTE is not
exclusively used by OMPI), that should be quite uncommon. fork used to
fail with the openib btl, and i am under the impression end users still
believe an MPI task cannot fork.

i would be fine with an MCA parameter that controls whether we want a new
process group or not.
iirc, we still "wrap" the fork syscall, so we could also issue a warning
if an MPI task forks (e.g. grandchildren won't be signaled).

as you pointed out, if the ORTE children are in the same process group as
orted, then we cannot signal the whole process group (orted would SIGTSTP
itself). then what about individually signaling all the processes in the
process group *except* orted ?
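
A rough sketch of what "signal everyone in the group except orted" could look like. It assumes Linux, where /proc can be scanned for PIDs, and the function names (pids_in_group, signal_group_except, demo) are hypothetical. It is also inherently racy: a process that forks between the scan and the kill() is missed, so this is an illustration of the idea rather than a robust implementation.

```python
import os
import signal
import time


def pids_in_group(pgid):
    """Return the PIDs currently in process group `pgid` (Linux /proc scan)."""
    members = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        pid = int(entry)
        try:
            if os.getpgid(pid) == pgid:
                members.append(pid)
        except (ProcessLookupError, PermissionError):
            # The process exited after listdir(), or we may not inspect it.
            continue
    return sorted(members)


def signal_group_except(pgid, sig, exclude):
    """Send `sig` to each member of `pgid` that is not in `exclude`."""
    signaled = []
    for pid in pids_in_group(pgid):
        if pid in exclude:
            continue
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            continue  # exited between the scan and the kill()
        signaled.append(pid)
    return signaled


def demo():
    """Exercise the helpers on a throwaway group, using signal 0 (no-op probe)."""
    pid = os.fork()
    if pid == 0:
        try:
            os.setpgid(0, 0)
            while True:
                time.sleep(1)
        finally:
            os._exit(0)
    os.setpgid(pid, pid)  # make sure the group exists before scanning
    try:
        everyone = signal_group_except(pid, 0, exclude=set())
        nobody = signal_group_except(pid, 0, exclude={pid})
    finally:
        os.kill(pid, signal.SIGKILL)
        os.waitpid(pid, 0)
    return everyone == [pid], nobody == []


if __name__ == "__main__":
    print(demo())
```

In the orted case the call would be something like signal_group_except(os.getpgrp(), signal.SIGTSTP, exclude={os.getpid()}), with the caveat about the race above.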


fwiw, on linux, modern RMs can use cgroups to terminate *all* processes
related to a given job, regardless of process group or session. i think
slurm can do that, and PBSPro also does that (but maybe only on SGI
machines)
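
For illustration only, a sketch of why cgroup membership sidesteps the process group question. The kernel lists every PID of a cgroup in its cgroup.procs file, so a tool can signal them all without knowing their process group or session. This is not how Slurm or PBSPro implement it. The function names are hypothetical, and the cgroup directory (typically somewhere under /sys/fs/cgroup) is supplied by the caller.

```python
import os
import signal


def pids_in_cgroup(cgroup_dir):
    """Read the PIDs listed in a cgroup's `cgroup.procs` file."""
    with open(os.path.join(cgroup_dir, "cgroup.procs")) as f:
        return [int(line) for line in f if line.strip()]


def kill_cgroup(cgroup_dir, sig=signal.SIGKILL):
    """Signal every process in the cgroup, whatever its process group or session."""
    signaled = []
    for pid in pids_in_cgroup(cgroup_dir):
        try:
            os.kill(pid, sig)
        except ProcessLookupError:
            continue  # already gone
        signaled.append(pid)
    return signaled
```

A real resource manager would also have to deal with processes that fork while the list is being walked, for example by freezing the cgroup first or by repeating the pass until cgroup.procs is empty.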


Cheers,

Gilles

___
devel mailing list
de...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
Link to this post: 
http://www.open-mpi.org/community/lists/devel/2016/02/18612.php