Ralph,
my 0.02 US$
if i understand correctly, we put non-ORTE processes into a different
process group because
ORTE *might* have grand-children and their progeny, and ORTE does not /
cannot know about.
/* note we assume here these processes are all well raised and do not
create yet an other
process group */
my first impression is that in the case of OpenMPI (i know ORTE is not
exclusively used by OMPI),
that should be quite uncommon. forked used to fail with the openib btl,
and i am under the
impression endusers still believe an MPI task cannot fork.
i would be fine to have a MCA parameter controls whether we want a new
process group or not.
iirc, we still "wrap" the fork syscall, so we could also issue a warning
if an MPI task forks.
(e.g. grandchildren won't be signaled)
as you pointed, If ORTE children are in the same process group than
orted, then we cannot signal the process
group leader (e.g. orted SIGSTP itself). then what about individually
signaling all the processes in the process group *except* orted ?
fwiw, on linux, modern RM can use cgroups to terminate *all* processes
related to a given job, and regardless process/session group. i think
slurm can do that, and PBSPro also does that (but maybe only on SGI
machines)
Cheers,
Gilles
On 2/24/2016 11:01 AM, Ralph Castain wrote:
Hello all
The question was raised at today's developer workshop about our
current practice of putting the application processes in a separate
process group from their parent ORTE daemon. This has the unfortunate
side effect of making the processes "invisible" to any host resource
manager when they are launched via mpirun - i.e., the RM launches the
orted's, but never sees the local application procs that the orted
fork/exec's. Since those processes are then moved into a separate
process group, the host RM has no way of killing them should the orted
fail and the procs not suicide.
The request was made that we modify the orted so it no longer changes
the application proc's process group. This will leave the orted and
the application procs in the same process group, and so any signals
delivered by the host RM to the orted will be received by all processes.
However, in reviewing the code, I (re)discovered why this was
originally done. The issue stems from when Sun joined the OMPI project
- their MPI implementation allowed the user to pause their job by
hitting mpirun with a SIGTSTP, and then start again by hitting mpirun
with a SIGCNT. These signals needed to be seen not just by the initial
child processes started by the orted, but also by any subsequent child
processes those processes might have started.
It is this latter point that led to the process group change. Since
the "grandchild" processes were not started by the orted, the orted
itself has no knowledge of their pid. Thus, the orted cannot send the
SIGSTP to the individual target pid's. However, if the orted hits the
"leader of the process group that contains its children", then that
signal would also hit the orted - thus causing the orted to "pause".
There would be no way for mpirun to "wake up" the orted after that
point so it could subsequently "unpause" the application.
Hence the decision was made to move the application procs into their
own process group. The orted can then signal the process group, thus
ensuring that all procs (grandchildren etc.) receive the signal -
without disabling the orted itself.
If we want to retain this pause/restart behavior, then I see no way to
change the current method of putting the application procs into their
own process group. So I guess this issue becomes a choice:
* either we disable pause/restart by signal
* someone comes up with an alternative way of "pausing" the processes,
including any descendants, without disturbing the orted...or devise a
scheme for waking the orted up after it has been "paused". PMIx didn't
exist back then, but perhaps we might be able to use it to help us
here (e.g., a PMIx API to tell it to hit our orteds with a SIGCNT)?
Suggestions?
Ralph
_______________________________________________
devel mailing list
de...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
Link to this post:
http://www.open-mpi.org/community/lists/devel/2016/02/18612.php