Ralph,

my 0.02 US$

if i understand correctly, we put non-ORTE processes into a different process group because ORTE *might* have grandchildren (and their progeny) that ORTE does not / cannot know about. /* note we assume here these processes are all well raised and do not create yet another process group */

my first impression is that in the case of Open MPI (i know ORTE is not exclusively used by OMPI), that should be quite uncommon. fork used to fail with the openib btl, and i am under the impression end users still believe an MPI task cannot fork.

i would be fine with having an MCA parameter that controls whether we want a new process group or not. iirc, we still "wrap" the fork syscall, so we could also issue a warning if an MPI task forks (e.g. grandchildren won't be signaled).

as you pointed out, if the ORTE children are in the same process group as orted, then we cannot signal the process group as a whole (e.g. orted would SIGTSTP itself). then what about individually signaling all the processes in the process group *except* orted ?
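
a minimal sketch of that "signal everyone but orted" idea, assuming Linux (it scans /proc) and written in Python for brevity rather than the C that ORTE uses. the helper names pids_in_group and signal_group_except are hypothetical, not ORTE functions.

```python
import os
import signal


def pids_in_group(pgid):
    """Return the pids whose process group is pgid, found by scanning /proc."""
    members = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # not a per-process directory
        pid = int(entry)
        try:
            if os.getpgid(pid) == pgid:
                members.append(pid)
        except OSError:
            continue  # process exited, or we may not inspect it
    return members


def signal_group_except(pgid, sig, skip):
    """Send sig to each member of pgid one by one, skipping the pids in skip.

    skip would hold the daemon's own pid, so the daemon never stops itself.
    """
    signaled = []
    for pid in pids_in_group(pgid):
        if pid in skip:
            continue
        try:
            os.kill(pid, sig)
            signaled.append(pid)
        except ProcessLookupError:
            pass  # exited between the scan and the kill
    return signaled
```

the daemon would call something like signal_group_except(os.getpgid(0), signal.SIGTSTP, skip={os.getpid()}). the scan and the signaling are separate steps, so a process forked in between can be missed; a real implementation would probably have to rescan.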

fwiw, on linux, modern RMs can use cgroups to terminate *all* processes related to a given job, regardless of process group or session. i think slurm can do that, and PBSPro also does that (but maybe only on SGI machines)
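
for illustration only, here is roughly what cgroup-based cleanup looks like from user space, assuming a Linux cgroup directory whose cgroup.procs file lists the member pids one per line. the function names and the directory argument are hypothetical, and this is not a description of how slurm or PBSPro actually implement it.

```python
import os
import signal


def pids_in_cgroup(cgroup_dir):
    """Return the pids listed in <cgroup_dir>/cgroup.procs."""
    with open(os.path.join(cgroup_dir, "cgroup.procs")) as procs:
        return [int(line) for line in procs if line.strip()]


def signal_cgroup(cgroup_dir, sig=signal.SIGKILL):
    """Send sig once to every pid currently listed in the cgroup.

    Membership comes from the cgroup, so process groups and sessions
    play no role in deciding who gets signaled.
    """
    signaled = []
    for pid in pids_in_cgroup(cgroup_dir):
        try:
            os.kill(pid, sig)
            signaled.append(pid)
        except ProcessLookupError:
            pass  # exited after we read the list
    return signaled
```

this is a single pass; a real RM would presumably repeat it until the list is empty, or freeze the cgroup first. newer cgroup v2 kernels also expose a cgroup.kill file for the same purpose.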

Cheers,

Gilles

On 2/24/2016 11:01 AM, Ralph Castain wrote:
Hello all

The question was raised at today's developer workshop about our current practice of putting the application processes in a separate process group from their parent ORTE daemon. This has the unfortunate side effect of making the processes "invisible" to any host resource manager when they are launched via mpirun - i.e., the RM launches the orted's, but never sees the local application procs that the orted fork/exec's. Since those processes are then moved into a separate process group, the host RM has no way of killing them should the orted fail and the procs not suicide.

The request was made that we modify the orted so it no longer changes the application proc's process group. This will leave the orted and the application procs in the same process group, and so any signals delivered by the host RM to the orted will be received by all processes.

However, in reviewing the code, I (re)discovered why this was originally done. The issue stems from when Sun joined the OMPI project - their MPI implementation allowed the user to pause their job by hitting mpirun with a SIGTSTP, and then start again by hitting mpirun with a SIGCONT. These signals needed to be seen not just by the initial child processes started by the orted, but also by any subsequent child processes those processes might have started.

It is this latter point that led to the process group change. Since the "grandchild" processes were not started by the orted, the orted itself has no knowledge of their pid. Thus, the orted cannot send the SIGTSTP to the individual target pid's. However, if the orted hits the "leader of the process group that contains its children", then that signal would also hit the orted - thus causing the orted to "pause". There would be no way for mpirun to "wake up" the orted after that point so it could subsequently "unpause" the application.

Hence the decision was made to move the application procs into their own process group. The orted can then signal the process group, thus ensuring that all procs (grandchildren etc.) receive the signal - without disabling the orted itself.
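
The mechanism described above can be sketched as follows. This is a simplified illustration in Python rather than the actual orted code (which is C); launch_in_own_group and signal_app_group are hypothetical names, and a single child stands in for the local application procs.

```python
import os
import signal


def launch_in_own_group(argv):
    """Fork/exec argv as leader of a new process group; return its pid (== pgid)."""
    pid = os.fork()
    if pid == 0:
        try:
            # Child: start a new process group before exec, so anything the
            # application forks later inherits this group, not the launcher's.
            os.setpgid(0, 0)
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)  # only reached if setpgid or exec failed
    try:
        # Parent: set the group as well, so it exists by the time we return
        # no matter which side runs first.
        os.setpgid(pid, pid)
    except OSError:
        pass  # the child has already exec'ed, so it already moved itself
    return pid


def signal_app_group(pgid, sig):
    """Deliver sig to every process in the application's group.

    The launcher is in a different group, so it is not affected.
    """
    os.killpg(pgid, sig)
```

With app = launch_in_own_group(["sleep", "30"]), calling signal_app_group(app, signal.SIGTSTP) pauses the application and any descendants still in its group, and signal_app_group(app, signal.SIGCONT) resumes them, while the launcher keeps running throughout. Descendants that create their own process group are not reached.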

If we want to retain this pause/restart behavior, then I see no way to change the current method of putting the application procs into their own process group. So I guess this issue becomes a choice:

* either we disable pause/restart by signal
* or someone comes up with an alternative way of "pausing" the processes, including any descendants, without disturbing the orted...or devises a scheme for waking the orted up after it has been "paused". PMIx didn't exist back then, but perhaps we might be able to use it to help us here (e.g., a PMIx API to tell it to hit our orteds with a SIGCONT)?

Suggestions?
Ralph



_______________________________________________
devel mailing list
de...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
Link to this post: 
http://www.open-mpi.org/community/lists/devel/2016/02/18612.php
