I'm aware of the problem, but it will be fixed when the PMIx branch is merged later this week.
On Aug 19, 2014, at 10:00 PM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:

> Folks,
>
> let's look at the following trivial test program:
>
> #include <mpi.h>
> #include <stdio.h>
>
> int main (int argc, char * argv[]) {
>     int rank, size;
>     MPI_Init(&argc, &argv);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     printf ("I am %d/%d and i abort\n", rank, size);
>     MPI_Abort(MPI_COMM_WORLD, 2);
>     printf ("%d/%d aborted !\n", rank, size);
>     return 3;
> }
>
> and let's run mpirun (trunk) on node0 and ask the MPI tasks to run on
> node1.
> with two tasks or more:
>
> node0 $ mpirun --mca btl tcp,self -host node1 -np 2 ./abort
> --------------------------------------------------------------------------
> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
> with errorcode 2.
>
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --------------------------------------------------------------------------
> I am 1/2 and i abort
> I am 0/2 and i abort
> [node0:00740] 1 more process has sent help message help-mpi-api.txt /
> mpi-abort
> [node0:00740] Set MCA parameter "orte_base_help_aggregate" to 0 to see
> all help / error messages
>
> node0 $ echo $?
> 0
>
> the exit status of mpirun is zero
> /* this is why the MPI_Errhandler_fatal_c test fails in MTT */
>
> now if we run only one task:
>
> node0 $ mpirun --mca btl tcp,self -host node1 -np 1 ./abort
> I am 0/1 and i abort
> --------------------------------------------------------------------------
> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
> with errorcode 2.
>
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 0 with PID 15884 on
> node node1 exiting improperly. There are three reasons this could occur:
>
> 1. this process did not call "init" before exiting, but others in
> the job did. This can cause a job to hang indefinitely while it waits
> for all processes to call "init". By rule, if one process calls "init",
> then ALL processes must call "init" prior to termination.
>
> 2. this process called "init", but exited without calling "finalize".
> By rule, all processes that call "init" MUST call "finalize" prior to
> exiting or it will be considered an "abnormal termination"
>
> 3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
> orte_create_session_dirs is set to false. In this case, the run-time cannot
> detect that the abort call was an abnormal termination. Hence, the only
> error message you will receive is this one.
>
> This may have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
>
> You can avoid this message by specifying -quiet on the mpirun command line.
>
> --------------------------------------------------------------------------
> node0 $ echo $?
> 1
>
> the program displayed a misleading error message, and mpirun exited with
> status 1
> /* I would have expected 2, or 3 in the worst-case scenario */
>
>
> I dug into this a bit and found a kind of race condition in orted (running
> on node1).
> basically, when the MPI process dies, it writes some state into the Open MPI
> session directory and exits.
> exiting sends a SIGCHLD to orted and closes the socket/pipe connected to
> orted.
> in orted, the loss of the connection is generally processed by libevent
> before the SIGCHLD, and as a consequence the exit code is not set
> correctly (i.e. it is left at zero).
>
> I did not see any kind of communication between the MPI task and orted
> (other than writing a file into the Open MPI session directory), as I
> would have expected
> /* but this was just my initial guess; the truth is I do not know what
> is supposed to happen */
>
> I wrote the attached abort.patch to get this basically working.
> I highly suspect it is not the right thing to do, so I did not commit it.
>
> it works fine with two tasks or more.
> with only one task, mpirun still displays a misleading error message, but
> the exit status is correct.
>
> could someone (Ralph?) have a look at this?
>
> Cheers,
>
> Gilles
>
>
> node0 $ mpirun --mca btl tcp,self -host node1 -np 2 ./abort
> I am 1/2 and i abort
> I am 0/2 and i abort
> --------------------------------------------------------------------------
> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
> with errorcode 2.
>
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --------------------------------------------------------------------------
> [node0:00920] 1 more process has sent help message help-mpi-api.txt /
> mpi-abort
> [node0:00920] Set MCA parameter "orte_base_help_aggregate" to 0 to see
> all help / error messages
> node0 $ echo $?
> 2
>
>
>
> node0 $ mpirun --mca btl tcp,self -host node1 -np 1 ./abort
> I am 0/1 and i abort
> --------------------------------------------------------------------------
> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
> with errorcode 2.
>
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --------------------------------------------------------------------------
> -------------------------------------------------------
> Primary job terminated normally, but 1 process returned
> a non-zero exit code.
> Per user-direction, the job has been aborted.
> -------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun detected that one or more processes exited with non-zero status,
> thus causing
> the job to be terminated. The first process to do so was:
>
> Process name: [[7955,1],0]
> Exit code: 2
> --------------------------------------------------------------------------
> node0 $ echo $?
> 2
>
>
>
> <abort.patch>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/08/15666.php