I'm aware of the problem, but it will be fixed when the PMIx branch is merged 
later this week.

On Aug 19, 2014, at 10:00 PM, Gilles Gouaillardet 
<gilles.gouaillar...@iferc.org> wrote:

> Folks,
> 
> let's look at the following trivial test program:
> 
> #include <mpi.h>
> #include <stdio.h>
> 
> int main (int argc, char * argv[]) {
>    int rank, size;
>    MPI_Init(&argc, &argv);
>    MPI_Comm_size(MPI_COMM_WORLD, &size);
>    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>    printf ("I am %d/%d and i abort\n", rank, size);
>    MPI_Abort(MPI_COMM_WORLD, 2);
>    printf ("%d/%d aborted !\n", rank, size);
>    return 3;
> }
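> 
> (built in the usual way, e.g. mpicc abort.c -o abort, assuming the
> source file is saved as abort.c)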
> 
> and let's run mpirun (trunk) on node0, asking the MPI tasks to run on
> node1.
> first, with two tasks or more:
> 
> node0 $ mpirun --mca btl tcp,self -host node1 -np 2 ./abort
> --------------------------------------------------------------------------
> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
> with errorcode 2.
> 
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --------------------------------------------------------------------------
> I am 1/2 and i abort
> I am 0/2 and i abort
> [node0:00740] 1 more process has sent help message help-mpi-api.txt /
> mpi-abort
> [node0:00740] Set MCA parameter "orte_base_help_aggregate" to 0 to see
> all help / error messages
> 
> node0 $ echo $?
> 0
> 
> the exit status of mpirun is zero
> /* this is why the MPI_Errhandler_fatal_c test fails in mtt */
> 
> now if we run only one task:
> 
> node0 $ mpirun --mca btl tcp,self -host node1 -np 1 ./abort
> I am 0/1 and i abort
> --------------------------------------------------------------------------
> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
> with errorcode 2.
> 
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun has exited due to process rank 0 with PID 15884 on
> node node1 exiting improperly. There are three reasons this could occur:
> 
> 1. this process did not call "init" before exiting, but others in
> the job did. This can cause a job to hang indefinitely while it waits
> for all processes to call "init". By rule, if one process calls "init",
> then ALL processes must call "init" prior to termination.
> 
> 2. this process called "init", but exited without calling "finalize".
> By rule, all processes that call "init" MUST call "finalize" prior to
> exiting or it will be considered an "abnormal termination"
> 
> 3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
> orte_create_session_dirs is set to false. In this case, the run-time cannot
> detect that the abort call was an abnormal termination. Hence, the only
> error message you will receive is this one.
> 
> This may have caused other processes in the application to be
> terminated by signals sent by mpirun (as reported here).
> 
> You can avoid this message by specifying -quiet on the mpirun command line.
> 
> --------------------------------------------------------------------------
> node0 $ echo $?
> 1
> 
> mpirun displayed a misleading error message and exited with error
> code 1
> /* I would have expected 2, or 3 in the worst case scenario */
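> 
> for reference, here is how an exit code normally propagates from a
> dead child to its parent (a generic POSIX sketch, not ORTE code): the
> parent only observes the 2 if it reaps the child with waitpid():
> 
> #include <stdio.h>
> #include <sys/wait.h>
> #include <unistd.h>
> 
> int main (void) {
>     int status;
>     pid_t pid = fork();
>     if (pid == 0) {
>         _exit(2);  /* stands in for the task aborting with errorcode 2 */
>     }
>     waitpid(pid, &status, 0);  /* blocking reap: always sees the status */
>     if (WIFEXITED(status))
>         printf("child exited with code %d\n", WEXITSTATUS(status));
>     return WIFEXITED(status) ? WEXITSTATUS(status) : 1;
> }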
> 
> 
> I dug into it a bit and found a kind of race condition in orted
> (running on node1).
> basically, when the process dies, it writes an abort marker file into
> the Open MPI session directory and exits.
> exiting sends a SIGCHLD to orted and closes the socket/pipe connected
> to orted.
> in orted, the loss of connection is generally processed by libevent
> before the SIGCHLD, and as a consequence the exit code is not set
> correctly (e.g. it is left at zero).
> I did not see any other communication between the MPI task and orted
> (except writing a file in the Open MPI session directory) as I would
> have expected
> /* but this was just my initial guess, the truth is I do not know what
> is supposed to happen */
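> 
> to illustrate the suspected ordering, here is a minimal standalone
> libevent 2.x sketch (not orted's actual code): the child's death
> closes the pipe and raises SIGCHLD at essentially the same time, and
> the event loop typically delivers the pipe EOF callback before the
> SIGCHLD callback, so a daemon that records the exit status only in its
> SIGCHLD path can handle the connection loss first:
> 
> #include <event2/event.h>
> #include <signal.h>
> #include <stdio.h>
> #include <sys/wait.h>
> #include <unistd.h>
> 
> static struct event_base *base;
> static int pending = 2;   /* leave the loop once both callbacks ran */
> 
> static void pipe_cb (evutil_socket_t fd, short what, void *arg) {
>     char buf[16];
>     if (read(fd, buf, sizeof(buf)) == 0)
>         printf("pipe EOF: connection lost, exit status still unknown\n");
>     if (--pending == 0) event_base_loopbreak(base);
> }
> 
> static void sigchld_cb (evutil_socket_t sig, short what, void *arg) {
>     int status;
>     if (waitpid(-1, &status, WNOHANG) > 0 && WIFEXITED(status))
>         printf("SIGCHLD: child really exited with %d\n",
>                WEXITSTATUS(status));
>     if (--pending == 0) event_base_loopbreak(base);
> }
> 
> int main (void) {
>     int fds[2];
>     pipe(fds);
>     if (fork() == 0) {    /* child: mimic the aborting MPI task */
>         close(fds[0]);
>         sleep(1);         /* let the parent arm its events first */
>         _exit(2);         /* closes fds[1] and raises SIGCHLD together */
>     }
>     close(fds[1]);
>     base = event_base_new();
>     struct event *ev_pipe = event_new(base, fds[0], EV_READ, pipe_cb, NULL);
>     struct event *ev_sig  = evsignal_new(base, SIGCHLD, sigchld_cb, NULL);
>     event_add(ev_pipe, NULL);
>     event_add(ev_sig, NULL);
>     event_base_dispatch(base);  /* the EOF is usually dispatched first */
>     event_free(ev_pipe);
>     event_free(ev_sig);
>     event_base_free(base);
>     return 0;
> }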
> 
> I wrote the attached abort.patch to basically get this working.
> I highly suspect it is not the right thing to do, so I did not commit it.
> 
> with the patch applied, it works fine with two tasks or more.
> with only one task, mpirun displays a misleading error message but the
> exit status is correct.
> 
> could someone (Ralph?) have a look at this?
> 
> Cheers,
> 
> Gilles
> 
> 
> here are the same runs with abort.patch applied:
> 
> node0 $ mpirun --mca btl tcp,self -host node1 -np 2 ./abort
> I am 1/2 and i abort
> I am 0/2 and i abort
> --------------------------------------------------------------------------
> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
> with errorcode 2.
> 
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --------------------------------------------------------------------------
> [node0:00920] 1 more process has sent help message help-mpi-api.txt /
> mpi-abort
> [node0:00920] Set MCA parameter "orte_base_help_aggregate" to 0 to see
> all help / error messages
> node0 $ echo $?
> 2
> 
> 
> 
> node0 $ mpirun --mca btl tcp,self -host node1 -np 1 ./abort
> I am 0/1 and i abort
> --------------------------------------------------------------------------
> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
> with errorcode 2.
> 
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --------------------------------------------------------------------------
> -------------------------------------------------------
> Primary job  terminated normally, but 1 process returned
> a non-zero exit code.. Per user-direction, the job has been aborted.
> -------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun detected that one or more processes exited with non-zero status,
> thus causing
> the job to be terminated. The first process to do so was:
> 
>  Process name: [[7955,1],0]
>  Exit code:    2
> --------------------------------------------------------------------------
> node0 $ echo $?
> 2
> 
> 
> 
> <abort.patch>
