Ralph,

Will do on Monday.
About the first test: in my case echo $? returns 0.
I noticed this confusing message in your output:
mpirun noticed that process rank 0 with PID 24382 on node bend002 exited on signal 0 (Unknown signal 0).

About the second test, please note that my test program does "return 3;" whereas your mpi_no_op.c does "return 0;".

Cheers,

Gilles

Ralph Castain <r...@open-mpi.org> wrote:

>You might want to try again with current head of trunk as something seems off
>in what you are seeing - more below
>
>On Aug 22, 2014, at 3:12 AM, Gilles Gouaillardet
><gilles.gouaillar...@iferc.org> wrote:
>
>Ralph,
>
>I tried again after the merge and found the same behaviour, though the
>internals are very different.
>
>I run without any batch manager.
>
>From node0:
>mpirun -np 1 --mca btl tcp,self -host node1 ./abort
>
>exits with exit code zero :-(
>
>
>Hmmm...it works fine for me, without your patch:
>
>07:35:41 $ mpirun -n 1 -mca btl tcp,self -host bend002 ./abort
>Hello, World, I am 0 of 1
>--------------------------------------------------------------------------
>MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>with errorcode 2.
>
>NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>You may or may not see output from other processes, depending on
>exactly when Open MPI kills them.
>--------------------------------------------------------------------------
>--------------------------------------------------------------------------
>mpirun noticed that process rank 0 with PID 24382 on node bend002 exited on
>signal 0 (Unknown signal 0).
>--------------------------------------------------------------------------
>07:35:56 $ showcode
>130
>
>
>Short story: I applied pmix.2.patch and that fixed my problem.
>Could you please review it?
>
>Long story: I initially applied pmix.1.patch and it solved my problem.
>Then I ran
>mpirun -np 1 --mca btl openib,self -host node1 ./abort
>and I came back to square one: the exit code was zero.
>So I used the debugger and was unable to reproduce the issue
>(one more race condition, yeah!).
>Finally, I wrote pmix.2.patch, which fixed my issue, and realized that
>pmix.1.patch was no longer needed.
>Currently, and assuming pmix.2.patch is correct, I cannot tell whether
>pmix.1.patch is needed or not, since that part of the code is no longer executed.
>
>I also found one hang with the following trivial program within one node:
>
>#include <mpi.h>
>
>int main (int argc, char *argv[]) {
>    MPI_Init(&argc, &argv);
>    MPI_Finalize();
>    return 3;
>}
>
>From node0:
>$ mpirun -np 1 ./test
>-------------------------------------------------------
>Primary job terminated normally, but 1 process returned
>a non-zero exit code.. Per user-direction, the job has been aborted.
>-------------------------------------------------------
>
>AND THE PROGRAM HANGS
>
>
>This also works fine for me:
>
>07:37:27 $ mpirun -n 1 ./mpi_no_op
>07:37:36 $ cat mpi_no_op.c
>/* -*- C -*-
> *
> * $HEADER$
> *
> * The most basic of MPI applications
> */
>
>#include <stdio.h>
>#include "mpi.h"
>
>int main(int argc, char* argv[])
>{
>    MPI_Init(&argc, &argv);
>
>    MPI_Finalize();
>    return 0;
>}
>
>
>*but*
>$ mpirun -np 1 -host node1 ./test
>-------------------------------------------------------
>Primary job terminated normally, but 1 process returned
>a non-zero exit code.. Per user-direction, the job has been aborted.
>-------------------------------------------------------
>--------------------------------------------------------------------------
>mpirun detected that one or more processes exited with non-zero status,
>thus causing
>the job to be terminated. The first process to do so was:
>
>  Process name: [[22080,1],0]
>  Exit code: 3
>--------------------------------------------------------------------------
>
>mpirun returns with exit code 3.
>
>
>Likewise here - works just fine for me
>
>
>
>Then I found a strange behaviour with helloworld if only the self btl is used:
>
>$ mpirun -np 1 --mca btl self ./hw
>[helios91:23319] OPAL dss:unpack: got type 12 when expecting type 3
>[helios91:23319] [[22303,0],0] ORTE_ERROR_LOG: Pack data mismatch in
>file ../../../src/ompi-trunk/orte/orted/pmix/pmix_server_sendrecv.c at
>line 722
>
>The program returns with exit code zero, but displays an error message.
>
>Cheers,
>
>Gilles
>
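[Aside on that "got type 12 when expecting type 3" line: a message of this form presumably comes from a type check on unpack. The sketch below is NOT the OPAL DSS code; the type values and function names are invented purely to illustrate the kind of check involved.]

/* dss_sketch.c - NOT the OPAL DSS API; type codes and names are invented.
 * Illustrates the kind of check that produces
 * "got type X when expecting type Y": each packed item carries a type
 * tag, and unpack refuses to decode an item whose tag does not match
 * what the caller asked for. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MY_INT32  3    /* made-up type codes */
#define MY_STRING 12

static size_t pack_int32(uint8_t *buf, int32_t v)
{
    buf[0] = MY_INT32;                /* type tag first ... */
    memcpy(buf + 1, &v, sizeof(v));   /* ... then the payload */
    return 1 + sizeof(v);
}

static int unpack_int32(const uint8_t *buf, int32_t *v)
{
    if (buf[0] != MY_INT32) {         /* "pack data mismatch" */
        fprintf(stderr, "unpack: got type %d when expecting type %d\n",
                (int)buf[0], MY_INT32);
        return -1;
    }
    memcpy(v, buf + 1, sizeof(*v));
    return 0;
}

int main(void)
{
    uint8_t buf[8];
    int32_t out = 0;
    pack_int32(buf, 42);
    buf[0] = MY_STRING;               /* simulate the two sides disagreeing */
    return unpack_int32(buf, &out) == 0 ? 0 : 1;
}

[If something like this is what is happening, it would mean the sender packed one type while the receiver asked unpack for another, or the two sides fell out of sync in the buffer.]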
>On 2014/08/21 6:21, Ralph Castain wrote:
>
>I'm aware of the problem, but it will be fixed when the PMIx branch is merged
>later this week.
>
>On Aug 19, 2014, at 10:00 PM, Gilles Gouaillardet
><gilles.gouaillar...@iferc.org> wrote:
>
>Folks,
>
>let's look at the following trivial test program:
>
>#include <mpi.h>
>#include <stdio.h>
>
>int main (int argc, char * argv[]) {
>    int rank, size;
>    MPI_Init(&argc, &argv);
>    MPI_Comm_size(MPI_COMM_WORLD, &size);
>    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>    printf ("I am %d/%d and i abort\n", rank, size);
>    MPI_Abort(MPI_COMM_WORLD, 2);
>    printf ("%d/%d aborted !\n", rank, size);
>    return 3;
>}
>
>and let's run mpirun (trunk) on node0 and ask the MPI task to run on node1.
>With two tasks or more:
>
>node0 $ mpirun --mca btl tcp,self -host node1 -np 2 ./abort
>--------------------------------------------------------------------------
>MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>with errorcode 2.
>
>NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>You may or may not see output from other processes, depending on
>exactly when Open MPI kills them.
>--------------------------------------------------------------------------
>I am 1/2 and i abort
>I am 0/2 and i abort
>[node0:00740] 1 more process has sent help message help-mpi-api.txt / mpi-abort
>[node0:00740] Set MCA parameter "orte_base_help_aggregate" to 0 to see
>all help / error messages
>
>node0 $ echo $?
>0
>
>The exit status of mpirun is zero.
>/* this is why the MPI_Errhandler_fatal_c test fails in mtt */
>
>Now if we run only one task:
>
>node0 $ mpirun --mca btl tcp,self -host node1 -np 1 ./abort
>I am 0/1 and i abort
>--------------------------------------------------------------------------
>MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>with errorcode 2.
>
>NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>You may or may not see output from other processes, depending on
>exactly when Open MPI kills them.
>--------------------------------------------------------------------------
>--------------------------------------------------------------------------
>mpirun has exited due to process rank 0 with PID 15884 on
>node node1 exiting improperly. There are three reasons this could occur:
>
>1. this process did not call "init" before exiting, but others in
>the job did. This can cause a job to hang indefinitely while it waits
>for all processes to call "init". By rule, if one process calls "init",
>then ALL processes must call "init" prior to termination.
>
>2. this process called "init", but exited without calling "finalize".
>By rule, all processes that call "init" MUST call "finalize" prior to
>exiting or it will be considered an "abnormal termination"
>
>3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
>orte_create_session_dirs is set to false. In this case, the run-time cannot
>detect that the abort call was an abnormal termination. Hence, the only
>error message you will receive is this one.
>
>This may have caused other processes in the application to be
>terminated by signals sent by mpirun (as reported here).
>
>You can avoid this message by specifying -quiet on the mpirun command line.
>
>--------------------------------------------------------------------------
>node0 $ echo $?
>1
>
>mpirun displayed a misleading error message and exited with error code 1.
>/* I would have expected 2, or 3 in the worst-case scenario */
>
>I dug into it a bit and found a kind of race condition in orted (running on node1).
>Basically, when the process dies, it writes some data into the Open MPI session
>directory and exits.
>Exiting sends a SIGCHLD to orted and closes the socket/pipe connected to orted.
>In orted, the loss of connection is generally processed by libevent before the
>SIGCHLD, and as a consequence the exit code is not correctly set (i.e. it is
>left at zero).
>I did not see any kind of communication between the MPI task and orted
>(other than writing a file in the Open MPI session directory), as I would have
>expected.
>/* but this was just my initial guess; the truth is I do not know what
>is supposed to happen */
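[To make the ordering issue described above concrete, here is a minimal standalone sketch; it is illustrative only, not the actual orted/libevent code. The parent observes EOF on the pipe before it has reaped the child, so any exit code it could report at that moment is still zero; the real code only becomes available once waitpid() runs, normally after SIGCHLD.]

/* sketch.c - illustrative only, not orted code.
 * The child closes its end of a pipe and exits with code 3.
 * The parent sees EOF on the pipe (the "loss of connection" event)
 * before it has reaped the child, so the real exit code is only known
 * after waitpid() has been called. */
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int fds[2];
    if (pipe(fds) != 0) return 1;

    pid_t pid = fork();
    if (pid < 0) return 1;
    if (pid == 0) {            /* child */
        close(fds[0]);
        close(fds[1]);         /* parent will see EOF on fds[0] */
        _exit(3);
    }
    close(fds[1]);

    char c;
    while (read(fds[0], &c, 1) > 0)
        ;                      /* EOF: connection to the child is gone */
    printf("pipe closed, exit code not known yet (would be reported as 0)\n");

    int status = 0;
    waitpid(pid, &status, 0);  /* the SIGCHLD/waitpid side of the race */
    if (WIFEXITED(status))
        printf("real exit code: %d\n", WEXITSTATUS(status));
    return 0;
}

[In other words, whichever of the two events is delivered first, the job's exit status should only be finalized after the child has actually been reaped.]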
>I wrote the attached abort.patch to basically get it working.
>I highly suspect this is not the right thing to do, so I did not commit it.
>
>It works fine with two tasks or more.
>With only one task, mpirun displays a misleading error message but the
>exit status is OK.
>
>Could someone (Ralph?) have a look at this?
>
>Cheers,
>
>Gilles
>
>
>With abort.patch applied:
>
>node0 $ mpirun --mca btl tcp,self -host node1 -np 2 ./abort
>I am 1/2 and i abort
>I am 0/2 and i abort
>--------------------------------------------------------------------------
>MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>with errorcode 2.
>
>NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>You may or may not see output from other processes, depending on
>exactly when Open MPI kills them.
>--------------------------------------------------------------------------
>[node0:00920] 1 more process has sent help message help-mpi-api.txt / mpi-abort
>[node0:00920] Set MCA parameter "orte_base_help_aggregate" to 0 to see
>all help / error messages
>node0 $ echo $?
>2
>
>
>node0 $ mpirun --mca btl tcp,self -host node1 -np 1 ./abort
>I am 0/1 and i abort
>--------------------------------------------------------------------------
>MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>with errorcode 2.
>
>NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>You may or may not see output from other processes, depending on
>exactly when Open MPI kills them.
>--------------------------------------------------------------------------
>-------------------------------------------------------
>Primary job terminated normally, but 1 process returned
>a non-zero exit code.. Per user-direction, the job has been aborted.
>-------------------------------------------------------
>--------------------------------------------------------------------------
>mpirun detected that one or more processes exited with non-zero status,
>thus causing
>the job to be terminated. The first process to do so was:
>
>  Process name: [[7955,1],0]
>  Exit code: 2
>--------------------------------------------------------------------------
>node0 $ echo $?
>2
>
>
><abort.patch>_______________________________________________
>devel mailing list
>de...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>Link to this post:
>http://www.open-mpi.org/community/lists/devel/2014/08/15666.php
>
>_______________________________________________
>devel mailing list
>de...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>Link to this post:
>http://www.open-mpi.org/community/lists/devel/2014/08/15672.php
>
>
><pmix.1.patch><pmix.2.patch>_______________________________________________
>devel mailing list
>de...@open-mpi.org
>Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>Link to this post:
>http://www.open-mpi.org/community/lists/devel/2014/08/15689.php