I think these are fixed now - at least, your test cases all pass for me
On Aug 22, 2014, at 9:12 AM, Ralph Castain <r...@open-mpi.org> wrote:
> 
> On Aug 22, 2014, at 9:06 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
> 
>> Ralph,
>> 
>> Will do on Monday.
>> 
>> About the first test, in my case echo $? returns 0
> 
> My "showcode" is just an alias for the echo
> 
>> I noticed this confusing message in your output :
>> mpirun noticed that process rank 0 with PID 24382 on node bend002 exited on
>> signal 0 (Unknown signal 0).
> 
> I'll take a look at why that happened
> 
>> About the second test, please note my test program returns 3,
>> whereas your mpi_no_op.c returns 0.
> 
> I didn't see that little cuteness - sigh
> 
>> Cheers,
>> 
>> Gilles
>> 
>> Ralph Castain <r...@open-mpi.org> wrote:
>> You might want to try again with the current head of trunk, as something seems
>> off in what you are seeing - more below
>> 
>> On Aug 22, 2014, at 3:12 AM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:
>> 
>>> Ralph,
>>> 
>>> i tried again after the merge and found the same behaviour, though the
>>> internals are very different.
>>> 
>>> i run without any batch manager.
>>> 
>>> from node0:
>>> mpirun -np 1 --mca btl tcp,self -host node1 ./abort
>>> 
>>> exits with exit code zero :-(
>> 
>> Hmmm...it works fine for me, without your patch:
>> 
>> 07:35:41 $ mpirun -n 1 -mca btl tcp,self -host bend002 ./abort
>> Hello, World, I am 0 of 1
>> --------------------------------------------------------------------------
>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>> with errorcode 2.
>> 
>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>> You may or may not see output from other processes, depending on
>> exactly when Open MPI kills them.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 0 with PID 24382 on node bend002 exited on
>> signal 0 (Unknown signal 0).
>> --------------------------------------------------------------------------
>> 07:35:56 $ showcode
>> 130
>> 
>>> short story : i applied pmix.2.patch and that fixed my problem.
>>> could you please review this ?
>>> 
>>> long story :
>>> i initially applied pmix.1.patch and it solved my problem.
>>> then i ran
>>> mpirun -np 1 --mca btl openib,self -host node1 ./abort
>>> and i came back to square one : exit code is zero.
>>> so i used the debugger and was unable to reproduce the issue
>>> (one more race condition, yeah !)
>>> finally, i wrote pmix.2.patch, fixed my issue and realized that
>>> pmix.1.patch was no longer needed.
>>> currently, and assuming pmix.2.patch is correct, i cannot tell whether
>>> pmix.1.patch is needed or not, since this part of the code is no longer executed.
>>> 
>>> i also found one hang with the following trivial program within one node :
>>> 
>>> #include <mpi.h>
>>> 
>>> int main (int argc, char *argv[]) {
>>>     MPI_Init(&argc, &argv);
>>>     MPI_Finalize();
>>>     return 3;
>>> }
>>> 
>>> from node0 :
>>> $ mpirun -np 1 ./test
>>> -------------------------------------------------------
>>> Primary job terminated normally, but 1 process returned
>>> a non-zero exit code.. Per user-direction, the job has been aborted.
>>> -------------------------------------------------------
>>> 
>>> AND THE PROGRAM HANGS
>> 
>> This also works fine for me:
>> 
>> 07:37:27 $ mpirun -n 1 ./mpi_no_op
>> 07:37:36 $ cat mpi_no_op.c
>> /* -*- C -*-
>>  *
>>  * $HEADER$
>>  *
>>  * The most basic of MPI applications
>>  */
>> 
>> #include <stdio.h>
>> #include "mpi.h"
>> 
>> int main(int argc, char* argv[])
>> {
>>     MPI_Init(&argc, &argv);
>> 
>>     MPI_Finalize();
>>     return 0;
>> }
>> 
>>> *but*
>>> $ mpirun -np 1 -host node1 ./test
>>> -------------------------------------------------------
>>> Primary job terminated normally, but 1 process returned
>>> a non-zero exit code.. Per user-direction, the job has been aborted.
>>> -------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> mpirun detected that one or more processes exited with non-zero status, thus causing
>>> the job to be terminated. The first process to do so was:
>>> 
>>>   Process name: [[22080,1],0]
>>>   Exit code:    3
>>> --------------------------------------------------------------------------
>>> 
>>> returns with exit code 3.
>> 
>> Likewise here - works just fine for me
>> 
>>> then i found a strange behaviour with helloworld if only the self btl is used :
>>> $ mpirun -np 1 --mca btl self ./hw
>>> [helios91:23319] OPAL dss:unpack: got type 12 when expecting type 3
>>> [helios91:23319] [[22303,0],0] ORTE_ERROR_LOG: Pack data mismatch in
>>> file ../../../src/ompi-trunk/orte/orted/pmix/pmix_server_sendrecv.c at line 722
>>> 
>>> the program returns with exit code zero, but displays an error message.
>>> 
>>> Cheers,
>>> 
>>> Gilles
>>> 
>>> On 2014/08/21 6:21, Ralph Castain wrote:
>>>> I'm aware of the problem, but it will be fixed when the PMIx branch is
>>>> merged later this week.
>>>> 
>>>> On Aug 19, 2014, at 10:00 PM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:
>>>> 
>>>>> Folks,
>>>>> 
>>>>> let's look at the following trivial test program :
>>>>> 
>>>>> #include <mpi.h>
>>>>> #include <stdio.h>
>>>>> 
>>>>> int main (int argc, char * argv[]) {
>>>>>     int rank, size;
>>>>>     MPI_Init(&argc, &argv);
>>>>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>     printf ("I am %d/%d and i abort\n", rank, size);
>>>>>     MPI_Abort(MPI_COMM_WORLD, 2);
>>>>>     printf ("%d/%d aborted !\n", rank, size);
>>>>>     return 3;
>>>>> }
>>>>> 
>>>>> and let's run mpirun (trunk) on node0 and ask the mpi task to run on node1 :
>>>>> with two tasks or more :
>>>>> 
>>>>> node0 $ mpirun --mca btl tcp,self -host node1 -np 2 ./abort
>>>>> --------------------------------------------------------------------------
>>>>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>>>>> with errorcode 2.
>>>>> 
>>>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>>>> You may or may not see output from other processes, depending on
>>>>> exactly when Open MPI kills them.
>>>>> --------------------------------------------------------------------------
>>>>> I am 1/2 and i abort
>>>>> I am 0/2 and i abort
>>>>> [node0:00740] 1 more process has sent help message help-mpi-api.txt / mpi-abort
>>>>> [node0:00740] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>>>> 
>>>>> node0 $ echo $?
>>>>> 0
>>>>> 
>>>>> the exit status of mpirun is zero
>>>>> /* this is why the MPI_Errhandler_fatal_c test fails in mtt */
>>>>> 
>>>>> now if we run only one task :
>>>>> 
>>>>> node0 $ mpirun --mca btl tcp,self -host node1 -np 1 ./abort
>>>>> I am 0/1 and i abort
>>>>> --------------------------------------------------------------------------
>>>>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>>>>> with errorcode 2.
>>>>> 
>>>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>>>> You may or may not see output from other processes, depending on
>>>>> exactly when Open MPI kills them.
>>>>> --------------------------------------------------------------------------
>>>>> --------------------------------------------------------------------------
>>>>> mpirun has exited due to process rank 0 with PID 15884 on
>>>>> node node1 exiting improperly. There are three reasons this could occur:
>>>>> 
>>>>> 1. this process did not call "init" before exiting, but others in
>>>>> the job did. This can cause a job to hang indefinitely while it waits
>>>>> for all processes to call "init". By rule, if one process calls "init",
>>>>> then ALL processes must call "init" prior to termination.
>>>>> 
>>>>> 2. this process called "init", but exited without calling "finalize".
>>>>> By rule, all processes that call "init" MUST call "finalize" prior to
>>>>> exiting or it will be considered an "abnormal termination"
>>>>> 
>>>>> 3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
>>>>> orte_create_session_dirs is set to false. In this case, the run-time cannot
>>>>> detect that the abort call was an abnormal termination. Hence, the only
>>>>> error message you will receive is this one.
>>>>> 
>>>>> This may have caused other processes in the application to be
>>>>> terminated by signals sent by mpirun (as reported here).
>>>>> 
>>>>> You can avoid this message by specifying -quiet on the mpirun command line.
>>>>> 
>>>>> --------------------------------------------------------------------------
>>>>> node0 $ echo $?
>>>>> 1
>>>>> 
>>>>> the program displayed a misleading error message and mpirun exited with error code 1
>>>>> /* i would have expected 2, or 3 in the worst case scenario */
>>>>> 
>>>>> i dug into it a bit and found a kind of race condition in orted (running on node1).
>>>>> basically, when the process dies, it writes stuff in the openmpi session directory and exits.
>>>>> exiting sends a SIGCHLD to orted and closes the socket/pipe connected to orted.
>>>>> on orted, the loss of connection is generally processed before the SIGCHLD by libevent,
>>>>> and as a consequence, the exit code is not correctly set (e.g. it is left at zero).
>>>>> i did not see any kind of communication between the mpi task and orted
>>>>> (except writing a file in the openmpi session directory) as i would have expected
>>>>> /* but this was just my initial guess, the truth is i do not know what is supposed to happen */
>>>>> 
>>>>> i wrote the attached abort.patch to basically get it working.
>>>>> i highly suspect this is not the right thing to do, so i did not commit it.
>>>>> 
>>>>> it works fine with two tasks or more.
>>>>> with only one task, mpirun displays a misleading error message but the exit status is ok.
>>>>> 
>>>>> could someone (Ralph ?) have a look at this ?
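
To make sure we are all looking at the same ordering problem, here is a minimal, self-contained sketch of the race Gilles describes, written against plain libevent. This is not the orted/pmix server code, and every name in it is made up for illustration; the point is only that the daemon cannot finalize a child's state on the connection-loss event alone, it has to hold off until waitpid() has delivered the real exit status:

/*
 * race.c - illustrative sketch only, NOT the orted/pmix code.
 * Build (assuming libevent2 is installed):  cc race.c -levent
 *
 * The parent watches two libevent events for one child: EOF on a pipe
 * (the "connection loss") and SIGCHLD.  Either may be dispatched first,
 * so the child's state is finalized only once BOTH have been seen and
 * the exit code has been taken from waitpid(), never from the EOF path.
 */
#include <event2/event.h>
#include <signal.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static struct event_base *base;
static struct event *pipe_ev;
static int saw_eof = 0, reaped = 0, exit_code = -1;

static void maybe_finish(void)
{
    /* Report and leave the loop only when the pipe has closed AND the
     * child has been reaped, so exit_code holds the real status. */
    if (saw_eof && reaped) {
        printf("child exit code: %d\n", exit_code);
        event_base_loopexit(base, NULL);
    }
}

static void pipe_cb(evutil_socket_t fd, short what, void *arg)
{
    char buf[64];
    if (read(fd, buf, sizeof(buf)) <= 0) {   /* EOF: child closed its end */
        saw_eof = 1;
        event_del(pipe_ev);
        maybe_finish();                      /* do NOT set exit_code here */
    }
}

static void sigchld_cb(evutil_socket_t sig, short what, void *arg)
{
    int status;
    pid_t pid = *(pid_t *)arg;
    if (waitpid(pid, &status, WNOHANG) == pid) {
        reaped = 1;
        exit_code = WIFEXITED(status) ? WEXITSTATUS(status) : -1;
        maybe_finish();
    }
}

int main(void)
{
    int fds[2];
    if (pipe(fds) != 0)
        return 1;

    pid_t pid = fork();
    if (pid == 0) {               /* child: pause briefly, then exit 3 */
        close(fds[0]);
        usleep(100000);           /* give the parent time to set up its events */
        close(fds[1]);
        _exit(3);
    }
    close(fds[1]);

    base = event_base_new();
    pipe_ev = event_new(base, fds[0], EV_READ | EV_PERSIST, pipe_cb, NULL);
    struct event *sig_ev = evsignal_new(base, SIGCHLD, sigchld_cb, &pid);
    event_add(pipe_ev, NULL);
    event_add(sig_ev, NULL);

    event_base_dispatch(base);    /* returns once maybe_finish() fires */

    event_free(pipe_ev);
    event_free(sig_ev);
    event_base_free(base);
    return 0;
}

Whatever the right fix inside the pmix server turns out to be, the gating idea is the same: take the exit code from the reaped status, never from the connection-loss path.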
>>>>> 
>>>>> Cheers,
>>>>> 
>>>>> Gilles
>>>>> 
>>>>> 
>>>>> node0 $ mpirun --mca btl tcp,self -host node1 -np 2 ./abort
>>>>> I am 1/2 and i abort
>>>>> I am 0/2 and i abort
>>>>> --------------------------------------------------------------------------
>>>>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>>>>> with errorcode 2.
>>>>> 
>>>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>>>> You may or may not see output from other processes, depending on
>>>>> exactly when Open MPI kills them.
>>>>> --------------------------------------------------------------------------
>>>>> [node0:00920] 1 more process has sent help message help-mpi-api.txt / mpi-abort
>>>>> [node0:00920] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>>>> node0 $ echo $?
>>>>> 2
>>>>> 
>>>>> 
>>>>> node0 $ mpirun --mca btl tcp,self -host node1 -np 1 ./abort
>>>>> I am 0/1 and i abort
>>>>> --------------------------------------------------------------------------
>>>>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>>>>> with errorcode 2.
>>>>> 
>>>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>>>> You may or may not see output from other processes, depending on
>>>>> exactly when Open MPI kills them.
>>>>> --------------------------------------------------------------------------
>>>>> -------------------------------------------------------
>>>>> Primary job terminated normally, but 1 process returned
>>>>> a non-zero exit code.. Per user-direction, the job has been aborted.
>>>>> -------------------------------------------------------
>>>>> --------------------------------------------------------------------------
>>>>> mpirun detected that one or more processes exited with non-zero status, thus causing
>>>>> the job to be terminated. The first process to do so was:
>>>>> 
>>>>>   Process name: [[7955,1],0]
>>>>>   Exit code:    2
>>>>> --------------------------------------------------------------------------
>>>>> node0 $ echo $?
>>>>> 2
>>>>> 
>>>>> <abort.patch>
>>> 
>>> <pmix.1.patch><pmix.2.patch>
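
One more note on the "exited on signal 0 (Unknown signal 0)" message quoted earlier in the thread: that wording usually shows up when a wait status is reported through the killed-by-signal path without first checking how the child actually terminated. For reference, a minimal generic sketch of the usual decoding; this is plain POSIX, not the ORTE code, and report_child is a made-up helper name:

/* status.c - generic sketch of decoding a waitpid() status, not the
 * ORTE code.  A message like "exited on signal 0" typically means the
 * signal branch was printed without checking WIFSIGNALED() first.
 */
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static void report_child(pid_t pid, int status)
{
    if (WIFEXITED(status)) {
        /* Normal termination: WEXITSTATUS() is the code the child returned,
         * i.e. the value one would expect mpirun to propagate. */
        printf("process %ld exited with status %d\n",
               (long)pid, WEXITSTATUS(status));
    } else if (WIFSIGNALED(status)) {
        /* Only in this branch is WTERMSIG() meaningful. */
        printf("process %ld was killed by signal %d\n",
               (long)pid, WTERMSIG(status));
    }
}

int main(void)
{
    pid_t pid = fork();
    if (pid == 0)
        _exit(2);                 /* child: mimic an abort with code 2 */

    int status;
    waitpid(pid, &status, 0);
    report_child(pid, status);    /* prints "... exited with status 2" */
    return 0;
}

If the normal-exit branch is the one that propagates WEXITSTATUS(), then the MPI_Abort errorcode 2 (or the return 3) would show up directly in echo $?, which is what the transcripts above expect.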