On Aug 22, 2014, at 9:06 AM, Gilles Gouaillardet <[email protected]> wrote:
> Ralph,
>
> Will do on Monday
>
> About the first test, in my case echo $? returns 0

My "showcode" is just an alias for the echo

> I noticed this confusing message in your output :
> mpirun noticed that process rank 0 with PID 24382 on node bend002 exited on
> signal 0 (Unknown signal 0).

I'll take a look at why that happened

> About the second test, please note my test program does return 3;
> whereas your mpi_no_op.c does return 0;

I didn't see that little cuteness - sigh

> Cheers,
>
> Gilles
>
> Ralph Castain <[email protected]> wrote:
> You might want to try again with the current head of trunk, as something
> seems off in what you are seeing - more below
>
>
> On Aug 22, 2014, at 3:12 AM, Gilles Gouaillardet
> <[email protected]> wrote:
>
>> Ralph,
>>
>> I tried again after the merge and found the same behaviour, though the
>> internals are very different.
>>
>> I run without any batch manager.
>>
>> From node0:
>> mpirun -np 1 --mca btl tcp,self -host node1 ./abort
>>
>> exits with exit code zero :-(
>
> Hmmm... it works fine for me, without your patch:
>
> 07:35:41 $ mpirun -n 1 -mca btl tcp,self -host bend002 ./abort
> Hello, World, I am 0 of 1
> --------------------------------------------------------------------------
> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
> with errorcode 2.
>
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 24382 on node bend002 exited on
> signal 0 (Unknown signal 0).
> --------------------------------------------------------------------------
> 07:35:56 $ showcode
> 130
>
>>
>> Short story: I applied pmix.2.patch and that fixed my problem.
>> Could you please review it?
>>
>> Long story: I initially applied pmix.1.patch and it solved my problem.
>> Then I ran
>> mpirun -np 1 --mca btl openib,self -host node1 ./abort
>> and I came back to square one: the exit code was zero.
>> So I used the debugger and was unable to reproduce the issue
>> (one more race condition, yeah !).
>> Finally, I wrote pmix.2.patch, which fixed my issue, and realized that
>> pmix.1.patch was no longer needed.
>> Currently, and assuming pmix.2.patch is correct, I cannot tell whether
>> pmix.1.patch is needed or not,
>> since this part of the code is no longer executed.
>>
>> I also found one hang with the following trivial program within one node:
>>
>> #include <mpi.h>
>>
>> int main (int argc, char *argv[]) {
>>     MPI_Init(&argc, &argv);
>>     MPI_Finalize();
>>     return 3;
>> }
>>
>> From node0:
>> $ mpirun -np 1 ./test
>> -------------------------------------------------------
>> Primary job terminated normally, but 1 process returned
>> a non-zero exit code.. Per user-direction, the job has been aborted.
>> -------------------------------------------------------
>>
>> AND THE PROGRAM HANGS
>
> This also works fine for me:
>
> 07:37:27 $ mpirun -n 1 ./mpi_no_op
> 07:37:36 $ cat mpi_no_op.c
> /* -*- C -*-
>  *
>  * $HEADER$
>  *
>  * The most basic of MPI applications
>  */
>
> #include <stdio.h>
> #include "mpi.h"
>
> int main(int argc, char* argv[])
> {
>     MPI_Init(&argc, &argv);
>
>     MPI_Finalize();
>     return 0;
> }
>
>
>>
>> *but*
>> $ mpirun -np 1 -host node1 ./test
>> -------------------------------------------------------
>> Primary job terminated normally, but 1 process returned
>> a non-zero exit code.. Per user-direction, the job has been aborted.
>> -------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpirun detected that one or more processes exited with non-zero status,
>> thus causing the job to be terminated.
>> The first process to do so was:
>>
>>   Process name: [[22080,1],0]
>>   Exit code:    3
>> --------------------------------------------------------------------------
>>
>> and returns with exit code 3.
>
> Likewise here - works just fine for me
>
>
>>
>> Then I found a strange behaviour with helloworld if only the self btl is
>> used:
>> $ mpirun -np 1 --mca btl self ./hw
>> [helios91:23319] OPAL dss:unpack: got type 12 when expecting type 3
>> [helios91:23319] [[22303,0],0] ORTE_ERROR_LOG: Pack data mismatch in
>> file ../../../src/ompi-trunk/orte/orted/pmix/pmix_server_sendrecv.c at
>> line 722
>>
>> The program returns with exit code zero, but displays an error message.
>>
>> Cheers,
>>
>> Gilles
>>
>> On 2014/08/21 6:21, Ralph Castain wrote:
>>> I'm aware of the problem, but it will be fixed when the PMIx branch is
>>> merged later this week.
>>>
>>> On Aug 19, 2014, at 10:00 PM, Gilles Gouaillardet
>>> <[email protected]> wrote:
>>>
>>>> Folks,
>>>>
>>>> let's look at the following trivial test program:
>>>>
>>>> #include <mpi.h>
>>>> #include <stdio.h>
>>>>
>>>> int main (int argc, char * argv[]) {
>>>>     int rank, size;
>>>>     MPI_Init(&argc, &argv);
>>>>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>     printf ("I am %d/%d and i abort\n", rank, size);
>>>>     MPI_Abort(MPI_COMM_WORLD, 2);
>>>>     printf ("%d/%d aborted !\n", rank, size);
>>>>     return 3;
>>>> }
>>>>
>>>> and let's run mpirun (trunk) on node0 and ask the MPI task to run on
>>>> node1, with two tasks or more:
>>>>
>>>> node0 $ mpirun --mca btl tcp,self -host node1 -np 2 ./abort
>>>> --------------------------------------------------------------------------
>>>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>>>> with errorcode 2.
>>>>
>>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>>> You may or may not see output from other processes, depending on
>>>> exactly when Open MPI kills them.
>>>> --------------------------------------------------------------------------
>>>> I am 1/2 and i abort
>>>> I am 0/2 and i abort
>>>> [node0:00740] 1 more process has sent help message help-mpi-api.txt /
>>>> mpi-abort
>>>> [node0:00740] Set MCA parameter "orte_base_help_aggregate" to 0 to see
>>>> all help / error messages
>>>>
>>>> node0 $ echo $?
>>>> 0
>>>>
>>>> The exit status of mpirun is zero
>>>> /* this is why the MPI_Errhandler_fatal_c test fails in mtt */
>>>>
>>>> Now if we run only one task:
>>>>
>>>> node0 $ mpirun --mca btl tcp,self -host node1 -np 1 ./abort
>>>> I am 0/1 and i abort
>>>> --------------------------------------------------------------------------
>>>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>>>> with errorcode 2.
>>>>
>>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>>> You may or may not see output from other processes, depending on
>>>> exactly when Open MPI kills them.
>>>> --------------------------------------------------------------------------
>>>> --------------------------------------------------------------------------
>>>> mpirun has exited due to process rank 0 with PID 15884 on
>>>> node node1 exiting improperly. There are three reasons this could occur:
>>>>
>>>> 1. this process did not call "init" before exiting, but others in
>>>> the job did. This can cause a job to hang indefinitely while it waits
>>>> for all processes to call "init". By rule, if one process calls "init",
>>>> then ALL processes must call "init" prior to termination.
>>>>
>>>> 2. this process called "init", but exited without calling "finalize".
>>>> By rule, all processes that call "init" MUST call "finalize" prior to
>>>> exiting or it will be considered an "abnormal termination"
>>>>
>>>> 3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
>>>> orte_create_session_dirs is set to false.
>>>> In this case, the run-time cannot
>>>> detect that the abort call was an abnormal termination. Hence, the only
>>>> error message you will receive is this one.
>>>>
>>>> This may have caused other processes in the application to be
>>>> terminated by signals sent by mpirun (as reported here).
>>>>
>>>> You can avoid this message by specifying -quiet on the mpirun command line.
>>>>
>>>> --------------------------------------------------------------------------
>>>> node0 $ echo $?
>>>> 1
>>>>
>>>> The program displayed a misleading error message and mpirun exited with
>>>> error code 1
>>>> /* I would have expected 2, or 3 in the worst-case scenario */
>>>>
>>>>
>>>> I dug into it a bit and found a kind of race condition in orted (running
>>>> on node1).
>>>> Basically, when the process dies, it writes stuff into the openmpi session
>>>> directory and exits.
>>>> Exiting sends a SIGCHLD to orted and closes the socket/pipe connected to
>>>> orted.
>>>> On orted, the loss of connection is generally processed before the
>>>> SIGCHLD by libevent,
>>>> and as a consequence the exit code is not correctly set (i.e. it is
>>>> left at zero).
>>>> I did not see any kind of communication between the MPI task and orted
>>>> (except writing a file in the openmpi session directory), as I would have
>>>> expected.
>>>> /* but this was just my initial guess, the truth is I do not know what
>>>> is supposed to happen */
>>>>
>>>> I wrote the attached abort.patch to basically get it working.
>>>> I highly suspect this is not the right thing to do, so I did not commit it.
>>>>
>>>> It works fine with two tasks or more.
>>>> With only one task, mpirun displays a misleading error message but the
>>>> exit status is ok.
>>>>
>>>> Could someone (Ralph ?) have a look at this?
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
>>>>
>>>>
>>>> node0 $ mpirun --mca btl tcp,self -host node1 -np 2 ./abort
>>>> I am 1/2 and i abort
>>>> I am 0/2 and i abort
>>>> --------------------------------------------------------------------------
>>>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>>>> with errorcode 2.
>>>>
>>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>>> You may or may not see output from other processes, depending on
>>>> exactly when Open MPI kills them.
>>>> --------------------------------------------------------------------------
>>>> [node0:00920] 1 more process has sent help message help-mpi-api.txt /
>>>> mpi-abort
>>>> [node0:00920] Set MCA parameter "orte_base_help_aggregate" to 0 to see
>>>> all help / error messages
>>>> node0 $ echo $?
>>>> 2
>>>>
>>>>
>>>>
>>>> node0 $ mpirun --mca btl tcp,self -host node1 -np 1 ./abort
>>>> I am 0/1 and i abort
>>>> --------------------------------------------------------------------------
>>>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>>>> with errorcode 2.
>>>>
>>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>>> You may or may not see output from other processes, depending on
>>>> exactly when Open MPI kills them.
>>>> --------------------------------------------------------------------------
>>>> -------------------------------------------------------
>>>> Primary job terminated normally, but 1 process returned
>>>> a non-zero exit code.. Per user-direction, the job has been aborted.
>>>> -------------------------------------------------------
>>>> --------------------------------------------------------------------------
>>>> mpirun detected that one or more processes exited with non-zero status,
>>>> thus causing the job to be terminated.
>>>> The first process to do so was:
>>>>
>>>>   Process name: [[7955,1],0]
>>>>   Exit code:    2
>>>> --------------------------------------------------------------------------
>>>> node0 $ echo $?
>>>> 2
>>>>
>>>>
>>>>
>>>> <abort.patch>
>>>> _______________________________________________
>>>> devel mailing list
>>>> [email protected]
>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>> Link to this post:
>>>> http://www.open-mpi.org/community/lists/devel/2014/08/15666.php
>>> _______________________________________________
>>> devel mailing list
>>> [email protected]
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/devel/2014/08/15672.php
>>
>> <pmix.1.patch><pmix.2.patch>
>> _______________________________________________
>> devel mailing list
>> [email protected]
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2014/08/15689.php
>
> _______________________________________________
> devel mailing list
> [email protected]
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/08/15692.php
