I think these are fixed now - at least, your test cases all pass for me

On Aug 22, 2014, at 9:12 AM, Ralph Castain <r...@open-mpi.org> wrote:

> 
> On Aug 22, 2014, at 9:06 AM, Gilles Gouaillardet 
> <gilles.gouaillar...@gmail.com> wrote:
> 
>> Ralph,
>> 
>> Will do on Monday
>> 
>> About the first test, in my case echo $? returns 0
> 
> My "showcode" is just an alias for echo $?
> 
>> I noticed this confusing message in your output:
>> mpirun noticed that process rank 0 with PID 24382 on node bend002 exited on 
>> signal 0 (Unknown signal 0).
> 
> I'll take a look at why that happened
> 
>> 
>> About the second test, please note my test program ends with return 3;
>> whereas your mpi_no_op.c ends with return 0;
> 
> I didn't see that little cuteness - sigh
> 
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> Ralph Castain <r...@open-mpi.org> wrote:
>> You might want to try again with current head of trunk as something seems 
>> off in what you are seeing - more below
>> 
>> 
>> On Aug 22, 2014, at 3:12 AM, Gilles Gouaillardet 
>> <gilles.gouaillar...@iferc.org> wrote:
>> 
>>> Ralph,
>>> 
>>> i tried again after the merge and found the same behaviour, though the
>>> internals are very different.
>>> 
>>> i run without any batch manager
>>> 
>>> from node0:
>>> mpirun -np 1 --mca btl tcp,self -host node1 ./abort
>>> 
>>> it exits with exit code zero :-(
>> 
>> Hmmm...it works fine for me, without your patch:
>> 
>> 07:35:41  $ mpirun -n 1 -mca btl tcp,self -host bend002 ./abort
>> Hello, World, I am 0 of 1
>> --------------------------------------------------------------------------
>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD 
>> with errorcode 2.
>> 
>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>> You may or may not see output from other processes, depending on
>> exactly when Open MPI kills them.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> mpirun noticed that process rank 0 with PID 24382 on node bend002 exited on 
>> signal 0 (Unknown signal 0).
>> --------------------------------------------------------------------------
>> 07:35:56  $ showcode
>> 130
>> 
>>> 
>>> short story: i applied pmix.2.patch and that fixed my problem.
>>> could you please review it?
>>> 
>>> long story:
>>> i initially applied pmix.1.patch and it solved my problem.
>>> then i ran
>>> mpirun -np 1 --mca btl openib,self -host node1 ./abort
>>> and i came back to square one: the exit code was zero.
>>> so i used the debugger and was unable to reproduce the issue
>>> (one more race condition, yeah!)
>>> finally, i wrote pmix.2.patch, which fixed my issue, and realized that
>>> pmix.1.patch was no longer needed.
>>> currently, and assuming pmix.2.patch is correct, i cannot tell whether
>>> pmix.1.patch is needed or not,
>>> since this part of the code is no longer executed.
>>> 
>>> i also found one hang with the following trivial program within one node:
>>>
>>> #include <mpi.h>
>>>
>>> int main (int argc, char *argv[]) {
>>>     MPI_Init(&argc, &argv);
>>>     MPI_Finalize();
>>>     return 3;
>>> }
>>> 
>>> from node0 :
>>> $ mpirun -np 1 ./test
>>> -------------------------------------------------------
>>> Primary job  terminated normally, but 1 process returned
>>> a non-zero exit code.. Per user-direction, the job has been aborted.
>>> -------------------------------------------------------
>>> 
>>> AND THE PROGRAM HANGS
>> 
>> This also works fine for me:
>> 
>> 07:37:27  $ mpirun -n 1 ./mpi_no_op
>> 07:37:36  $ cat mpi_no_op.c
>> /* -*- C -*-
>>  *
>>  * $HEADER$
>>  *
>>  * The most basic of MPI applications
>>  */
>> 
>> #include <stdio.h>
>> #include "mpi.h"
>> 
>> int main(int argc, char* argv[])
>> {
>>     MPI_Init(&argc, &argv);
>> 
>>     MPI_Finalize();
>>     return 0;
>> }
>> 
>> 
>>> 
>>> *but*
>>> $ mpirun -np 1 -host node1 ./test
>>> -------------------------------------------------------
>>> Primary job  terminated normally, but 1 process returned
>>> a non-zero exit code.. Per user-direction, the job has been aborted.
>>> -------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> mpirun detected that one or more processes exited with non-zero status,
>>> thus causing
>>> the job to be terminated. The first process to do so was:
>>> 
>>>  Process name: [[22080,1],0]
>>>  Exit code:    3
>>> --------------------------------------------------------------------------
>>> 
>>> it returns with exit code 3.
>> 
>> Likewise here - works just fine for me
>> 
>> 
>>> 
>>> then i found a strange behaviour with helloworld if only the self btl is
>>> used:
>>> $ mpirun -np 1 --mca btl self ./hw
>>> [helios91:23319] OPAL dss:unpack: got type 12 when expecting type 3
>>> [helios91:23319] [[22303,0],0] ORTE_ERROR_LOG: Pack data mismatch in
>>> file ../../../src/ompi-trunk/orte/orted/pmix/pmix_server_sendrecv.c at
>>> line 722
>>> 
>>> the program returns with exit code zero, but displays an error message.
>>> 
>>> Cheers,
>>> 
>>> Gilles
>>> 
>>> On 2014/08/21 6:21, Ralph Castain wrote:
>>>> I'm aware of the problem, but it will be fixed when the PMIx branch is 
>>>> merged later this week.
>>>> 
>>>> On Aug 19, 2014, at 10:00 PM, Gilles Gouaillardet 
>>>> <gilles.gouaillar...@iferc.org> wrote:
>>>> 
>>>>> Folks,
>>>>> 
>>>>> let's look at the following trivial test program :
>>>>> 
>>>>> #include <mpi.h>
>>>>> #include <stdio.h>
>>>>> 
>>>>> int main (int argc, char * argv[]) {
>>>>>   int rank, size;
>>>>>   MPI_Init(&argc, &argv);
>>>>>   MPI_Comm_size(MPI_COMM_WORLD, &size);
>>>>>   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>>>   printf ("I am %d/%d and i abort\n", rank, size);
>>>>>   MPI_Abort(MPI_COMM_WORLD, 2);
>>>>>   printf ("%d/%d aborted !\n", rank, size);
>>>>>   return 3;
>>>>> }
>>>>> 
>>>>> and let's run mpirun (trunk) on node0 and ask the mpi tasks to run on
>>>>> node1.
>>>>> with two tasks or more:
>>>>> 
>>>>> node0 $ mpirun --mca btl tcp,self -host node1 -np 2 ./abort
>>>>> --------------------------------------------------------------------------
>>>>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>>>>> with errorcode 2.
>>>>> 
>>>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>>>> You may or may not see output from other processes, depending on
>>>>> exactly when Open MPI kills them.
>>>>> --------------------------------------------------------------------------
>>>>> I am 1/2 and i abort
>>>>> I am 0/2 and i abort
>>>>> [node0:00740] 1 more process has sent help message help-mpi-api.txt /
>>>>> mpi-abort
>>>>> [node0:00740] Set MCA parameter "orte_base_help_aggregate" to 0 to see
>>>>> all help / error messages
>>>>> 
>>>>> node0 $ echo $?
>>>>> 0
>>>>> 
>>>>> the exit status of mpirun is zero
>>>>> /* this is why the MPI_Errhandler_fatal_c test fails in mtt */
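The way this breaks a test harness can be pictured with a small shell sketch; `run_abort_test` is hypothetical, not mtt's actual code, and `sh -c 'exit N'` stands in for an mpirun invocation of ./abort:

```shell
# A test that calls MPI_Abort(comm, 2) should make mpirun exit non-zero,
# so a zero exit status from the launcher is itself a failure.
run_abort_test() {
    "$@"
    rc=$?
    if [ "$rc" -eq 0 ]; then
        echo "FAIL: abort test returned 0"
    else
        echo "PASS: abort test returned $rc"
    fi
}

run_abort_test sh -c 'exit 2'    # expected behaviour
run_abort_test sh -c 'exit 0'    # the bug reported in this thread
```

With mpirun returning 0 after an abort, every such check reports a pass-as-failure inversion, which is why MPI_Errhandler_fatal_c shows up as failed in mtt.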
>>>>> 
>>>>> now if we run only one task:
>>>>> 
>>>>> node0 $ mpirun --mca btl tcp,self -host node1 -np 1 ./abort
>>>>> I am 0/1 and i abort
>>>>> --------------------------------------------------------------------------
>>>>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>>>>> with errorcode 2.
>>>>> 
>>>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>>>> You may or may not see output from other processes, depending on
>>>>> exactly when Open MPI kills them.
>>>>> --------------------------------------------------------------------------
>>>>> --------------------------------------------------------------------------
>>>>> mpirun has exited due to process rank 0 with PID 15884 on
>>>>> node node1 exiting improperly. There are three reasons this could occur:
>>>>> 
>>>>> 1. this process did not call "init" before exiting, but others in
>>>>> the job did. This can cause a job to hang indefinitely while it waits
>>>>> for all processes to call "init". By rule, if one process calls "init",
>>>>> then ALL processes must call "init" prior to termination.
>>>>> 
>>>>> 2. this process called "init", but exited without calling "finalize".
>>>>> By rule, all processes that call "init" MUST call "finalize" prior to
>>>>> exiting or it will be considered an "abnormal termination"
>>>>> 
>>>>> 3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
>>>>> orte_create_session_dirs is set to false. In this case, the run-time 
>>>>> cannot
>>>>> detect that the abort call was an abnormal termination. Hence, the only
>>>>> error message you will receive is this one.
>>>>> 
>>>>> This may have caused other processes in the application to be
>>>>> terminated by signals sent by mpirun (as reported here).
>>>>> 
>>>>> You can avoid this message by specifying -quiet on the mpirun command 
>>>>> line.
>>>>> 
>>>>> --------------------------------------------------------------------------
>>>>> node0 $ echo $?
>>>>> 1
>>>>> 
>>>>> the program displayed a misleading error message and mpirun exited with
>>>>> error code 1
>>>>> /* i would have expected 2, or 3 in the worst case scenario */
>>>>> 
>>>>> 
>>>>> i dug into it a bit and found a kind of race condition in orted (running
>>>>> on node1).
>>>>> basically, when the process dies, it writes stuff in the openmpi session
>>>>> directory and exits.
>>>>> exiting sends a SIGCHLD to orted and closes the socket/pipe connected to
>>>>> orted.
>>>>> on orted, the loss of connection is generally processed by libevent
>>>>> before the SIGCHLD,
>>>>> and as a consequence, the exit code is not correctly set (i.e. it is
>>>>> left at zero).
>>>>> i did not see any kind of communication between the mpi task and orted
>>>>> (except writing a file in the openmpi session directory) as i would have
>>>>> expected
>>>>> /* but this was just my initial guess, the truth is i do not know what
>>>>> is supposed to happen */
>>>>> 
>>>>> i wrote the attached abort.patch to basically get it working.
>>>>> i highly suspect this is not the right thing to do, so i did not commit it.
>>>>>
>>>>> it works fine with two tasks or more.
>>>>> with only one task, mpirun displays a misleading error message but the
>>>>> exit status is ok.
>>>>> 
>>>>> could someone (Ralph ?) have a look at this ?
>>>>> 
>>>>> Cheers,
>>>>> 
>>>>> Gilles
>>>>> 
>>>>> 
>>>>> node0 $ mpirun --mca btl tcp,self -host node1 -np 2 ./abort
>>>>> I am 1/2 and i abort
>>>>> I am 0/2 and i abort
>>>>> --------------------------------------------------------------------------
>>>>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>>>>> with errorcode 2.
>>>>> 
>>>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>>>> You may or may not see output from other processes, depending on
>>>>> exactly when Open MPI kills them.
>>>>> --------------------------------------------------------------------------
>>>>> [node0:00920] 1 more process has sent help message help-mpi-api.txt /
>>>>> mpi-abort
>>>>> [node0:00920] Set MCA parameter "orte_base_help_aggregate" to 0 to see
>>>>> all help / error messages
>>>>> node0 $ echo $?
>>>>> 2
>>>>> 
>>>>> 
>>>>> 
>>>>> node0 $ mpirun --mca btl tcp,self -host node1 -np 1 ./abort
>>>>> I am 0/1 and i abort
>>>>> --------------------------------------------------------------------------
>>>>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>>>>> with errorcode 2.
>>>>> 
>>>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>>>> You may or may not see output from other processes, depending on
>>>>> exactly when Open MPI kills them.
>>>>> --------------------------------------------------------------------------
>>>>> -------------------------------------------------------
>>>>> Primary job  terminated normally, but 1 process returned
>>>>> a non-zero exit code.. Per user-direction, the job has been aborted.
>>>>> -------------------------------------------------------
>>>>> --------------------------------------------------------------------------
>>>>> mpirun detected that one or more processes exited with non-zero status,
>>>>> thus causing
>>>>> the job to be terminated. The first process to do so was:
>>>>> 
>>>>> Process name: [[7955,1],0]
>>>>> Exit code:    2
>>>>> --------------------------------------------------------------------------
>>>>> node0 $ echo $?
>>>>> 2
>>>>> 
>>>>> 
>>>>> 
>>>>> <abort.patch>_______________________________________________
>>>>> devel mailing list
>>>>> de...@open-mpi.org
>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>> Link to this post: 
>>>>> http://www.open-mpi.org/community/lists/devel/2014/08/15666.php
>>>> _______________________________________________
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>> Link to this post: 
>>>> http://www.open-mpi.org/community/lists/devel/2014/08/15672.php
>>> 
>>> <pmix.1.patch><pmix.2.patch>_______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> Link to this post: 
>>> http://www.open-mpi.org/community/lists/devel/2014/08/15689.php
>> 
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/devel/2014/08/15692.php
> 
