Ralph:
I am seeing cases where mpirun seems to hang when one of the application
processes exits with a non-zero status.  For example, the Intel test
MPI_Cart_get_c exits that way if there are not enough processes to run the
test.  In most cases, mpirun returns fine with the error code, but sometimes
it just hangs.  I first noticed this in my MTT runs.  It seems (though this is
not conclusive) that I see it when both the usnic and openib BTLs are built,
even though I am only using openib (as I have no usnic hardware).
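
Since the hang is intermittent, one way to measure how often it happens is to
rerun the same command under a time limit and count the runs that never
return.  The following is only a sketch: count_hangs is a hypothetical helper
name, and it assumes GNU coreutils timeout is installed, which exits with
status 124 when the time limit expires.

```shell
#!/bin/sh
# Hypothetical helper (not part of Open MPI or MTT).
# Usage: count_hangs LIMIT_SECONDS RUNS COMMAND [ARGS...]
# Runs COMMAND the given number of times under a time limit and prints how
# many runs had to be killed because they did not return in time.
# Assumes GNU coreutils `timeout`, which exits with 124 on expiry.
count_hangs() {
    limit=$1
    runs=$2
    shift 2
    hangs=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # Discard the command's own output; only the exit status matters.
        timeout "$limit" "$@" >/dev/null 2>&1
        # 124 means the limit expired. Any other status, including the
        # test's own non-zero exit code, means the command returned.
        if [ "$?" -eq 124 ]; then
            hangs=$((hangs + 1))
        fi
        i=$((i + 1))
    done
    echo "$hangs"
}
```

For example, "count_hangs 60 20 mpirun ... -np 4 MPI_Cart_get_c" with the
same mpirun arguments as in the runs below would print how many of 20 runs
exceeded 60 seconds.  A run that returns exit code 77 is not counted as a
hang.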

Anyone else seeing something like this?  Note that I see this on both 1.8 and 
trunk, but I show trunk here.


PASS:
[rvandevaart@drossetti-ivy0 src]$ mpirun --mca btl self,sm,usnic,openib --host drossetti-ivy0,drossetti-ivy0,drossetti-ivy1,drossetti-ivy1 -np 4 --mca btl_openib_warn_default_gid_prefix 0 MPI_Cart_get_c
MPITEST skip (1): WARNING --  nodes =   4   Need   6 nodes to run test
MPITEST info  (0): Starting MPI_Cart_get  test
MPITEST skip (0): WARNING --  nodes =   4   Need   6 nodes to run test
MPITEST skip (3): WARNING --  nodes =   4   Need   6 nodes to run test
MPITEST skip (2): WARNING --  nodes =   4   Need   6 nodes to run test
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned a non-zero exit code.. 
Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus 
causing the job to be terminated. The first process to do so was:

  Process name: [[45854,1],1]
  Exit code:    77
--------------------------------------------------------------------------

FAIL:
[rvandevaart@drossetti-ivy0 src]$ mpirun --mca btl self,sm,usnic,openib --host drossetti-ivy0,drossetti-ivy0,drossetti-ivy1,drossetti-ivy1 -np 4 --mca btl_openib_warn_default_gid_prefix 0 MPI_Cart_get_c
MPITEST skip (1): WARNING --  nodes =   4   Need   6 nodes to run test
MPITEST info  (0): Starting MPI_Cart_get  test
MPITEST skip (0): WARNING --  nodes =   4   Need   6 nodes to run test
MPITEST skip (3): WARNING --  nodes =   4   Need   6 nodes to run test
MPITEST skip (2): WARNING --  nodes =   4   Need   6 nodes to run test
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned a non-zero exit code.. 
Per user-direction, the job has been aborted.
-------------------------------------------------------
[...now we are hung...]

LOCAL mpirun:
[rvandevaart@drossetti-ivy0 64-mtt-nocuda]$ pstack 27705
Thread 2 (Thread 0x7fe0c8c47700 (LWP 27706)):
#0  0x00007fe0ca578533 in select () from /lib64/libc.so.6
#1  0x00007fe0c8c5591e in listen_thread () from /geppetto/home/rvandevaart/ompi/ompi-trunk-reduction-new/64-mtt-nocuda/lib/openmpi/mca_oob_tcp.so
#2  0x00007fe0ca831851 in start_thread () from /lib64/libpthread.so.0
#3  0x00007fe0ca57f94d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7fe0cbcdd700 (LWP 27705)):
#0  0x00007fe0ca576293 in poll () from /lib64/libc.so.6
#1  0x00007fe0cb589575 in poll_dispatch () from /geppetto/home/rvandevaart/ompi/ompi-trunk-reduction-new/64-mtt-nocuda/lib/libopen-pal.so.0
#2  0x00007fe0cb57df8c in opal_libevent2021_event_base_loop () from /geppetto/home/rvandevaart/ompi/ompi-trunk-reduction-new/64-mtt-nocuda/lib/libopen-pal.so.0
#3  0x0000000000405572 in orterun ()
#4  0x0000000000403904 in main ()
[rvandevaart@drossetti-ivy0 64-mtt-nocuda]$

REMOTE ORTED:
[rvandevaart@drossetti-ivy1 ~]$ pstack 10241
#0  0x00007fbdcba7c258 in poll () from /lib64/libc.so.6
#1  0x00007fbdcca8f575 in poll_dispatch () from /geppetto/home/rvandevaart/ompi/ompi-trunk-reduction-new/64-mtt-nocuda/lib/libopen-pal.so.0
#2  0x00007fbdcca83f8c in opal_libevent2021_event_base_loop () from /geppetto/home/rvandevaart/ompi/ompi-trunk-reduction-new/64-mtt-nocuda/lib/libopen-pal.so.0
#3  0x00007fbdccd572cc in orte_daemon () from /geppetto/home/rvandevaart/ompi/ompi-trunk-reduction-new/64-mtt-nocuda/lib/libopen-rte.so.0
#4  0x000000000040094a in main ()
[rvandevaart@drossetti-ivy1 ~]$

