Rolf,

I faced a slightly different problem, but it is 100% reproducible:
- I launch mpirun (no batch manager) from a node with one IB port
- I use -host node01,node02, where node01 and node02 both have two IB ports
  on the same subnet

By default, this hangs.
If this is a "feature" (i.e. Open MPI does not support this kind of
configuration), I am fine with it.
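
For reference, the hanging launch is basically just the following (the
binary name and process count are placeholders):

  mpirun -host node01,node02 -np 2 ./a.out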

When I run mpirun --mca btl_openib_if_exclude mlx4_1 and the application
succeeds, then everything works just fine.

If the application calls MPI_Abort() /* and even if all tasks call
MPI_Abort() */, then it hangs 100% of the time.
I do not see that as a feature but as a bug.
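
For what it is worth, a trivial program like this (an untested sketch, just
to illustrate the pattern) should be enough to reproduce it:

  #include <mpi.h>

  int main(int argc, char *argv[])
  {
      MPI_Init(&argc, &argv);
      /* every task aborts; mpirun should report the error and exit,
         but with the openib btl it hangs instead */
      MPI_Abort(MPI_COMM_WORLD, 1);
      return 0;
  }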

In another thread, Jeff mentioned that the usnic btl does some work even
when there is no usnic hardware (this will be fixed shortly).
Do you still see the intermittent hang without listing usnic as a btl?
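(For example, something like "mpirun --mca btl ^usnic ..." should keep the
usnic btl out of the picture.)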

Cheers,

Gilles



On Fri, May 30, 2014 at 12:11 AM, Rolf vandeVaart <rvandeva...@nvidia.com>
wrote:

> Ralph:
>
> I am seeing cases where mpirun seems to hang when one of the applications
> exits with non-zero.  For example, the intel test MPI_Cart_get_c will exit
> that way if there are not enough processes to run the test.  In most cases,
> mpirun seems to return fine with the error code, but sometimes it just
> hangs.  I first started noticing this in my mtt runs.  It seems (though not
> conclusive) that I see this when both the usnic and openib are built, even
> though I am only using the openib (as I have no usnic hardware).
>
>
>
> Anyone else seeing something like this?  Note that I see this on both 1.8
> and trunk, but I show trunk here.
>
