Ralph,

I noticed several hangs in MTT with the v1.8 branch.

A simple way to reproduce the hang is to use the MPI_Errhandler_fatal_f test
from the intel_tests suite:
invoke mpirun on one node and run the tasks on another node:

node0$ mpirun -np 3 -host node1 --mca btl tcp,self ./MPI_Errhandler_fatal_f

/* Since this is a race condition, you might need to run this in a loop
in order to hit the bug. */
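
A loop like the one mentioned above could look like the following sketch. The helper name run_until_hang, the run count and the time limit are my own assumptions, not part of the test suite; it relies on timeout(1) from GNU coreutils, which exits with status 124 when the time limit expires. A non-zero exit status from the command is not treated as a failure, since this test is expected to abort.

```shell
# Hypothetical helper: rerun a command until one run exceeds a time limit.
# usage: run_until_hang <max_runs> <limit_seconds> <command...>
run_until_hang() {
    max_runs=$1
    limit=$2
    shift 2
    i=1
    while [ "$i" -le "$max_runs" ]; do
        # timeout(1) kills the command and exits with 124 if the limit is hit
        timeout "$limit" "$@" >/dev/null 2>&1
        rc=$?
        if [ "$rc" -eq 124 ]; then
            echo "run $i: hang (no exit within ${limit}s)"
            return 1
        fi
        i=$((i + 1))
    done
    echo "no hang in $max_runs runs"
    return 0
}
```

For example (100 runs and a 60 second limit are arbitrary choices):

node0$ run_until_hang 100 60 mpirun -np 3 -host node1 --mca btl tcp,self ./MPI_Errhandler_fatal_f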

The attached tarball contains a patch (added debug output plus a temporary
hack) and some log files obtained with
--mca errmgr_base_verbose 100 --mca odls_base_verbose 100

Without the hack, I can reproduce the bug with -np 3 (log.ko.txt). With
the hack, I can still reproduce a hang (though it might
be a different one) with -np 16 (log.ko.2.txt).

I remember that some similar hangs were fixed on the trunk/master a few
months ago.
I tried to backport some commits, but it did not help :-(

Could you please have a look at this?

Cheers,

Gilles

Attachment: abort_hang.tar.gz
Description: application/gzip
