Matias: It looks like this is a duplicate of an already-reported issue (although with significantly more detail): https://github.com/open-mpi/ompi/issues/1136
> On Feb 3, 2016, at 8:40 PM, Cabral, Matias A <matias.a.cab...@intel.com> wrote:
>
> Hi,
>
> I have hit an issue in which orted hangs during the finalization of a job. This is reproduced by running 80 ranks per node (yes, oversubscribed) on a 4-node cluster running SLES12 with OMPI 1.10.2 (I also see it with 1.10.0). I found that it is independent of the binary used (I used a very simple sample that inits the ranks, does nothing, and finalizes) and also saw that the hang happens after MPI_Finalize(). It is not a deterministic issue and takes a few attempts to reproduce. When the hang occurs, the mpirun process never gets to wait() on its children (see (1) below) and is stuck in a poll() (see (2) below). I logged into the other nodes and found that all the “other” orted processes are also held in a different poll() (see (3) below). I found that after attaching gdb to mpirun and letting it continue, the execution finishes with no issues. The same happens when sending SIGSTOP and then SIGCONT to the hung mpirun.
>
> (1)
> root 164356 161186 0 16:50 pts/0 00:00:00 mpirun -np 320 --allow-run-as-root -machinefile ./user/hostfile /scratch/user/osu_multi_lat
> root 164358 164356 0 16:50 pts/0 00:00:00 /usr/bin/ssh -x n3 PATH=/scratch/user/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/scratch/user/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD
> root 164359 164356 0 16:50 pts/0 00:00:00 /usr/bin/ssh -x n2 PATH=/scratch/user/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/scratch/user/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD
> root 164361 164356 0 16:50 pts/0 00:00:06 [osu_multi_lat] <defunct>
> root 164362 164356 0 16:50 pts/0 00:00:06 [osu_multi_lat] <defunct>
> root 164364 164356 0 16:50 pts/0 00:00:06 [osu_multi_lat] <defunct>
> root 164365 164356 0 16:50 pts/0 00:00:06 [osu_multi_lat] <defunct>
> root 164366 164356 0 16:50 pts/0 00:00:06 [osu_multi_lat] <defunct>
> root 164367 164356 0 16:50 pts/0 00:00:06 [osu_multi_lat] <defunct>
> root 164370 164356 0 16:50 pts/0 00:00:06 [osu_multi_lat] <defunct>
> root 164372 164356 0 16:50 pts/0 00:00:06 [osu_multi_lat] <defunct>
> root 164374 164356 0 16:50 pts/0 00:00:06 [osu_multi_lat] <defunct>
> root 164375 164356 0 16:50 pts/0 00:00:06 [osu_multi_lat] <defunct>
> root 164378 164356 0 16:50 pts/0 00:00:06 [osu_multi_lat] <defunct>
> root 164379 164356 0 16:50 pts/0 00:00:06 [osu_multi_lat] <defunct>
> ….
>
> (2)
> gdb -p 164356
> …
> Missing separate debuginfos, use: zypper install glibc-debuginfo-2.19-17.72.x86_64
> (gdb) bt
> #0  0x00007f143177a3cd in poll () from /lib64/libc.so.6
> #1  0x00007f14325e0636 in poll_dispatch () from /scratch/user/lib/libopen-pal.so.13
> #2  0x00007f14325d77bf in opal_libevent2021_event_base_loop () from /scratch/user/lib/libopen-pal.so.13
> #3  0x00000000004051cd in orterun (argc=7, argv=0x7fff8c4bb428) at orterun.c:1133
> #4  0x0000000000403a8d in main (argc=7, argv=0x7fff8c4bb428) at main.c:13
>
> (3) (orted on the remote nodes)
> (gdb) bt
> #0  0x00007f8c288d33b0 in __poll_nocancel () from /lib64/libc.so.6
> #1  0x00007f8c29941186 in poll_dispatch () from /scratch/user/lib/libopen-pal.so.13
> #2  0x00007f8c2993830f in opal_libevent2021_event_base_loop () from /scratch/user/lib/libopen-pal.so.13
> #3  0x00007f8c29be75c4 in orte_daemon () from /scratch/user/lib/libopen-rte.so.12
> #4  0x0000000000400827 in main ()
>
> Thanks,
>
> _MAC
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: http://www.open-mpi.org/community/lists/devel/2016/02/18542.php

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
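
For reference, a minimal sketch of the kind of init/do-nothing/finalize sample described in the quoted report might look like the code below. This is an assumption: the actual test program is not attached to the post, and the file name repro.c is hypothetical. Built with mpicc and launched heavily oversubscribed (e.g. roughly the -np 320 / -machinefile invocation visible in the ps output under (1)), any hang that follows is in the mpirun/orted teardown rather than in application code.

/* repro.c (hypothetical name): every rank initializes, synchronizes once,
 * does no real work, and finalizes. The hang described above is reported
 * to occur after MPI_Finalize(), during the mpirun/orted shutdown. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank = -1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* One barrier so all ranks are known to have entered MPI before exiting. */
    MPI_Barrier(MPI_COMM_WORLD);

    MPI_Finalize();

    if (rank == 0) {
        printf("all ranks finalized\n");
    }
    return 0;
}

If a run does hang, the report above notes that attaching gdb to mpirun and continuing, or sending SIGSTOP followed by SIGCONT to the hung mpirun, lets the shutdown complete; both actions interrupt the poll() calls shown in backtraces (2) and (3).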