Hi,

I have hit an issue in which orted hangs during the finalization of a job. It reproduces when running 80 ranks per node (yes, oversubscribed) on a 4-node cluster running SLES12 with OMPI 1.10.2 (I also see it with 1.10.0). It is independent of the binary used (I tried a very simple program that initializes the ranks, does nothing, and finalizes), and the hang happens after MPI_Finalize(). It is not deterministic and takes a few attempts to reproduce.

When the hang occurs, the mpirun process never gets to wait() on its children (see (1) below) and is stuck in a poll() (see (2) below). I logged in to the other nodes and found that the orted processes there are also blocked, in a different poll() (see (3) below). If I attach gdb to mpirun and let it continue, the execution finishes with no issues. The same happens if I send SIGSTOP followed by SIGCONT to the hung mpirun.
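For anyone who wants to script the SIGSTOP/SIGCONT workaround instead of typing the kill commands by hand, here is a minimal sketch in Python. It assumes a POSIX system and permission to signal the target process. The helper name `nudge` and the `pause` parameter are illustrative only and are not part of Open MPI. This only mimics the manual workaround; it does not explain or fix the underlying hang.

```python
import os
import signal
import time


def nudge(pid: int, pause: float = 0.1) -> None:
    """Stop a process and resume it shortly afterwards.

    This mimics sending SIGSTOP followed by SIGCONT by hand.
    `pid` is the process to signal; `pause` is how long (in seconds)
    to leave it stopped before resuming. Both names are hypothetical.
    """
    os.kill(pid, signal.SIGSTOP)   # suspend the process
    time.sleep(pause)              # leave it stopped briefly
    os.kill(pid, signal.SIGCONT)   # resume it
```

With the process listing in (1), the call would be `nudge(164356)`, where 164356 is the PID of the hung mpirun.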
(1)
root 164356 161186 0 16:50 pts/0 00:00:00 mpirun -np 320 --allow-run-as-root -machinefile ./user/hostfile /scratch/user/osu_multi_lat
root 164358 164356 0 16:50 pts/0 00:00:00 /usr/bin/ssh -x n3 PATH=/scratch/user/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/scratch/user/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD
root 164359 164356 0 16:50 pts/0 00:00:00 /usr/bin/ssh -x n2 PATH=/scratch/user/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/scratch/user/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD
root 164361 164356 0 16:50 pts/0 00:00:06 [osu_multi_lat] <defunct>
root 164362 164356 0 16:50 pts/0 00:00:06 [osu_multi_lat] <defunct>
root 164364 164356 0 16:50 pts/0 00:00:06 [osu_multi_lat] <defunct>
root 164365 164356 0 16:50 pts/0 00:00:06 [osu_multi_lat] <defunct>
root 164366 164356 0 16:50 pts/0 00:00:06 [osu_multi_lat] <defunct>
root 164367 164356 0 16:50 pts/0 00:00:06 [osu_multi_lat] <defunct>
root 164370 164356 0 16:50 pts/0 00:00:06 [osu_multi_lat] <defunct>
root 164372 164356 0 16:50 pts/0 00:00:06 [osu_multi_lat] <defunct>
root 164374 164356 0 16:50 pts/0 00:00:06 [osu_multi_lat] <defunct>
root 164375 164356 0 16:50 pts/0 00:00:06 [osu_multi_lat] <defunct>
root 164378 164356 0 16:50 pts/0 00:00:06 [osu_multi_lat] <defunct>
root 164379 164356 0 16:50 pts/0 00:00:06 [osu_multi_lat] <defunct>
....

(2)
gdb -p 164356
...
Missing separate debuginfos, use: zypper install glibc-debuginfo-2.19-17.72.x86_64
(gdb) bt
#0 0x00007f143177a3cd in poll () from /lib64/libc.so.6
#1 0x00007f14325e0636 in poll_dispatch () from /scratch/user/lib/libopen-pal.so.13
#2 0x00007f14325d77bf in opal_libevent2021_event_base_loop () from /scratch/user/lib/libopen-pal.so.13
#3 0x00000000004051cd in orterun (argc=7, argv=0x7fff8c4bb428) at orterun.c:1133
#4 0x0000000000403a8d in main (argc=7, argv=0x7fff8c4bb428) at main.c:13

(3) (remote nodes orted)
(gdb) bt
#0 0x00007f8c288d33b0 in __poll_nocancel () from /lib64/libc.so.6
#1 0x00007f8c29941186 in poll_dispatch () /scratch/user/lib/libopen-pal.so.13
#2 0x00007f8c2993830f in opal_libevent2021_event_base_loop () from /scratch/user/lib/libopen-pal.so.13
#3 0x00007f8c29be75c4 in orte_daemon () from /scratch/user/lib/libopen-rte.so.12
#4 0x0000000000400827 in main ()

Thanks,
_MAC