Hi,

I have hit an issue in which orted hangs during the finalization of a job. It 
is reproduced by running 80 ranks per node (yes, oversubscribed) on a 4-node 
cluster that runs SLES12 with OMPI 1.10.2 (I also see it with 1.10.0). It is 
independent of the binary used (I tried a very simple sample that initializes 
the ranks, does nothing and finalizes), and I also saw that the hang happens 
after MPI_Finalize(). The issue is not deterministic and takes a few attempts 
to reproduce. When the hang occurs, the mpirun process never gets to wait() 
for its children (see (1) below) and is stuck in a poll() (see (2) below). I 
logged in to the other nodes and found that all the "other" orted processes 
are also held in a poll(), though a different one (see (3) below). After 
attaching gdb to mpirun and letting it continue, the execution finishes with 
no issues. The same happens when sending SIGSTOP followed by SIGCONT to the 
hung mpirun.
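
The SIGSTOP/SIGCONT step mentioned above can be scripted along these lines. 
This is only a sketch: nudge_pid is a made-up helper name, the one-second 
pause is arbitrary, and the pid is whatever ps reports for the hung mpirun.

```shell
# Sketch: stop and then resume a process by pid.
# nudge_pid is a hypothetical helper, not part of OMPI.
nudge_pid() {
    pid="$1"
    # Suspend the process; give up if the pid does not exist.
    kill -s STOP "$pid" || return 1
    # Arbitrary short pause before resuming.
    sleep 1
    # Resume the process.
    kill -s CONT "$pid"
}

# Example, using the mpirun pid from the ps output in (1):
#   nudge_pid 164356
```

The function only sends the two signals; it does not check whether mpirun 
actually exits afterwards.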

(1)
root     164356 161186  0 16:50 pts/0    00:00:00 mpirun -np 320 --allow-run-as-root -machinefile ./user/hostfile /scratch/user/osu_multi_lat
root     164358 164356  0 16:50 pts/0    00:00:00 /usr/bin/ssh -x n3     PATH=/scratch/user/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/scratch/user/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD
root     164359 164356  0 16:50 pts/0    00:00:00 /usr/bin/ssh -x n2     PATH=/scratch/user/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/scratch/user/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD
root     164361 164356  0 16:50 pts/0    00:00:06 [osu_multi_lat] <defunct>
root     164362 164356  0 16:50 pts/0    00:00:06 [osu_multi_lat] <defunct>
root     164364 164356  0 16:50 pts/0    00:00:06 [osu_multi_lat] <defunct>
root     164365 164356  0 16:50 pts/0    00:00:06 [osu_multi_lat] <defunct>
root     164366 164356  0 16:50 pts/0    00:00:06 [osu_multi_lat] <defunct>
root     164367 164356  0 16:50 pts/0    00:00:06 [osu_multi_lat] <defunct>
root     164370 164356  0 16:50 pts/0    00:00:06 [osu_multi_lat] <defunct>
root     164372 164356  0 16:50 pts/0    00:00:06 [osu_multi_lat] <defunct>
root     164374 164356  0 16:50 pts/0    00:00:06 [osu_multi_lat] <defunct>
root     164375 164356  0 16:50 pts/0    00:00:06 [osu_multi_lat] <defunct>
root     164378 164356  0 16:50 pts/0    00:00:06 [osu_multi_lat] <defunct>
root     164379 164356  0 16:50 pts/0    00:00:06 [osu_multi_lat] <defunct>
....

(2)
gdb -p 164356
...

Missing separate debuginfos, use: zypper install glibc-debuginfo-2.19-17.72.x86_64
(gdb) bt
#0  0x00007f143177a3cd in poll () from /lib64/libc.so.6
#1  0x00007f14325e0636 in poll_dispatch () from /scratch/user/lib/libopen-pal.so.13
#2  0x00007f14325d77bf in opal_libevent2021_event_base_loop () from /scratch/user/lib/libopen-pal.so.13
#3  0x00000000004051cd in orterun (argc=7, argv=0x7fff8c4bb428) at orterun.c:1133
#4  0x0000000000403a8d in main (argc=7, argv=0x7fff8c4bb428) at main.c:13


(3) (remote nodes orted)
(gdb) bt
#0  0x00007f8c288d33b0 in __poll_nocancel () from /lib64/libc.so.6
#1  0x00007f8c29941186 in poll_dispatch () from /scratch/user/lib/libopen-pal.so.13
#2  0x00007f8c2993830f in opal_libevent2021_event_base_loop () from /scratch/user/lib/libopen-pal.so.13
#3  0x00007f8c29be75c4 in orte_daemon () from /scratch/user/lib/libopen-rte.so.12
#4  0x0000000000400827 in main ()


Thanks,

_MAC
