Matias:

It looks like this is a duplicate of a reported issue (although with 
significantly more detail): https://github.com/open-mpi/ompi/issues/1136

> On Feb 3, 2016, at 8:40 PM, Cabral, Matias A <matias.a.cab...@intel.com> 
> wrote:
> 
> Hi, 
>  
> I have hit an issue in which orted hangs during the finalization of a job. 
> It reproduces when running 80 ranks per node (yes, oversubscribed) on a 
> 4-node cluster that runs SLES12 with OMPI 1.10.2 (I also see it with 1.10.0). 
> It is independent of the binary used (I used a very simple sample that 
> initializes the ranks, does nothing, and finalizes), and I saw that the hang 
> happens after MPI_Finalize(). The issue is not deterministic and takes a few 
> attempts to reproduce. When the hang occurs, the mpirun process does not get 
> to wait() on its children (see (1) below) and is stuck in a poll() (see (2) 
> below). I logged in to the other nodes and found that all the “other” orted 
> processes are also blocked in a different poll() (see (3) below). After 
> attaching gdb to mpirun and letting it continue, the execution finishes with 
> no issues. The same happens when sending SIGSTOP and then SIGCONT to the hung 
> mpirun.
>  
> (1)
> root     164356 161186  0 16:50 pts/0    00:00:00 mpirun -np 320 
> --allow-run-as-root -machinefile ./user/hostfile /scratch/user/osu_multi_lat
> root     164358 164356  0 16:50 pts/0    00:00:00 /usr/bin/ssh -x n3     
> PATH=/scratch/user/bin:$PATH ; export PATH ; 
> LD_LIBRARY_PATH=/scratch/user/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; 
> DYLD
> root     164359 164356  0 16:50 pts/0    00:00:00 /usr/bin/ssh -x n2     
> PATH=/scratch/user/bin:$PATH ; export PATH ; 
> LD_LIBRARY_PATH=/scratch/user/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; 
> DYLD
> root     164361 164356  0 16:50 pts/0    00:00:06 [osu_multi_lat] <defunct>
> root     164362 164356  0 16:50 pts/0    00:00:06 [osu_multi_lat] <defunct>
> root     164364 164356  0 16:50 pts/0    00:00:06 [osu_multi_lat] <defunct>
> root     164365 164356  0 16:50 pts/0    00:00:06 [osu_multi_lat] <defunct>
> root     164366 164356  0 16:50 pts/0    00:00:06 [osu_multi_lat] <defunct>
> root     164367 164356  0 16:50 pts/0    00:00:06 [osu_multi_lat] <defunct>
> root     164370 164356  0 16:50 pts/0    00:00:06 [osu_multi_lat] <defunct>
> root     164372 164356  0 16:50 pts/0    00:00:06 [osu_multi_lat] <defunct>
> root     164374 164356  0 16:50 pts/0    00:00:06 [osu_multi_lat] <defunct>
> root     164375 164356  0 16:50 pts/0    00:00:06 [osu_multi_lat] <defunct>
> root     164378 164356  0 16:50 pts/0    00:00:06 [osu_multi_lat] <defunct>
> root     164379 164356  0 16:50 pts/0    00:00:06 [osu_multi_lat] <defunct>
> ….
>  
> (2)
> gdb -p 164356
> …
>  
> Missing separate debuginfos, use: zypper install 
> glibc-debuginfo-2.19-17.72.x86_64
> (gdb) bt
> #0  0x00007f143177a3cd in poll () from /lib64/libc.so.6
> #1  0x00007f14325e0636 in poll_dispatch () from 
> /scratch/user/lib/libopen-pal.so.13
> #2  0x00007f14325d77bf in opal_libevent2021_event_base_loop () from 
> /scratch/user/lib/libopen-pal.so.13
> #3  0x00000000004051cd in orterun (argc=7, argv=0x7fff8c4bb428) at 
> orterun.c:1133
> #4  0x0000000000403a8d in main (argc=7, argv=0x7fff8c4bb428) at main.c:13
>  
>  
> (3) (remote nodes orted)
> (gdb) bt
> #0  0x00007f8c288d33b0 in __poll_nocancel () from /lib64/libc.so.6
> #1  0x00007f8c29941186 in poll_dispatch () from 
> /scratch/user/lib/libopen-pal.so.13
> #2  0x00007f8c2993830f in opal_libevent2021_event_base_loop () from 
> /scratch/user/lib/libopen-pal.so.13
> #3  0x00007f8c29be75c4 in orte_daemon () from 
> /scratch/user/lib/libopen-rte.so.12
> #4  0x0000000000400827 in main ()
>  
>  
> Thanks,
>  
> _MAC
>  
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2016/02/18542.php

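For readers unfamiliar with the `<defunct>` entries in listing (1): they are children that have exited but whose parent has not yet called wait() on them. The following is a minimal Python sketch of that state, not Open MPI code. It assumes Linux (it reads `/proc/<pid>/stat`), and the function names `child_state` and `observe_unreaped_child` are made up for illustration.

```python
import subprocess
import sys
import time


def child_state(pid):
    """Return the one-letter process state from /proc/<pid>/stat (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        stat = f.read()
    # The state field is the first token after the parenthesised command name.
    return stat.rsplit(")", 1)[1].split()[0]


def observe_unreaped_child(timeout=5.0):
    """Start a child that exits at once, and look at it before reaping it."""
    proc = subprocess.Popen([sys.executable, "-c", "pass"])
    deadline = time.monotonic() + timeout
    state = child_state(proc.pid)
    # Until the parent waits, an exited child stays in state "Z",
    # which ps reports as <defunct>.
    while state != "Z" and time.monotonic() < deadline:
        time.sleep(0.01)
        state = child_state(proc.pid)
    exit_code = proc.wait()  # reaping removes the zombie entry
    return state, exit_code
```

In the report above, mpirun plays the role of the parent that never reaches its wait() call, so its exited ranks remain listed as defunct.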

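The SIGSTOP/SIGCONT observation can be scripted when trying to reproduce the hang. This is only a sketch of the manual step described in the report, under the assumption that the PID of the hung mpirun is already known; the helper name `nudge` and its `pause` parameter are hypothetical, and nothing here explains why the resume unblocks the process.

```python
import os
import signal
import time


def nudge(pid, pause=0.1):
    """Stop and then resume a process, like `kill -STOP` followed by `kill -CONT`."""
    os.kill(pid, signal.SIGSTOP)  # suspend the target process
    time.sleep(pause)             # short gap between the two signals
    os.kill(pid, signal.SIGCONT)  # resume the target process
```

With the PID from listing (1), the call would be `nudge(164356)`. This is a diagnostic aid for confirming the symptom, not a fix.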
-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
