Hello,
I'm tracking an issue I see in openmpi-1.6.3. Running this command on
my Chelsio iWARP/RDMA setup causes a seg fault every time:
/usr/mpi/gcc/openmpi-1.6.3-dbg/bin/mpirun --np 2 --host hpc-hn1,hpc-cn2
--mca btl openib,sm,self --mca btl_openib_ipaddr_include
"192.168.170.0/24" /usr/mpi/gcc/openmpi-1.6.3/tests/IMB-3.2/IMB-MPI1
pingpong
The segfault occurs during finalization. I've debugged this to the point
where I see a call to dereg_mem() after the openib btl has been unloaded
via dlclose(). dereg_mem() dereferences a function pointer to call the
btl-specific dereg function, which in this case is openib_dereg_mr().
However, since that btl has already been unloaded, the pointer refers to
unmapped memory and the call causes a seg fault. It happens every time
with the above mpi job.
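To make the ordering problem concrete, here is a minimal sketch that
models it without touching dlopen() or real hardware. Every name in it
(fake_component, fake_mpool, fake_unload, run_shutdown, and so on) is
hypothetical and invented for illustration; it is not Open MPI code and
not the 1.7 fix. The "loaded" flag stands in for whether the btl's
shared object is still mapped, and a "stale" count stands in for the
call that would jump into unmapped memory.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these names come from Open MPI. */
struct fake_component {
    int loaded;       /* 1 while the "shared object" is still mapped */
    int dereg_calls;  /* deregistrations that actually ran */
};

typedef int (*dereg_fn_t)(struct fake_component *c);

struct fake_mpool {
    struct fake_component *owner; /* component whose code holds dereg */
    dereg_fn_t dereg;             /* callback into the component */
    int pending;                  /* registrations still to release */
};

/* Plays the role of the btl-specific dereg function. */
static int fake_openib_dereg(struct fake_component *c)
{
    c->dereg_calls++;
    return 0;
}

/* Models dlclose(): after this, the callback must not be invoked. */
static void fake_unload(struct fake_component *c)
{
    c->loaded = 0;
}

/* Releases pending registrations. Returns how many could not be
 * released because the owning component was already unloaded; in the
 * real crash, each of those would be a call through a dangling
 * function pointer. */
static int fake_mpool_finalize(struct fake_mpool *m)
{
    int stale = 0;

    assert(m->dereg != NULL);
    while (m->pending > 0) {
        if (m->owner->loaded)
            m->dereg(m->owner);
        else
            stale++;
        m->pending--;
    }
    return stale;
}

enum shutdown_order { CLOSE_COMPONENT_FIRST, FINALIZE_MPOOL_FIRST };

/* Runs one shutdown in the given order and reports the stale calls. */
static int run_shutdown(enum shutdown_order order)
{
    struct fake_component c = { 1, 0 };
    struct fake_mpool m = { &c, fake_openib_dereg, 1 };
    int stale;

    if (order == CLOSE_COMPONENT_FIRST) {
        fake_unload(&c);                 /* the order I see in 1.6.3 */
        stale = fake_mpool_finalize(&m);
    } else {
        stale = fake_mpool_finalize(&m); /* mpool drained before unload */
        fake_unload(&c);
    }
    return stale;
}
```

In this model, run_shutdown(CLOSE_COMPONENT_FIRST) reports one stale
call and run_shutdown(FINALIZE_MPOOL_FIRST) reports none. Whether 1.7
avoids the crash by reordering, or by some other means, is exactly what
I don't know yet.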
Now, I tried this same experiment with openmpi-1.7rc6 and I don't see
the seg fault, and I don't see a call to dereg_mem() after the openib
btl is unloaded. That's all well and good. :) But I'd like to get this fix
pushed into 1.6 since that is the current stable release.
Question: Can someone point me to the fix in 1.7?
Thanks,
Steve.
The gory details:
Program terminated with signal 11, Segmentation fault.
#0 0x000000343140f807 in ?? () from /lib64/libgcc_s.so.1
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.7.el6.x86_64
libcxgb4-2.3.0.0-1.el6.x86_64
libgcc-4.4.4-13.el6.x86_64
(gdb) bt
#0 0x000000343140f807 in ?? () from /lib64/libgcc_s.so.1
#1 0x00000034314100b9 in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
#2 0x000000342e4f76ee in backtrace () from /lib64/libc.so.6
#3 0x00007f304d2233ce in opal_backtrace_buffer (message_out=0x7fff4364a0f0,
len_out=0x7fff4364a0fc) at backtrace_execinfo.c:57
#4 0x00007f304d2757ac in show_stackframe (signo=<value optimized out>,
info=<value optimized out>, p=<value optimized out>) at stacktrace.c:347
#5 <signal handler called>
#6 0x00007f304a1c9240 in ?? ()
#7 0x00007f304b48c315 in dereg_mem (mpool=0x233ade0) at mpool_rdma_module.c:87
#8 do_unregistration_gc (mpool=0x233ade0) at mpool_rdma_module.c:140
#9 0x00007f304b48c6cf in mca_mpool_rdma_finalize (mpool=0x233ade0)
at mpool_rdma_module.c:500
#10 0x00007f304d1b4e30 in mca_mpool_base_close () at base/mpool_base_close.c:56
#11 0x00007f304d169705 in ompi_mpi_finalize ()
at runtime/ompi_mpi_finalize.c:402
#12 0x0000000000403802 in main ()
(gdb)
Here's a snippet of stdout with debug printfs in vm_open(), vm_close(),
openib_dereg_mr(), and dereg_mem() showing the btl getting unloaded
before the last dereg call:
dlopen /usr/mpi/gcc/openmpi-1.6.3-dbg/lib/openmpi/mca_btl_openib.so
0x2465b60
...<snip>...
# All processes entering MPI_Finalize
dereg_mem 0x2487f80
openib_dereg_mr 0x24b3d80
dlclose 0x2465b60
dlclose 0x2456030
dlclose 0x2456550
dlclose 0x2467cd0
dlclose 0x24d90e0
dlclose 0x24d7740
dlclose 0x2410680
dlclose 0x2410ef0
dlclose 0x2411610
dlclose 0x248b1d0
dlclose 0x248bf00
dlclose 0x248c8f0
dereg_mem 0x2487f80
[hpc-hn1:05570] *** Process received signal ***
[hpc-hn1:05570] Signal: Segmentation fault (11)
[hpc-hn1:05570] Signal code: Address not mapped (1)
[hpc-hn1:05570] Failing at address: 0x7f335e5f9280