On Jan 28, 2013, at 9:12 AM, Steve Wise <sw...@opengridcomputing.com> wrote:
> On 1/25/2013 12:19 PM, Steve Wise wrote:
>> Hello,
>>
>> I'm tracking an issue I see in openmpi-1.6.3.  Running this command on my
>> chelsio iwarp/rdma setup causes a seg fault every time:
>>
>> /usr/mpi/gcc/openmpi-1.6.3-dbg/bin/mpirun --np 2 --host hpc-hn1,hpc-cn2
>> --mca btl openib,sm,self --mca btl_openib_ipaddr_include "192.168.170.0/24"
>> /usr/mpi/gcc/openmpi-1.6.3/tests/IMB-3.2/IMB-MPI1 pingpong
>>
>> The segfault is during finalization, and I've debugged this to the point
>> where I see a call to dereg_mem() after the openib btl is unloaded via
>> dlclose().  dereg_mem() dereferences a function pointer to call the
>> btl-specific dereg function, in this case it is openib_dereg_mr().  However,
>> since that btl has already been unloaded, the deref causes a seg fault.
>> Happens every time with the above mpi job.
>>
>> Now, I tried this same experiment with openmpi-1.7rc6 and I don't see the
>> seg fault, and I don't see a call to dereg_mem() after the openib btl is
>> unloaded.  That's all well and good. :)  But I'd like to get this fix pushed
>> into 1.6 since that is the current stable release.
>>
>> Question:  Can someone point me to the fix in 1.7?
>>
>> Thanks,
>>
>> Steve.
>
> It appears that in ompi_mpi_finalize(), mca_pml_base_close() is called which
> unloads the openib btl.  Then further down in ompi_mpi_finalize(),
> mca_mpool_base_close() is called which ends up calling dereg_mem() which seg
> faults trying to call into the unloaded openib btl.
>

That definitely sounds like a bug.

> Anybody have thoughts?  Anybody care? :)

I care! It needs to be fixed - I'll take a look. Probably something that forgot to be cmr'd.

Thanks
Ralph

>
> Steve.
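
For readers following the thread, here is a minimal sketch of the ordering problem Steve describes. It is not Open MPI code: fake_btl_t, fake_mpool_t, and every function below are hypothetical stand-ins. The "unload" step only clears a flag and a pointer, so the bad ordering returns -1 where the real code would segfault. Closing the pool before the component is shown as one possible remedy; the thread does not say how 1.7 actually avoids the problem.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a dynamically loaded BTL component. */
typedef struct {
    int loaded;                  /* 1 while the "shared object" is mapped */
    int (*dereg)(void *region);  /* callback that lives inside the component */
} fake_btl_t;

/* Hypothetical stand-in for a memory pool with outstanding registrations. */
typedef struct {
    fake_btl_t *owner;  /* component whose callback releases each region */
    int registered;     /* number of regions still registered */
} fake_mpool_t;

/* Plays the role of the btl-specific deregistration function. */
static int fake_openib_dereg(void *region) {
    (void)region;
    return 0;
}

static void btl_open(fake_btl_t *btl) {
    assert(btl != NULL);
    btl->loaded = 1;
    btl->dereg = fake_openib_dereg;
}

/* Models dlclose(): afterwards the callback must never be invoked. */
static void btl_close(fake_btl_t *btl) {
    assert(btl != NULL);
    btl->loaded = 0;
    btl->dereg = NULL;
}

/* Releases every region through the owner's callback.
 * Returns 0 on success, -1 if the owning component is already gone
 * (the situation that crashes in the real report). */
static int mpool_close(fake_mpool_t *pool) {
    while (pool->registered > 0) {
        if (!pool->owner->loaded || pool->owner->dereg == NULL) {
            return -1;
        }
        if (pool->owner->dereg(NULL) != 0) {
            return -1;
        }
        pool->registered--;
    }
    return 0;
}

/* Ordering from the report: component unloaded, then pool closed. */
int finalize_btl_first(void) {
    fake_btl_t btl;
    fake_mpool_t pool;
    btl_open(&btl);
    pool.owner = &btl;
    pool.registered = 2;
    btl_close(&btl);
    return mpool_close(&pool);
}

/* Alternative ordering: pool closed while the component is still loaded. */
int finalize_mpool_first(void) {
    fake_btl_t btl;
    fake_mpool_t pool;
    int rc;
    btl_open(&btl);
    pool.owner = &btl;
    pool.registered = 2;
    rc = mpool_close(&pool);
    btl_close(&btl);
    return rc;
}
```

In this sketch finalize_btl_first() fails because the pool still holds registrations whose release callback belonged to the unloaded component, while finalize_mpool_first() drains the pool first and succeeds.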