On Jan 28, 2013, at 9:12 AM, Steve Wise <sw...@opengridcomputing.com> wrote:

> On 1/25/2013 12:19 PM, Steve Wise wrote:
>> Hello,
>> 
>> I'm tracking an issue I see in openmpi-1.6.3.  Running this command on my 
>> chelsio iwarp/rdma setup causes a seg fault every time:
>> 
>> /usr/mpi/gcc/openmpi-1.6.3-dbg/bin/mpirun --np 2 --host hpc-hn1,hpc-cn2 
>> --mca btl openib,sm,self --mca btl_openib_ipaddr_include "192.168.170.0/24" 
>> /usr/mpi/gcc/openmpi-1.6.3/tests/IMB-3.2/IMB-MPI1 pingpong
>> 
>> The segfault is during finalization, and I've debugged this to the point 
>> where I see a call to dereg_mem() after the openib btl is unloaded via 
>> dlclose().  dereg_mem() dereferences a function pointer to call the 
>> btl-specific dereg function, which in this case is openib_dereg_mr().  However, 
>> since that btl has already been unloaded, the deref causes a seg fault.  
>> Happens every time with the above mpi job.
>> 
>> Now, I tried this same experiment with openmpi-1.7rc6 and I don't see the 
>> seg fault, and I don't see a call to dereg_mem() after the openib btl is 
>> unloaded.  That's all well and good. :)  But I'd like to get this fix pushed 
>> into 1.6 since that is the current stable release.
>> 
>> Question:  Can someone point me to the fix in 1.7?
>> 
>> Thanks,
>> 
>> Steve.
> 
> It appears that in ompi_mpi_finalize(), mca_pml_base_close() is called, which 
> unloads the openib btl.  Then, further down in ompi_mpi_finalize(), 
> mca_mpool_base_close() is called, which ends up calling dereg_mem(), which 
> seg faults trying to call into the unloaded openib btl.
> 

That definitely sounds like a bug.
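
To make the ordering problem concrete, here is a minimal sketch in C. It is 
not the Open MPI code and not the 1.7 fix. Every name in it (component_t, 
mpool_t, component_unload, mpool_close, finalize_unload_first, 
finalize_pool_first) is hypothetical, and the unload is simulated with a flag 
instead of a real dlclose() so the sketch reports the bad call rather than 
crashing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a dlopen()ed component such as the openib btl. */
typedef struct component {
    bool loaded;       /* false once the component has been "unloaded" */
    int  dereg_calls;  /* number of times the dereg callback has run */
} component_t;

/* Hypothetical stand-in for a memory pool that calls back into the component. */
typedef struct mpool {
    component_t *owner;           /* component whose code backs the callback */
    int (*dereg)(component_t *);  /* function pointer into the component */
    int registrations;            /* outstanding registrations to release */
} mpool_t;

/* The component-specific deregistration routine (lives "inside" the component). */
static int component_dereg(component_t *c)
{
    c->dereg_calls++;
    return 0;
}

/* Simulated dlclose(): after this, the component's code must not be called. */
void component_unload(component_t *c)
{
    c->loaded = false;
}

mpool_t make_pool(component_t *c, int registrations)
{
    assert(c != NULL);
    mpool_t m = { c, component_dereg, registrations };
    return m;
}

/* Release every outstanding registration through the callback.
 * Returns -1 where real code would jump into unmapped memory. */
int mpool_close(mpool_t *m)
{
    while (m->registrations > 0) {
        if (!m->owner->loaded) {
            return -1;  /* callback target is gone: this is the seg fault */
        }
        m->dereg(m->owner);
        m->registrations--;
    }
    return 0;
}

/* Order described in the report: unload the component, then close the pool. */
int finalize_unload_first(component_t *c, mpool_t *m)
{
    component_unload(c);
    return mpool_close(m);
}

/* Safe order: drain the pool while the component is still loaded. */
int finalize_pool_first(component_t *c, mpool_t *m)
{
    int rc = mpool_close(m);
    component_unload(c);
    return rc;
}
```

With two outstanding registrations, finalize_unload_first() returns -1 
without running the callback, while finalize_pool_first() returns 0 after 
running it twice. Whether the right fix is to reorder the close calls or to 
drain the registrations before the btl is unloaded is a separate question 
that this sketch does not answer.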

> Anybody have thoughts?  Anybody care? :)

I care! It needs to be fixed - I'll take a look. Probably something that 
never got cmr'd.

Thanks
Ralph

> 
> Steve.
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

