Hmmm...afraid I cannot replicate this using the current state of the 1.6 branch 
(which is the 1.6.4rcN) on the only IB-based cluster I can access.

Can you try it with a 1.6.4 tarball and see if you still see the problem? Could 
be someone already fixed it.


On Jan 28, 2013, at 10:03 AM, Steve Wise <sw...@opengridcomputing.com> wrote:

> On 1/28/2013 11:48 AM, Ralph Castain wrote:
>> On Jan 28, 2013, at 9:12 AM, Steve Wise <sw...@opengridcomputing.com> wrote:
>> 
>>> On 1/25/2013 12:19 PM, Steve Wise wrote:
>>>> Hello,
>>>> 
>>>> I'm tracking an issue I see in openmpi-1.6.3.  Running this command on my 
>>>> chelsio iwarp/rdma setup causes a seg fault every time:
>>>> 
>>>> /usr/mpi/gcc/openmpi-1.6.3-dbg/bin/mpirun --np 2 --host hpc-hn1,hpc-cn2 
>>>> --mca btl openib,sm,self --mca btl_openib_ipaddr_include 
>>>> "192.168.170.0/24" /usr/mpi/gcc/openmpi-1.6.3/tests/IMB-3.2/IMB-MPI1 
>>>> pingpong
>>>> 
>>>> The segfault is during finalization, and I've debugged this to the point 
>>>> where I see a call to dereg_mem() after the openib btl is unloaded via 
>>>> dlclose().  dereg_mem() dereferences a function pointer to call the 
>>>> btl-specific dereg function, in this case it is openib_dereg_mr().  
>>>> However, since that btl has already been unloaded, the deref causes a seg 
>>>> fault.  Happens every time with the above mpi job.
>>>> 
>>>> Now, I tried this same experiment with openmpi-1.7rc6 and I don't see the 
>>>> seg fault, and I don't see a call to dereg_mem() after the openib btl is 
>>>> unloaded.  That's all well and good. :)  But I'd like to get this fix pushed 
>>>> into 1.6 since that is the current stable release.
>>>> 
>>>> Question:  Can someone point me to the fix in 1.7?
>>>> 
>>>> Thanks,
>>>> 
>>>> Steve.
>>> It appears that in ompi_mpi_finalize(), mca_pml_base_close() is called 
>>> which unloads the openib btl.  Then further down in ompi_mpi_finalize(), 
>>> mca_mpool_base_close() is called which ends up calling dereg_mem() which 
>>> seg faults trying to call into the unloaded openib btl.
>>> 
>> That definitely sounds like a bug
>> 
>>> Anybody have thoughts?  Anybody care? :)
>> I care! It needs to be fixed - I'll take a look. Probably something that 
>> someone forgot to cmr.
> 
> Great!  If you want me to try out a fix or gather more debug, just holler.
> 
> Thanks,
> 
> Steve.
> 

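The ordering problem described in the quoted messages (a component is unloaded, and a later cleanup step still calls through one of its function pointers) can be sketched in a few lines of C. This is only an illustration under stated assumptions, not Open MPI code and not the actual 1.7 fix: `struct component`, `struct mpool`, `fake_dereg_mr()`, and the two `finalize_*_order()` helpers are all hypothetical names, and a `loaded` flag stands in for dlopen()/dlclose() so the sketch reports the bad call instead of crashing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a dynamically loaded component (e.g. a BTL). */
typedef int (*dereg_fn_t)(void *reg);

struct component {
    int        loaded;  /* 1 while the "shared object" is mapped */
    dereg_fn_t dereg;   /* only safe to call while loaded == 1 */
};

/* Hypothetical memory pool that still holds registrations owned by a component. */
struct mpool {
    struct component *owner;
    int               live_regs;
};

/* Stand-in for the component-specific deregistration routine. */
static int fake_dereg_mr(void *reg) { (void)reg; return 0; }

/* Simulates loading the component: the function pointer becomes valid. */
void component_open(struct component *c)
{
    assert(c != NULL);
    c->loaded = 1;
    c->dereg  = fake_dereg_mr;
}

/* Simulates dlclose(): after this, c->dereg must not be called. */
void component_close(struct component *c)
{
    assert(c != NULL);
    c->loaded = 0;
    c->dereg  = NULL;
}

/* Releases all registrations. Returns -1 where real code would have
 * jumped into unloaded memory, 0 when every registration was released. */
int mpool_close(struct mpool *p)
{
    while (p->live_regs > 0) {
        if (!p->owner->loaded || p->owner->dereg == NULL)
            return -1;
        if (p->owner->dereg(NULL) != 0)
            return -1;
        p->live_regs--;
    }
    return 0;
}

/* Order reported in the thread: unload the component, then close the pool. */
int finalize_buggy_order(void)
{
    struct component c;
    struct mpool p = { &c, 2 };
    component_open(&c);
    component_close(&c);
    return mpool_close(&p);
}

/* One possible safe order: release registrations first, then unload. */
int finalize_safe_order(void)
{
    struct component c;
    struct mpool p = { &c, 2 };
    int rc;
    component_open(&c);
    rc = mpool_close(&p);
    component_close(&c);
    return rc;
}
```

In this sketch finalize_buggy_order() reports the failure and finalize_safe_order() succeeds. Reordering the teardown is only one way to avoid the stale call; whether 1.7 does that or something else is not established in this thread.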
