Re: [OMPI devel] openib unloaded before last mem dereg

Steve Wise Mon, 28 Jan 2013 14:25:43 -0500

On 1/28/2013 12:48 PM, Ralph Castain wrote:

Hmmm...afraid I cannot replicate this using the current state of the 1.6 branch 
(which is the 1.6.4rcN) on the only IB-based cluster I can access.


Can you try it with a 1.6.4 tarball and see if you still see the problem? Could 
be someone already fixed it.


I still hit it on 1.6.4rc2.

Note iWARP != IB so you may not have this issue on IB systems for various reasons. Did you use the same mpirun line? Namely using this:


 --mca btl_openib_ipaddr_include "192.168.170.0/24"

(adjusted to your network config).

Because if I don't use ipaddr_include, then I don't see this issue on my setup.


Also, did you see these logged:

Right after starting the job:

--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:           hpc-hn1.ogc.int
  Local device:         cxgb4_0
  Local port:           2
  CPCs attempted:       oob, xoob, rdmacm
--------------------------------------------------------------------------
...

At the end of the job:

[hpc-hn1.ogc.int:07850] 5 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port

I think these are benign, but prolly indicate a bug: the mpirun is restricting the job to use port 1 only, so the CPCs shouldn't be attempting port 2...


Steve.


On Jan 28, 2013, at 10:03 AM, Steve Wise <sw...@opengridcomputing.com> wrote:

On 1/28/2013 11:48 AM, Ralph Castain wrote:

On Jan 28, 2013, at 9:12 AM, Steve Wise <sw...@opengridcomputing.com> wrote:

On 1/25/2013 12:19 PM, Steve Wise wrote:

Hello,

I'm tracking an issue I see in openmpi-1.6.3.  Running this command on my 
Chelsio iWARP/RDMA setup causes a seg fault every time:

/usr/mpi/gcc/openmpi-1.6.3-dbg/bin/mpirun --np 2 --host hpc-hn1,hpc-cn2 --mca btl openib,sm,self --mca btl_openib_ipaddr_include "192.168.170.0/24" /usr/mpi/gcc/openmpi-1.6.3/tests/IMB-3.2/IMB-MPI1 pingpong

The segfault is during finalization, and I've debugged this to the point where I 
see a call to dereg_mem() after the openib btl is unloaded via dlclose().  
dereg_mem() dereferences a function pointer to call the btl-specific dereg 
function, in this case it is openib_dereg_mr().  However, since that btl has 
already been unloaded, the deref causes a seg fault.  Happens every time with 
the above mpi job.

Now, I tried this same experiment with openmpi-1.7rc6 and I don't see the seg 
fault, and I don't see a call to dereg_mem() after the openib btl is unloaded.  
That's all well and good. :)  But I'd like to get this fix pushed into 1.6 since 
that is the current stable release.

Question:  Can someone point me to the fix in 1.7?

Thanks,

Steve.

It appears that in ompi_mpi_finalize(), mca_pml_base_close() is called which 
unloads the openib btl.  Then further down in ompi_mpi_finalize(), 
mca_mpool_base_close() is called which ends up calling dereg_mem() which seg 
faults trying to call into the unloaded openib btl.
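
To make the ordering problem above concrete, here is a small stand-alone C sketch. It is only a model under stated assumptions: every name in it (mock_mpool, mock_btl_unload, finalize_unload_first, and so on) is hypothetical and is not Open MPI code, and the "unload" is a flag rather than a real dlclose(), so the sketch reports -1 at the point where the real process would jump through a stale function pointer and seg fault. The second ordering is just one conceivable way to avoid the stale call; it is not claimed to be what 1.7 actually does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not Open MPI's real types. */
typedef int (*dereg_fn_t)(void *reg);

struct mock_mpool {
    dereg_fn_t dereg_mem; /* function pointer into the "BTL" */
    int btl_loaded;       /* 1 while the owning component is still loaded */
    int live_regs;        /* registrations not yet released */
};

/* Plays the role of openib_dereg_mr(): code that lives in the component. */
static int mock_btl_dereg_mr(void *reg)
{
    (void)reg;
    return 0;
}

void mock_setup(struct mock_mpool *mp, int regs)
{
    assert(mp != NULL);
    mp->dereg_mem = mock_btl_dereg_mr;
    mp->btl_loaded = 1;
    mp->live_regs = regs;
}

/* Plays the role of dlclose() on the component. The pointer is left in
 * place on purpose: after a real dlclose() it would be dangling. */
void mock_btl_unload(struct mock_mpool *mp)
{
    mp->btl_loaded = 0;
}

/* Plays the role of the mpool close path. Returns the number of
 * registrations released, or -1 where the real code would call through
 * a stale pointer and crash. */
int mock_mpool_close(struct mock_mpool *mp)
{
    int released = 0;
    while (mp->live_regs > 0) {
        if (!mp->btl_loaded) {
            return -1; /* real process: seg fault here */
        }
        if (mp->dereg_mem(NULL) != 0) {
            return -1;
        }
        mp->live_regs--;
        released++;
    }
    return released;
}

/* Order described in the thread: component unloaded first. */
int finalize_unload_first(struct mock_mpool *mp)
{
    mock_btl_unload(mp);
    return mock_mpool_close(mp);
}

/* Alternative order: release registrations while the code is loaded. */
int finalize_dereg_first(struct mock_mpool *mp)
{
    int released = mock_mpool_close(mp);
    mock_btl_unload(mp);
    return released;
}
```

With registrations still outstanding, finalize_unload_first() hits the stale-pointer case, while finalize_dereg_first() releases them all before the component goes away. With no outstanding registrations either order is harmless, which may be part of why the crash depends on the setup.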

That definitely sounds like a bug

Anybody have thoughts?  Anybody care? :)

I care! It needs to be fixed - I'll take a look. Probably something that forgot 
to be cmr'd.

Great!  If you want me to try out a fix or gather more debug, just holler.

Thanks,

Steve.
