Do you know if the rdmacm CPC is actually being used for your connection setup 
(vs. the other CPCs supported by IB)? iWARP only supports rdmacm, so maybe 
that's the difference?
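
If I'm remembering the 1.6 knobs right, adding something like this to the 
mpirun line should show which CPC each port actually selects 
(btl_openib_cpc_include restricts the CPC list; btl_base_verbose turns on the 
component-selection chatter):

--mca btl_openib_cpc_include rdmacm --mca btl_base_verbose 100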

Steve.

On 1/28/2013 1:47 PM, Ralph Castain wrote:
Nope - still works just fine. I didn't get that warning at all, and it ran 
to completion without a problem.

I suspect the problem is that the system I can use just isn't configured like 
yours, and so I can't trigger the problem. Afraid I can't be of help after 
all... :-(


On Jan 28, 2013, at 11:25 AM, Steve Wise <sw...@opengridcomputing.com> wrote:

On 1/28/2013 12:48 PM, Ralph Castain wrote:
Hmmm... I'm afraid I cannot replicate this using the current state of the 1.6 
branch (which is the 1.6.4rcN) on the only IB-based cluster I can access.

Can you try it with a 1.6.4 tarball and see if you still see the problem? Could 
be someone already fixed it.
I still hit it on 1.6.4rc2.

Note that iWARP != IB, so you may not see this issue on IB systems for various 
reasons.  Did you use the same mpirun line? Namely, with this option:

--mca btl_openib_ipaddr_include "192.168.170.0/24"

(adjusted to your network config).

If I don't use ipaddr_include, I don't see this issue on my setup.

Also, did you see these messages logged?

Right after starting the job:

--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:           hpc-hn1.ogc.int
  Local device:         cxgb4_0
  Local port:           2
  CPCs attempted:       oob, xoob, rdmacm
--------------------------------------------------------------------------
...

At the end of the job:

[hpc-hn1.ogc.int:07850] 5 more processes have sent help message 
help-mpi-btl-openib-cpc-base.txt / no cpcs for port


I think these are benign, but they probably indicate a bug: the mpirun line is 
restricting the job to port 1 only, so the CPCs shouldn't be attempting port 2...
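
(If the 1.6 if_include syntax takes device:port the way I recall, something 
like

--mca btl_openib_if_include cxgb4_0:1

should keep the CPCs off port 2 entirely as a workaround - but they still 
shouldn't be probing a port the ipaddr filter already excluded.)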

Steve.


On Jan 28, 2013, at 10:03 AM, Steve Wise <sw...@opengridcomputing.com> wrote:

On 1/28/2013 11:48 AM, Ralph Castain wrote:
On Jan 28, 2013, at 9:12 AM, Steve Wise <sw...@opengridcomputing.com> wrote:

On 1/25/2013 12:19 PM, Steve Wise wrote:
Hello,

I'm tracking an issue I see in openmpi-1.6.3.  Running this command on my 
Chelsio iWARP/RDMA setup causes a segfault every time:

/usr/mpi/gcc/openmpi-1.6.3-dbg/bin/mpirun --np 2 --host hpc-hn1,hpc-cn2 --mca btl 
openib,sm,self --mca btl_openib_ipaddr_include "192.168.170.0/24" 
/usr/mpi/gcc/openmpi-1.6.3/tests/IMB-3.2/IMB-MPI1 pingpong

The segfault is during finalization, and I've debugged this to the point where 
I see a call to dereg_mem() after the openib btl is unloaded via dlclose().  
dereg_mem() dereferences a function pointer to call the btl-specific dereg 
function, in this case openib_dereg_mr().  However, since that btl has already 
been unloaded, the dereference causes a segfault.  It happens every time with 
the above MPI job.
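
For anyone following along, the failure mode is the classic dangling function 
pointer after dlclose(). A minimal standalone C sketch of the same crash (not 
Open MPI code - the library name and fake_dereg_mr symbol are made up, and it 
assumes a trivial libfake_btl.so that exports that symbol):

#include <dlfcn.h>
#include <stdio.h>

/* Stand-in for the cached BTL dereg entry point (like openib_dereg_mr). */
typedef int (*dereg_fn_t)(void *reg);

int main(void)
{
    /* Stand-in for loading the openib BTL component. */
    void *btl = dlopen("./libfake_btl.so", RTLD_NOW);
    if (!btl) { fprintf(stderr, "%s\n", dlerror()); return 1; }

    /* The mpool caches a pointer into the BTL's code... */
    dereg_fn_t dereg = (dereg_fn_t) dlsym(btl, "fake_dereg_mr");
    if (!dereg) { fprintf(stderr, "%s\n", dlerror()); return 1; }

    dlclose(btl);   /* ...mca_pml_base_close() unloads the BTL... */

    dereg(NULL);    /* ...and mca_mpool_base_close() -> dereg_mem() then
                     * calls through the stale pointer: SIGSEGV, since the
                     * code pages have been unmapped. */
    return 0;
}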

Now, I tried this same experiment with openmpi-1.7rc6 and I don't see the 
segfault, nor do I see a call to dereg_mem() after the openib btl is unloaded.  
That's all well and good. :)  But I'd like to get this fix pushed into 1.6, 
since that is the current stable release.

Question:  Can someone point me to the fix in 1.7?

Thanks,

Steve.
It appears that in ompi_mpi_finalize(), mca_pml_base_close() is called, which 
unloads the openib btl.  Then, further down in ompi_mpi_finalize(), 
mca_mpool_base_close() is called, which ends up calling dereg_mem(), which 
segfaults trying to call into the unloaded openib btl.
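
In other words, paraphrasing the 1.6 teardown order as I read it (not the 
literal source):

ompi_mpi_finalize()
{
    ...
    mca_pml_base_close();    /* closes the PML, which dlclose()s the
                              * openib BTL component */
    ...
    mca_mpool_base_close();  /* tears down leftover registrations via
                              * dereg_mem(), which jumps through a pointer
                              * to openib_dereg_mr() - now unmapped */
    ...
}

So either the dereg calls need to happen before the BTL's code is unloaded, or 
the unload needs to be deferred until after the mpool teardown.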

That definitely sounds like a bug.

Anybody have thoughts?  Anybody care? :)
I care! It needs to be fixed - I'll take a look. Probably something that 
someone forgot to CMR over to the release branch.
Great!  If you want me to try out a fix or gather more debug info, just holler.

Thanks,

Steve.

