Nope - still works just fine. I didn't receive that warning at all, and it ran to completion without problem.
I suspect the problem is that the system I can use just isn't configured like yours, and so I can't trigger the problem. Afraid I can't be of help after all... :-(

On Jan 28, 2013, at 11:25 AM, Steve Wise <sw...@opengridcomputing.com> wrote:

> On 1/28/2013 12:48 PM, Ralph Castain wrote:
>> Hmmm...afraid I cannot replicate this using the current state of the 1.6
>> branch (which is the 1.6.4rcN) on the only IB-based cluster I can access.
>>
>> Can you try it with a 1.6.4 tarball and see if you still see the problem?
>> Could be someone already fixed it.
>
> I still hit it on 1.6.4rc2.
>
> Note iWARP != IB so you may not have this issue on IB systems for various
> reasons. Did you use the same mpirun line? Namely using this:
>
> --mca btl_openib_ipaddr_include "192.168.170.0/24"
>
> (adjusted to your network config).
>
> Because if I don't use ipaddr_include, then I don't see this issue on my
> setup.
>
> Also, did you see these logged:
>
> Right after starting the job:
>
> --------------------------------------------------------------------------
> No OpenFabrics connection schemes reported that they were able to be
> used on a specific port. As such, the openib BTL (OpenFabrics
> support) will be disabled for this port.
>
> Local host: hpc-hn1.ogc.int
> Local device: cxgb4_0
> Local port: 2
> CPCs attempted: oob, xoob, rdmacm
> --------------------------------------------------------------------------
> ...
>
> At the end of the job:
>
> [hpc-hn1.ogc.int:07850] 5 more processes have sent help message
> help-mpi-btl-openib-cpc-base.txt / no cpcs for port
>
>
> I think these are benign, but prolly indicate a bug: the mpirun is
> restricting the job to use port 1 only, so the CPCs shouldn't be attempting
> port 2...
>
> Steve.
>
>
>>
>> On Jan 28, 2013, at 10:03 AM, Steve Wise <sw...@opengridcomputing.com> wrote:
>>
>>> On 1/28/2013 11:48 AM, Ralph Castain wrote:
>>>> On Jan 28, 2013, at 9:12 AM, Steve Wise <sw...@opengridcomputing.com>
>>>> wrote:
>>>>
>>>>> On 1/25/2013 12:19 PM, Steve Wise wrote:
>>>>>> Hello,
>>>>>>
>>>>>> I'm tracking an issue I see in openmpi-1.6.3. Running this command on
>>>>>> my chelsio iwarp/rdma setup causes a seg fault every time:
>>>>>>
>>>>>> /usr/mpi/gcc/openmpi-1.6.3-dbg/bin/mpirun --np 2 --host hpc-hn1,hpc-cn2
>>>>>> --mca btl openib,sm,self --mca btl_openib_ipaddr_include
>>>>>> "192.168.170.0/24" /usr/mpi/gcc/openmpi-1.6.3/tests/IMB-3.2/IMB-MPI1
>>>>>> pingpong
>>>>>>
>>>>>> The segfault is during finalization, and I've debugged this to the point
>>>>>> where I see a call to dereg_mem() after the openib btl is unloaded via
>>>>>> dlclose(). dereg_mem() dereferences a function pointer to call the
>>>>>> btl-specific dereg function, in this case it is openib_dereg_mr().
>>>>>> However, since that btl has already been unloaded, the deref causes a
>>>>>> seg fault. Happens every time with the above mpi job.
>>>>>>
>>>>>> Now, I tried this same experiment with openmpi-1.7rc6 and I don't see
>>>>>> the seg fault, and I don't see a call to dereg_mem() after the openib
>>>>>> btl is unloaded. That's all well and good. :) But I'd like to get this fix
>>>>>> pushed into 1.6 since that is the current stable release.
>>>>>>
>>>>>> Question: Can someone point me to the fix in 1.7?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Steve.
>>>>> It appears that in ompi_mpi_finalize(), mca_pml_base_close() is called
>>>>> which unloads the openib btl. Then further down in ompi_mpi_finalize(),
>>>>> mca_mpool_base_close() is called which ends up calling dereg_mem() which
>>>>> seg faults trying to call into the unloaded openib btl.
>>>>>
>>>> That definitely sounds like a bug
>>>>
>>>>> Anybody have thoughts? Anybody care? :)
>>>> I care! It needs to be fixed - I'll take a look. Probably something that
>>>> forgot to be cmr'd.
>>> Great! If you want me to try out a fix or gather more debug, just holler.
>>>
>>> Thanks,
>>>
>>> Steve.
>>>
>
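
The ordering problem described in the quoted report (a component is unloaded, and a later teardown step then calls back into it through a stored function pointer) can be modeled in a few lines of C. This is only a sketch under stated assumptions, not Open MPI code: component_t, pool_t, fake_dereg and finalize are hypothetical names, and component_close() merely stands in for dlclose() by clearing the pointer, so the bad ordering returns -1 here instead of crashing.

```c
#include <stddef.h>

/* Hypothetical stand-in for a dynamically loaded component (e.g. a BTL). */
typedef struct component {
    int loaded;                  /* 1 while the module's code is still mapped */
    int (*dereg)(void *region);  /* callback that lives inside the module */
} component_t;

/* Hypothetical stand-in for a memory pool holding one registration. */
typedef struct pool {
    component_t *owner;  /* component whose callback releases the region */
    void *region;        /* non-NULL while a registration is outstanding */
} pool_t;

/* Placeholder for the component-specific deregistration routine. */
static int fake_dereg(void *region) { (void)region; return 0; }

void component_open(component_t *c) { c->loaded = 1; c->dereg = fake_dereg; }

/* Models dlclose(): after this, the callback must not be invoked. */
void component_close(component_t *c) { c->loaded = 0; c->dereg = NULL; }

/* Releases the outstanding registration through the owner's callback.
 * Returns 0 on success and -1 if the owner is already unloaded. The
 * failure described in the report has no such guard, so the call lands
 * in code that is no longer mapped. */
int pool_close(pool_t *p) {
    if (p->region == NULL) return 0;
    if (!p->owner->loaded || p->owner->dereg == NULL) return -1;
    int rc = p->owner->dereg(p->region);
    p->region = NULL;
    return rc;
}

/* Runs teardown in one of the two orders and reports whether it was safe. */
int finalize(int close_pool_first) {
    static int dummy;
    component_t c;
    pool_t p;

    component_open(&c);
    p.owner = &c;
    p.region = &dummy;

    if (close_pool_first) {
        int rc = pool_close(&p);  /* callback runs while code is loaded */
        component_close(&c);
        return rc;
    }
    component_close(&c);          /* component goes away first ... */
    return pool_close(&p);        /* ... so the callback is unusable */
}
```

Calling finalize(1) closes the pool before the component and returns 0; finalize(0) unloads the component first and returns -1, which marks the point where the unguarded call would jump into unloaded code.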