Out of curiosity, could you tell us how you configured OMPI?
On Jan 28, 2013, at 12:46 PM, Steve Wise <sw...@opengridcomputing.com> wrote:

> On 1/28/2013 2:04 PM, Ralph Castain wrote:
>> On Jan 28, 2013, at 11:55 AM, Steve Wise <sw...@opengridcomputing.com> wrote:
>>
>>> Do you know if the rdmacm CPC is really being used for your connection
>>> setup (vs other CPCs supported by IB)? Cuz iwarp only supports rdmacm.
>>> Maybe that's the difference?
>>
>> Dunno for certain, but I expect it is using the OOB cm since I didn't
>> direct it to do anything different. Like I said, I suspect the problem is
>> that the cluster doesn't have iWARP on it.
>
> Definitely, or it could be that the different CPC used for IW vs IB is
> tickling the issue.
>
>>> Steve.
>>>
>>> On 1/28/2013 1:47 PM, Ralph Castain wrote:
>>>> Nope - still works just fine. I didn't receive that warning at all, and
>>>> it ran to completion without problem.
>>>>
>>>> I suspect the problem is that the system I can use just isn't configured
>>>> like yours, and so I can't trigger the problem. Afraid I can't be of
>>>> help after all... :-(
>>>>
>>>> On Jan 28, 2013, at 11:25 AM, Steve Wise <sw...@opengridcomputing.com> wrote:
>>>>
>>>>> On 1/28/2013 12:48 PM, Ralph Castain wrote:
>>>>>> Hmmm... afraid I cannot replicate this using the current state of the
>>>>>> 1.6 branch (which is the 1.6.4rcN) on the only IB-based cluster I can
>>>>>> access.
>>>>>>
>>>>>> Can you try it with a 1.6.4 tarball and see if you still see the
>>>>>> problem? Could be someone already fixed it.
>>>>>
>>>>> I still hit it on 1.6.4rc2.
>>>>>
>>>>> Note iWARP != IB, so you may not have this issue on IB systems for
>>>>> various reasons. Did you use the same mpirun line? Namely, using this:
>>>>>
>>>>> --mca btl_openib_ipaddr_include "192.168.170.0/24"
>>>>>
>>>>> (adjusted to your network config). Because if I don't use
>>>>> ipaddr_include, then I don't see this issue on my setup.
>>>>>
>>>>> Also, did you see these logged?
>>>>>
>>>>> Right after starting the job:
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> No OpenFabrics connection schemes reported that they were able to be
>>>>> used on a specific port. As such, the openib BTL (OpenFabrics
>>>>> support) will be disabled for this port.
>>>>>
>>>>> Local host: hpc-hn1.ogc.int
>>>>> Local device: cxgb4_0
>>>>> Local port: 2
>>>>> CPCs attempted: oob, xoob, rdmacm
>>>>> --------------------------------------------------------------------------
>>>>> ...
>>>>>
>>>>> At the end of the job:
>>>>>
>>>>> [hpc-hn1.ogc.int:07850] 5 more processes have sent help message
>>>>> help-mpi-btl-openib-cpc-base.txt / no cpcs for port
>>>>>
>>>>> I think these are benign, but they probably indicate a bug: the mpirun
>>>>> line is restricting the job to port 1 only, so the CPCs shouldn't be
>>>>> attempting port 2...
>>>>>
>>>>> Steve.
>>>>>
>>>>>> On Jan 28, 2013, at 10:03 AM, Steve Wise <sw...@opengridcomputing.com> wrote:
>>>>>>
>>>>>>> On 1/28/2013 11:48 AM, Ralph Castain wrote:
>>>>>>>> On Jan 28, 2013, at 9:12 AM, Steve Wise <sw...@opengridcomputing.com> wrote:
>>>>>>>>
>>>>>>>>> On 1/25/2013 12:19 PM, Steve Wise wrote:
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> I'm tracking an issue I see in openmpi-1.6.3. Running this command
>>>>>>>>>> on my Chelsio iWARP/RDMA setup causes a seg fault every time:
>>>>>>>>>>
>>>>>>>>>> /usr/mpi/gcc/openmpi-1.6.3-dbg/bin/mpirun --np 2 --host hpc-hn1,hpc-cn2 \
>>>>>>>>>>     --mca btl openib,sm,self \
>>>>>>>>>>     --mca btl_openib_ipaddr_include "192.168.170.0/24" \
>>>>>>>>>>     /usr/mpi/gcc/openmpi-1.6.3/tests/IMB-3.2/IMB-MPI1 pingpong
>>>>>>>>>>
>>>>>>>>>> The seg fault is during finalization, and I've debugged this to the
>>>>>>>>>> point where I see a call to dereg_mem() after the openib btl is
>>>>>>>>>> unloaded via dlclose(). dereg_mem() dereferences a function pointer
>>>>>>>>>> to call the btl-specific dereg function, in this case
>>>>>>>>>> openib_dereg_mr(). However, since that btl has already been
>>>>>>>>>> unloaded, the deref causes a seg fault. Happens every time with the
>>>>>>>>>> above mpi job.
>>>>>>>>>>
>>>>>>>>>> Now, I tried this same experiment with openmpi-1.7rc6 and I don't
>>>>>>>>>> see the seg fault, and I don't see a call to dereg_mem() after the
>>>>>>>>>> openib btl is unloaded. That's all well and good. :) But I'd like
>>>>>>>>>> to get this fix pushed into 1.6 since that is the current stable
>>>>>>>>>> release.
>>>>>>>>>>
>>>>>>>>>> Question: can someone point me to the fix in 1.7?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Steve.
>>>>>>>>>
>>>>>>>>> It appears that in ompi_mpi_finalize(), mca_pml_base_close() is
>>>>>>>>> called, which unloads the openib btl. Then, further down in
>>>>>>>>> ompi_mpi_finalize(), mca_mpool_base_close() is called, which ends up
>>>>>>>>> calling dereg_mem(), which seg faults trying to call into the
>>>>>>>>> unloaded openib btl.
>>>>>>>>
>>>>>>>> That definitely sounds like a bug.
>>>>>>>>
>>>>>>>>> Anybody have thoughts? Anybody care? :)
>>>>>>>>
>>>>>>>> I care! It needs to be fixed - I'll take a look. Probably something
>>>>>>>> that forgot to be cmr'd.
>>>>>>>
>>>>>>> Great! If you want me to try out a fix or gather more debug, just
>>>>>>> holler.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Steve.
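For anyone trying to picture the failure mode Steve describes without digging
through the 1.6 source, here is a minimal standalone sketch of it. The file and
symbol names (plugin.so, plugin_dereg_mr) are hypothetical stand-ins, not actual
Open MPI code: a registration record caches a dereg callback that lives inside a
dlopen()'d plugin, the plugin is dlclose()'d first, and the later call through
the stale pointer jumps into unmapped code. Build with gcc, linking libdl.

#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for an mpool registration that caches a btl-provided dereg
 * callback (analogous to the pointer that dereg_mem() chases). */
struct registration {
    int (*dereg)(void *addr);   /* points into the plugin's text segment */
    void *addr;
};

int main(void)
{
    /* "./plugin.so" stands in for the openib btl shared object; it is
     * assumed to export a symbol named plugin_dereg_mr for this sketch. */
    void *plugin = dlopen("./plugin.so", RTLD_NOW);
    if (plugin == NULL) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }

    struct registration reg;
    reg.dereg = (int (*)(void *)) dlsym(plugin, "plugin_dereg_mr");
    if (reg.dereg == NULL) {
        fprintf(stderr, "dlsym failed: %s\n", dlerror());
        dlclose(plugin);
        return 1;
    }
    reg.addr = malloc(4096);

    /* ... the job runs; the memory stays registered ... */

    /* Finalization in the problematic order: */
    dlclose(plugin);       /* ~ mca_pml_base_close() unloading the openib btl */
    reg.dereg(reg.addr);   /* ~ mca_mpool_base_close() -> dereg_mem(): calls   */
                           /*   through a pointer into unloaded code -> SIGSEGV */

    free(reg.addr);
    return 0;
}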
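And just to make the ordering point concrete (this is a generic illustration of
the safe teardown order, not a claim about what the 1.7 branch actually
changed): if every cached registration is released while the plugin that owns
the callbacks is still loaded, the same call is harmless.

#include <dlfcn.h>
#include <stddef.h>
#include <stdlib.h>

/* Same hypothetical registration type as in the sketch above. */
struct registration {
    int (*dereg)(void *addr);
    void *addr;
};

/* Release all registrations first, then unload the plugin. */
static void finalize_safely(struct registration *regs, size_t nregs, void *plugin)
{
    for (size_t i = 0; i < nregs; i++) {
        regs[i].dereg(regs[i].addr);  /* callback target is still mapped */
        free(regs[i].addr);
    }
    dlclose(plugin);                  /* now safe: no stale pointers remain */
}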