It's on the ticket that I just assigned to you. :-)
On Jan 29, 2013, at 10:03 AM, Steve Wise <sw...@opengridcomputing.com> wrote:

> Will do...once I get a patch.
>
> Steve
>
> On 1/29/2013 7:40 AM, Jeff Squyres (jsquyres) wrote:
>> Thanks Josh.
>>
>> Steve -- if you can confirm that this fixes your problem in the v1.6 series, we'll go ahead and commit the patch.
>>
>> FWIW: the OpenFabrics startup code got a little cleanup/revamp on trunk/v1.7 -- I suspect that's why you're not seeing the problem on trunk/v1.7 (e.g., look at the utility routines that were abstracted out to ompi/mca/common/verbs).
>>
>> On Jan 29, 2013, at 2:41 AM, Joshua Ladd <josh...@mellanox.com> wrote:
>>
>>> So, we (Mellanox) have observed this ourselves when no suitable CPC can be found. It seems the BTL associated with this port is not destroyed and the reference count is not decreased. Not sure why you don't see the problem in 1.7, but we have a patch that I'll CMR today. Please review our symptoms, diagnosis, and proposed change. Ralph, maybe I can list you as a reviewer of the patch? I've reviewed it myself and it looks fine, but I wouldn't mind another set of eyes on it since I don't want to be responsible for breaking the OpenIB BTL.
>>>
>>> Thanks,
>>>
>>> Josh Ladd
>>>
>>> Reported by Yossi:
>>>
>>> Hi,
>>>
>>> There is a bug in Open MPI (the openib component) when one of the active ports is Ethernet. The fix is attached; it probably needs to be reviewed and submitted to ompi.
>>>
>>> Error flow:
>>> 1. The openib component creates a BTL instance for every active port (including Ethernet).
>>> 2. Every BTL holds a reference count on the device (mca_btl_openib_device_t::btls).
>>> 3. openib tries to create a "connection module" (CPC) for every BTL.
>>> 4. It fails to create a connection module for the Ethernet port.
>>> 5. The BTL for the Ethernet port is not returned by the openib component in the list of BTL modules.
>>> 6. The BTL for the Ethernet port is not destroyed during openib component finalize.
>>> 7. The device is not destroyed, because of the reference count.
>>> 8. The memory pool created by the device is not destroyed.
>>> 9. Later, the rdma mpool module cleans up remaining pools during its finalize.
>>> 10. The memory pool created by openib is destroyed by rdma mpool component finalize.
>>> 11. The memory pool points to a function (openib_dereg_mr) that has already been unloaded from memory (because mca_btl_openib.so was unloaded).
>>> 12. Segfault because of a call to an invalid function.
>>>
>>> The fix: if a BTL module is not going to be returned from openib component init, destroy it.
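For illustration, a minimal sketch of that fix idea (this is not the actual Mellanox patch; init_btl_modules(), cpc_available_for(), and openib_btl_release() are hypothetical placeholders for the real openib internals):

    /* During openib component init, any BTL module that will not be
     * returned in the module list must be torn down immediately, so it
     * drops its reference on the device and the device's mpool can be
     * destroyed while openib_dereg_mr() is still mapped. */
    static int init_btl_modules(mca_btl_openib_module_t **btls, int num_btls,
                                mca_btl_base_module_t **out, int *num_out)
    {
        int n = 0;
        for (int i = 0; i < num_btls; ++i) {
            if (cpc_available_for(btls[i])) {    /* hypothetical CPC check */
                out[n++] = (mca_btl_base_module_t *) btls[i];
            } else {
                /* No usable CPC (e.g., an Ethernet port): destroy the BTL
                 * now instead of leaking it and its device reference. */
                openib_btl_release(btls[i]);     /* hypothetical teardown */
            }
        }
        *num_out = n;
        return OMPI_SUCCESS;
    }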
>>> -----Original Message-----
>>> From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On Behalf Of Ralph Castain
>>> Sent: Monday, January 28, 2013 8:35 PM
>>> To: Steve Wise
>>> Cc: Open MPI Developers
>>> Subject: Re: [OMPI devel] openib unloaded before last mem dereg
>>>
>>> Out of curiosity, could you tell us how you configured OMPI?
>>>
>>> On Jan 28, 2013, at 12:46 PM, Steve Wise <sw...@opengridcomputing.com> wrote:
>>>
>>>> On 1/28/2013 2:04 PM, Ralph Castain wrote:
>>>>> On Jan 28, 2013, at 11:55 AM, Steve Wise <sw...@opengridcomputing.com> wrote:
>>>>>
>>>>>> Do you know if the rdmacm CPC is really being used for your connection setup (vs. other CPCs supported by IB)? iWARP only supports rdmacm; maybe that's the difference?
>>>>> Dunno for certain, but I expect it is using the OOB cm since I didn't direct it to do anything different. Like I said, I suspect the problem is that the cluster doesn't have iWARP on it.
>>>> Definitely, or it could be that the different CPC used for iWARP vs. IB is tickling the issue.
>>>>
>>>>>> Steve.
>>>>>>
>>>>>> On 1/28/2013 1:47 PM, Ralph Castain wrote:
>>>>>>> Nope - still works just fine. I didn't receive that warning at all, and it ran to completion without problem.
>>>>>>>
>>>>>>> I suspect the problem is that the system I can use just isn't configured like yours, and so I can't trigger the problem. Afraid I can't be of help after all... :-(
>>>>>>>
>>>>>>> On Jan 28, 2013, at 11:25 AM, Steve Wise <sw...@opengridcomputing.com> wrote:
>>>>>>>
>>>>>>>> On 1/28/2013 12:48 PM, Ralph Castain wrote:
>>>>>>>>> Hmmm...afraid I cannot replicate this using the current state of the 1.6 branch (which is the 1.6.4rcN) on the only IB-based cluster I can access.
>>>>>>>>>
>>>>>>>>> Can you try it with a 1.6.4 tarball and see if you still see the problem? Could be someone already fixed it.
>>>>>>>> I still hit it on 1.6.4rc2.
>>>>>>>>
>>>>>>>> Note iWARP != IB, so you may not have this issue on IB systems for various reasons. Did you use the same mpirun line? Namely, using this:
>>>>>>>>
>>>>>>>> --mca btl_openib_ipaddr_include "192.168.170.0/24"
>>>>>>>>
>>>>>>>> (adjusted to your network config). Because if I don't use ipaddr_include, then I don't see this issue on my setup.
>>>>>>>>
>>>>>>>> Also, did you see these logged?
>>>>>>>>
>>>>>>>> Right after starting the job:
>>>>>>>>
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> No OpenFabrics connection schemes reported that they were able to be used on a specific port. As such, the openib BTL (OpenFabrics support) will be disabled for this port.
>>>>>>>>
>>>>>>>> Local host: hpc-hn1.ogc.int
>>>>>>>> Local device: cxgb4_0
>>>>>>>> Local port: 2
>>>>>>>> CPCs attempted: oob, xoob, rdmacm
>>>>>>>> --------------------------------------------------------------------------
>>>>>>>> ...
>>>>>>>>
>>>>>>>> At the end of the job:
>>>>>>>>
>>>>>>>> [hpc-hn1.ogc.int:07850] 5 more processes have sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
>>>>>>>>
>>>>>>>> I think these are benign, but they probably indicate a bug: the mpirun is restricting the job to use port 1 only, so the CPCs shouldn't be attempting port 2...
>>>>>>>>
>>>>>>>> Steve.
>>>>>>>>
>>>>>>>>> On Jan 28, 2013, at 10:03 AM, Steve Wise <sw...@opengridcomputing.com> wrote:
>>>>>>>>>
>>>>>>>>>> On 1/28/2013 11:48 AM, Ralph Castain wrote:
>>>>>>>>>>> On Jan 28, 2013, at 9:12 AM, Steve Wise <sw...@opengridcomputing.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On 1/25/2013 12:19 PM, Steve Wise wrote:
>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm tracking an issue I see in openmpi-1.6.3. Running this command on my Chelsio iWARP/RDMA setup causes a seg fault every time:
>>>>>>>>>>>>>
>>>>>>>>>>>>> /usr/mpi/gcc/openmpi-1.6.3-dbg/bin/mpirun --np 2 --host hpc-hn1,hpc-cn2 --mca btl openib,sm,self --mca btl_openib_ipaddr_include "192.168.170.0/24" /usr/mpi/gcc/openmpi-1.6.3/tests/IMB-3.2/IMB-MPI1 pingpong
>>>>>>>>>>>>>
>>>>>>>>>>>>> The segfault is during finalization, and I've debugged this to the point where I see a call to dereg_mem() after the openib BTL is unloaded via dlclose(). dereg_mem() dereferences a function pointer to call the BTL-specific dereg function, in this case openib_dereg_mr(). However, since that BTL has already been unloaded, the deref causes a seg fault. Happens every time with the above mpi job.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Now, I tried this same experiment with openmpi-1.7rc6 and I don't see the seg fault, and I don't see a call to dereg_mem() after the openib BTL is unloaded. That's all well and good. :) But I'd like to get this fix pushed into 1.6, since that is the current stable release.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Question: can someone point me to the fix in 1.7?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Steve.
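The crash pattern Steve describes is a dangling function pointer after dlclose(). A self-contained sketch of the mechanism (generic C, not OMPI code; "./libplugin.so" and "plugin_dereg" are hypothetical stand-ins for mca_btl_openib.so and openib_dereg_mr):

    #include <dlfcn.h>
    #include <stdio.h>

    typedef int (*dereg_fn_t)(void *reg);

    /* A registration record caches a function pointer into a plugin's
     * text segment, mirroring how the mpool keeps a dereg callback. */
    struct registration {
        dereg_fn_t dereg;
        void      *handle;
    };

    int main(void)
    {
        struct registration reg;

        reg.handle = dlopen("./libplugin.so", RTLD_NOW);
        if (!reg.handle) { fprintf(stderr, "%s\n", dlerror()); return 1; }
        reg.dereg = (dereg_fn_t) dlsym(reg.handle, "plugin_dereg");

        dlclose(reg.handle);  /* like mca_pml_base_close() unloading openib */
        reg.dereg(NULL);      /* like mca_mpool_base_close() -> dereg_mem():
                               * the callback's code is unmapped -> SIGSEGV */
        return 0;
    }

The 1.6 segfault matches this pattern: the memory pool outlives the dlclose() of the component that provided its dereg callback.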
>>>>>>>>>>>> It appears that in ompi_mpi_finalize(), mca_pml_base_close() is called, which unloads the openib BTL. Then further down in ompi_mpi_finalize(), mca_mpool_base_close() is called, which ends up calling dereg_mem(), which seg faults trying to call into the unloaded openib BTL.
>>>>>>>>>>>>
>>>>>>>>>>> That definitely sounds like a bug.
>>>>>>>>>>>
>>>>>>>>>>>> Anybody have thoughts? Anybody care? :)
>>>>>>>>>>> I care! It needs to be fixed - I'll take a look. Probably something that forgot to be CMR'd.
>>>>>>>>>> Great! If you want me to try out a fix or gather more debug, just holler.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Steve.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/