Out of curiosity, could you tell us how you configured OMPI?
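
(If it's easier than digging up the original command line, something like

   /usr/mpi/gcc/openmpi-1.6.3-dbg/bin/ompi_info -c

from that install should show the build and compiler settings, and the configure
invocation itself is recorded near the top of config.log in the build tree.)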

On Jan 28, 2013, at 12:46 PM, Steve Wise <sw...@opengridcomputing.com> wrote:

> On 1/28/2013 2:04 PM, Ralph Castain wrote:
>> On Jan 28, 2013, at 11:55 AM, Steve Wise <sw...@opengridcomputing.com> wrote:
>> 
>>> Do you know if the rdmacm CPC is really being used for your connection
>>> setup (vs. the other CPCs supported by IB)?  Because iWARP only supports
>>> rdmacm, maybe that's the difference?
>> Dunno for certain, but I expect it is using the OOB cm since I didn't direct 
>> it to do anything different. Like I said, I suspect the problem is that the 
>> cluster doesn't have iWARP on it.
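>> 
>> (A quick way to check what a cluster actually has would be something like
>> "ibv_devinfo | grep -i transport" on a node; it prints the transport type,
>> iWARP vs. InfiniBand, for each RDMA device.)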
> 
> Definitely, or it could be that the different CPC used for iWARP vs. IB is
> tickling the issue.
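> 
> (For what it's worth, one way to find out would be to force the rdmacm CPC
> explicitly and crank up the BTL verbosity; that may show which CPC each port
> ends up selecting.  Untested sketch, adjust hosts and the test program to
> your setup:
> 
>    mpirun --np 2 --host hpc-hn1,hpc-cn2 --mca btl openib,sm,self \
>           --mca btl_openib_cpc_include rdmacm \
>           --mca btl_base_verbose 100 <your test program>
> 
> If forcing rdmacm on the IB cluster reproduces the crash, that would point at
> the CPC rather than at iWARP itself.)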
> 
>>> Steve.
>>> 
>>> On 1/28/2013 1:47 PM, Ralph Castain wrote:
>>>> Nope - still works just fine. I didn't receive that warning at all, and it 
>>>> ran to completion without a problem.
>>>> 
>>>> I suspect the problem is that the system I can use just isn't configured 
>>>> like yours, and so I can't trigger the problem. Afraid I can't be of help 
>>>> after all... :-(
>>>> 
>>>> 
>>>> On Jan 28, 2013, at 11:25 AM, Steve Wise <sw...@opengridcomputing.com> 
>>>> wrote:
>>>> 
>>>>> On 1/28/2013 12:48 PM, Ralph Castain wrote:
>>>>>> Hmmm...afraid I cannot replicate this using the current state of the 1.6 
>>>>>> branch (which is the 1.6.4rcN) on the only IB-based cluster I can access.
>>>>>> 
>>>>>> Can you try it with a 1.6.4 tarball and see if you still see the 
>>>>>> problem? Could be someone already fixed it.
>>>>> I still hit it on 1.6.4rc2.
>>>>> 
>>>>> Note that iWARP != IB, so you may not hit this issue on IB systems for
>>>>> various reasons.  Did you use the same mpirun line?  Namely, did it include this:
>>>>> 
>>>>> --mca btl_openib_ipaddr_include "192.168.170.0/24"
>>>>> 
>>>>> (adjusted to your network config).
>>>>> 
>>>>> Because if I don't use ipaddr_include, then I don't see this issue on my 
>>>>> setup.
>>>>> 
>>>>> Also, did you see these logged:
>>>>> 
>>>>> Right after starting the job:
>>>>> 
>>>>> --------------------------------------------------------------------------
>>>>> No OpenFabrics connection schemes reported that they were able to be
>>>>> used on a specific port.  As such, the openib BTL (OpenFabrics
>>>>> support) will be disabled for this port.
>>>>> 
>>>>>  Local host:           hpc-hn1.ogc.int
>>>>>  Local device:         cxgb4_0
>>>>>  Local port:           2
>>>>>  CPCs attempted:       oob, xoob, rdmacm
>>>>> --------------------------------------------------------------------------
>>>>> ...
>>>>> 
>>>>> At the end of the job:
>>>>> 
>>>>> [hpc-hn1.ogc.int:07850] 5 more processes have sent help message 
>>>>> help-mpi-btl-openib-cpc-base.txt / no cpcs for port
>>>>> 
>>>>> 
>>>>> I think these are benign, but they probably indicate a bug: the mpirun
>>>>> line is restricting the job to use port 1 only, so the CPCs shouldn't be
>>>>> attempting port 2...
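>>>>> 
>>>>> (As a possible way to quiet that warning, not a fix for the underlying
>>>>> bug, I'd guess that also restricting the device/port explicitly would
>>>>> work, e.g. adding something like
>>>>> 
>>>>>    --mca btl_openib_if_include cxgb4_0:1
>>>>> 
>>>>> to the mpirun line.  Just a guess, I haven't tried it.)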
>>>>> 
>>>>> Steve.
>>>>> 
>>>>> 
>>>>>> On Jan 28, 2013, at 10:03 AM, Steve Wise <sw...@opengridcomputing.com> 
>>>>>> wrote:
>>>>>> 
>>>>>>> On 1/28/2013 11:48 AM, Ralph Castain wrote:
>>>>>>>> On Jan 28, 2013, at 9:12 AM, Steve Wise <sw...@opengridcomputing.com> 
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> On 1/25/2013 12:19 PM, Steve Wise wrote:
>>>>>>>>>> Hello,
>>>>>>>>>> 
>>>>>>>>>> I'm tracking an issue I see in openmpi-1.6.3.  Running this command
>>>>>>>>>> on my Chelsio iWARP/RDMA setup causes a seg fault every time:
>>>>>>>>>> 
>>>>>>>>>> /usr/mpi/gcc/openmpi-1.6.3-dbg/bin/mpirun --np 2 --host 
>>>>>>>>>> hpc-hn1,hpc-cn2 --mca btl openib,sm,self --mca 
>>>>>>>>>> btl_openib_ipaddr_include "192.168.170.0/24" 
>>>>>>>>>> /usr/mpi/gcc/openmpi-1.6.3/tests/IMB-3.2/IMB-MPI1 pingpong
>>>>>>>>>> 
>>>>>>>>>> The segfault is during finalization, and I've debugged this to the
>>>>>>>>>> point where I see a call to dereg_mem() after the openib btl is
>>>>>>>>>> unloaded via dlclose().  dereg_mem() dereferences a function pointer
>>>>>>>>>> to call the btl-specific deregistration function, in this case
>>>>>>>>>> openib_dereg_mr().  However, since that btl has already been
>>>>>>>>>> unloaded, the dereference causes a seg fault.  Happens every time
>>>>>>>>>> with the above MPI job.
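>>>>>>>>>> 
>>>>>>>>>> To make the failure mode concrete, here's a tiny standalone sketch
>>>>>>>>>> (not OMPI code; libm and cos() just stand in for the openib BTL .so
>>>>>>>>>> and openib_dereg_mr()) of calling through a function pointer after
>>>>>>>>>> the library behind it has been dlclose()d:
>>>>>>>>>> 
>>>>>>>>>> /* build: gcc stale_ptr.c -o stale_ptr -ldl */
>>>>>>>>>> #include <dlfcn.h>
>>>>>>>>>> #include <stdio.h>
>>>>>>>>>> 
>>>>>>>>>> int main(void)
>>>>>>>>>> {
>>>>>>>>>>     void *handle = dlopen("libm.so.6", RTLD_NOW);
>>>>>>>>>>     if (!handle) { fprintf(stderr, "%s\n", dlerror()); return 1; }
>>>>>>>>>> 
>>>>>>>>>>     /* grab a function pointer out of the freshly loaded library */
>>>>>>>>>>     double (*fn)(double) = (double (*)(double)) dlsym(handle, "cos");
>>>>>>>>>>     if (!fn) { fprintf(stderr, "%s\n", dlerror()); return 1; }
>>>>>>>>>> 
>>>>>>>>>>     printf("before dlclose: cos(0) = %f\n", fn(0.0)); /* works */
>>>>>>>>>> 
>>>>>>>>>>     dlclose(handle);  /* refcount hits zero, library is unmapped,
>>>>>>>>>>                          so fn is now a stale pointer */
>>>>>>>>>> 
>>>>>>>>>>     printf("after dlclose: cos(0) = %f\n", fn(0.0));  /* SIGSEGV here */
>>>>>>>>>>     return 0;
>>>>>>>>>> }
>>>>>>>>>> 
>>>>>>>>>> (Whether that last call actually faults depends on the library really
>>>>>>>>>> being unmapped, which it is for a .so that nothing else references,
>>>>>>>>>> the same situation as the dlclose()d openib btl here.)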
>>>>>>>>>> 
>>>>>>>>>> Now, I tried this same experiment with openmpi-1.7rc6 and I don't
>>>>>>>>>> see the seg fault, and I don't see a call to dereg_mem() after the
>>>>>>>>>> openib btl is unloaded.  That's all well and good. :)  But I'd like
>>>>>>>>>> to get this fix pushed into 1.6 since that is the current stable
>>>>>>>>>> release.
>>>>>>>>>> 
>>>>>>>>>> Question:  Can someone point me to the fix in 1.7?
>>>>>>>>>> 
>>>>>>>>>> Thanks,
>>>>>>>>>> 
>>>>>>>>>> Steve.
>>>>>>>>> It appears that in ompi_mpi_finalize(), mca_pml_base_close() is
>>>>>>>>> called, which unloads the openib btl.  Then, further down in
>>>>>>>>> ompi_mpi_finalize(), mca_mpool_base_close() is called, which ends up
>>>>>>>>> calling dereg_mem(), which seg faults trying to call into the
>>>>>>>>> unloaded openib btl.
>>>>>>>>> 
>>>>>>>> That definitely sounds like a bug.
>>>>>>>> 
>>>>>>>>> Anybody have thoughts?  Anybody care? :)
>>>>>>>> I care! It needs to be fixed - I'll take a look. Probably something 
>>>>>>>> that somebody forgot to cmr over.
>>>>>>> Great!  If you want me to try out a fix or gather more debug info, just 
>>>>>>> holler.
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> 
>>>>>>> Steve.
>>>>>>> 
> 

