Nope - still works just fine. I didn't receive that warning at all, and it ran 
to completion without a problem.

I suspect the system I have access to just isn't configured like yours, and so 
I can't trigger the problem. Afraid I can't be of help after all... :-(


On Jan 28, 2013, at 11:25 AM, Steve Wise <sw...@opengridcomputing.com> wrote:

> On 1/28/2013 12:48 PM, Ralph Castain wrote:
>> Hmmm...afraid I cannot replicate this using the current state of the 1.6 
>> branch (which is the 1.6.4rcN) on the only IB-based cluster I can access.
>> 
>> Can you try it with a 1.6.4 tarball and see if you still see the problem? 
>> Could be someone already fixed it.
> 
> I still hit it on 1.6.4rc2.
> 
> Note iWARP != IB, so you may not hit this issue on IB systems for various 
> reasons.  Did you use the same mpirun line? In particular, did it include this:
> 
> --mca btl_openib_ipaddr_include "192.168.170.0/24"
> 
> (adjusted to your network config).
> 
> Because if I don't use ipaddr_include, then I don't see this issue on my 
> setup.
> 
> Also, did you see these logged:
> 
> Right after starting the job:
> 
> --------------------------------------------------------------------------
> No OpenFabrics connection schemes reported that they were able to be
> used on a specific port.  As such, the openib BTL (OpenFabrics
> support) will be disabled for this port.
> 
>  Local host:           hpc-hn1.ogc.int
>  Local device:         cxgb4_0
>  Local port:           2
>  CPCs attempted:       oob, xoob, rdmacm
> --------------------------------------------------------------------------
> ...
> 
> At the end of the job:
> 
> [hpc-hn1.ogc.int:07850] 5 more processes have sent help message 
> help-mpi-btl-openib-cpc-base.txt / no cpcs for port
> 
> 
> I think these are benign, but they probably indicate a bug: the mpirun 
> command line restricts the job to port 1 only, so the CPCs shouldn't be 
> attempting port 2...
> 
> Steve.
> 
> 
>> 
>> On Jan 28, 2013, at 10:03 AM, Steve Wise <sw...@opengridcomputing.com> wrote:
>> 
>>> On 1/28/2013 11:48 AM, Ralph Castain wrote:
>>>> On Jan 28, 2013, at 9:12 AM, Steve Wise <sw...@opengridcomputing.com> 
>>>> wrote:
>>>> 
>>>>> On 1/25/2013 12:19 PM, Steve Wise wrote:
>>>>>> Hello,
>>>>>> 
>>>>>> I'm tracking an issue I see in openmpi-1.6.3.  Running this command on 
>>>>>> my Chelsio iWARP/RDMA setup causes a seg fault every time:
>>>>>> 
>>>>>> /usr/mpi/gcc/openmpi-1.6.3-dbg/bin/mpirun --np 2 --host hpc-hn1,hpc-cn2 
>>>>>> --mca btl openib,sm,self --mca btl_openib_ipaddr_include 
>>>>>> "192.168.170.0/24" /usr/mpi/gcc/openmpi-1.6.3/tests/IMB-3.2/IMB-MPI1 
>>>>>> pingpong
>>>>>> 
>>>>>> The segfault is during finalization, and I've debugged this to the point 
>>>>>> where I see a call to dereg_mem() after the openib btl is unloaded via 
>>>>>> dlclose().  dereg_mem() dereferences a function pointer to call the 
>>>>>> btl-specific dereg function, in this case it is openib_dereg_mr().  
>>>>>> However, since that btl has already been unloaded, the deref causes a 
>>>>>> seg fault.  Happens every time with the above mpi job.
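>>>>>> 
>>>>>> (For anyone following along: the failure pattern is the classic
>>>>>> stale-pointer-into-a-dlclose()'d-library crash.  Below is a minimal
>>>>>> standalone C sketch of that pattern, not OMPI code; libplugin.so and
>>>>>> plugin_dereg are hypothetical names.  Build with -ldl.)
>>>>>> 
>>>>>>   #include <dlfcn.h>
>>>>>>   #include <stdio.h>
>>>>>> 
>>>>>>   /* Function-pointer type for a plugin-provided "deregister" hook,
>>>>>>    * analogous to the btl-specific dereg function openib_dereg_mr(). */
>>>>>>   typedef int (*dereg_fn_t)(void *);
>>>>>> 
>>>>>>   int main(void)
>>>>>>   {
>>>>>>       /* Stand-in for loading the openib btl component. */
>>>>>>       void *plugin = dlopen("./libplugin.so", RTLD_NOW);
>>>>>>       if (!plugin) { fprintf(stderr, "%s\n", dlerror()); return 1; }
>>>>>> 
>>>>>>       /* Cache a function pointer into the plugin, the way the mpool
>>>>>>        * keeps hold of the btl's dereg function. */
>>>>>>       dereg_fn_t dereg = (dereg_fn_t) dlsym(plugin, "plugin_dereg");
>>>>>> 
>>>>>>       /* Unload the plugin, as happens to the btl during finalize. */
>>>>>>       dlclose(plugin);
>>>>>> 
>>>>>>       /* The cached pointer now (typically) targets unmapped text, so
>>>>>>        * calling it seg faults, the same shape as dereg_mem() calling
>>>>>>        * openib_dereg_mr() after the btl is gone. */
>>>>>>       return dereg(NULL);
>>>>>>   }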
>>>>>> 
>>>>>> Now, I tried this same experiment with openmpi-1.7rc6 and I don't see 
>>>>>> the seg fault, and I don't see a call to dereg_mem() after the openib 
>>>>>> btl is unloaded.  That's all well and good. :)  But I'd like to get this fix 
>>>>>> pushed into 1.6 since that is the current stable release.
>>>>>> 
>>>>>> Question:  Can someone point me to the fix in 1.7?
>>>>>> 
>>>>>> Thanks,
>>>>>> 
>>>>>> Steve.
>>>>> It appears that ompi_mpi_finalize() calls mca_pml_base_close(), which 
>>>>> unloads the openib btl.  Then further down, ompi_mpi_finalize() calls 
>>>>> mca_mpool_base_close(), which ends up calling dereg_mem(), and that seg 
>>>>> faults trying to call into the now-unloaded openib btl.
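>>>>> 
>>>>> Roughly, the ordering I'm seeing looks like this (a paraphrased sketch of
>>>>> the call sequence, not the actual 1.6 source):
>>>>> 
>>>>>   ompi_mpi_finalize()
>>>>>     -> mca_pml_base_close()      /* closes the btl components, which
>>>>>                                     dlclose()s the openib btl */
>>>>>     -> mca_mpool_base_close()
>>>>>          -> dereg_mem()          /* dereferences the btl-specific dereg
>>>>>                                     function pointer, which still points
>>>>>                                     at openib_dereg_mr() inside the
>>>>>                                     now-unmapped .so: seg fault */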
>>>>> 
>>>> That definitely sounds like a bug.
>>>> 
>>>>> Anybody have thoughts?  Anybody care? :)
>>>> I care! It needs to be fixed - I'll take a look. Probably a fix that 
>>>> someone forgot to cmr over.
>>> Great!  If you want me to try out a fix or gather more debug info, just holler.
>>> 
>>> Thanks,
>>> 
>>> Steve.
>>> 
> 

