Re: [OMPI devel] [1.8.2rc3] another openib bug (#4377)

2014-08-03 Thread Paul Hargrove
On Sun, Aug 3, 2014 at 12:49 PM, Paul Hargrove wrote:

> BTW:
> Even with the "ignore_device=1" problem fixed, I can't get btl:openib
> running on x86.
> So, there may be additional reports in the next few hours.
>

That turned out to be the already-known issue in 1.8.2rc3 that has since
been fixed.
So, with manual application of r32395 plus the patch for ticket #4377, I can
run btl:openib on x86+tavor.

-Paul


-- 
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


[OMPI devel] [1.8.2rc3] another openib bug (#4377)

2014-08-03 Thread Paul Hargrove
I have a pair of x86/linux (32 bit) hosts connected by Mellanox Tavor HCAs.
I have no idea if (or why) this has only appeared on this system, but I
find that btl:openib thinks the INI file says to ignore these HCAs.  See
the 4th line below:


[pcp-j-5][[27705,1],0][/home/pcp1/phargrov/OMPI/openmpi-1.8.2rc3-linux-x86-mx/openmpi-1.8.2rc3/ompi/mca/btl/openib/btl_openib_ip.c:364:add_rdma_addr]
Adding addr 172.18.0.105 (0x690012ac) subnet 0xac12 as mthca0:1
[pcp-j-5][[27705,1],0][/home/pcp1/phargrov/OMPI/openmpi-1.8.2rc3-linux-x86-mx/openmpi-1.8.2rc3/ompi/mca/btl/openib/btl_openib_ini.c:170:ompi_btl_openib_ini_query]
Querying INI files for vendor 0x02c9, part ID 23108
[pcp-j-5][[27705,1],0][/home/pcp1/phargrov/OMPI/openmpi-1.8.2rc3-linux-x86-mx/openmpi-1.8.2rc3/ompi/mca/btl/openib/btl_openib_ini.c:189:ompi_btl_openib_ini_query]
Found corresponding INI values: Mellanox Tavor Infinihost
[pcp-j-5][[27705,1],0][/home/pcp1/phargrov/OMPI/openmpi-1.8.2rc3-linux-x86-mx/openmpi-1.8.2rc3/ompi/mca/btl/openib/btl_openib_component.c:1541:init_one_device]
device mthca0 skipped; ignore_device=1
[pcp-j-5][[27705,1],0][/home/pcp1/phargrov/OMPI/openmpi-1.8.2rc3-linux-x86-mx/openmpi-1.8.2rc3/ompi/mca/btl/openib/btl_openib_component.c:988:device_destruct]
Failed to release mpool
[pcp-j-5][[27705,1],0][/home/pcp1/phargrov/OMPI/openmpi-1.8.2rc3-linux-x86-mx/openmpi-1.8.2rc3/ompi/mca/btl/openib/btl_openib_component.c:1020:device_destruct]
Failed to destroy device resources
[pcp-j-5][[27705,1],0][/home/pcp1/phargrov/OMPI/openmpi-1.8.2rc3-linux-x86-mx/openmpi-1.8.2rc3/ompi/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:1981:rdmacm_component_finalize]
rdmacm_component_finalize
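
For anyone reading the first log line above, here is a minimal Python sketch of how the two hex values printed by add_rdma_addr relate to the dotted address. The helper name decode_logged_addr is hypothetical, and treating the "subnet" as the first two octets is only an assumption that happens to match the values in this log; it is not taken from btl_openib_ip.c.

```python
import ipaddress
import struct


def decode_logged_addr(dotted):
    """Return (addr_as_le_u32, first_two_octets) for a dotted IPv4 string.

    Hypothetical helper for reading the log above; not Open MPI code.
    """
    # Four address bytes in network (big-endian) order: ac 12 00 69
    packed = ipaddress.IPv4Address(dotted).packed
    # A little-endian host (x86) that loads those bytes as a 32-bit
    # integer sees them reversed, which matches the logged 0x690012ac.
    (as_le_u32,) = struct.unpack("<I", packed)
    # Assumption: the logged "subnet" is the first two octets (172.18).
    subnet16 = (packed[0] << 8) | packed[1]
    return as_le_u32, subnet16


addr_hex, subnet_hex = decode_logged_addr("172.18.0.105")
print(hex(addr_hex), hex(subnet_hex))  # prints: 0x690012ac 0xac12
```

So the address line itself looks sane on this 32-bit x86 host; the problem is in the later ignore_device=1 line.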

It turns out this is a known problem, entered as trac ticket #4377 and
currently assigned to miked.
Applying the 2-line patch attached to the ticket fixes the ignore_device=1
problem for me.

Mike,
Please apply that patch to trunk and CMR for 1.8.2.

BTW:
Even with the "ignore_device=1" problem fixed, I can't get btl:openib
running on x86.
So, there may be additional reports in the next few hours.

-Paul
