I believe I found a bug in openib BTL and just want to see if folks agree with
this. When we are running on a NUMA node and we are bound to a CPU, we only
want to use the IB device that is closest to us. However, I observed that we
always used both devices regardless. I believe there is a bug in computing the
distances, and the change below fixes it. The bug was introduced in r26391,
when we switched to using hwloc to determine distances. It is a simple
indexing error: the code multiplies the two terms, i*(j*size), where it is
supposed to add them and access the array with i+j*size.
With this change, we will only use the IB devices that are close to us.
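
To illustrate the indexing, here is a minimal standalone sketch. It is not the
openib code: flat_get, buggy_index and max_distance are hypothetical names, and
the matrix is assumed to be a plain float array of size*size entries rather
than the hwloc structure.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: read entry (i, j) of an n x n matrix stored as a
   one-dimensional array, using the i + j*n form from the patch. */
static float flat_get(const float *m, unsigned n, unsigned i, unsigned j)
{
    assert(i < n && j < n);
    return m[i + (size_t)j * n];
}

/* Index produced by the pre-patch expression i * (j * n). It collapses
   to 0 whenever either i or j is 0, so distinct pairs alias one slot. */
static size_t buggy_index(unsigned n, unsigned i, unsigned j)
{
    return (size_t)i * ((size_t)j * n);
}

/* The distance may be asymmetrical, so read both directions and keep
   the larger value, as the patched code does. */
static float max_distance(const float *m, unsigned n, unsigned i, unsigned j)
{
    float a = flat_get(m, n, i, j);
    float b = flat_get(m, n, j, i);
    return (a > b) ? a : b;
}
```

With size = 2, the pairs (0,1) and (1,0) both map to slot 0 under the old
expression, and (1,1) maps to slot 2 instead of slot 3, so the lookups do not
return the entries they were meant to.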
Any comments? Otherwise, I will commit.
Rolf
Index: ompi/mca/btl/openib/btl_openib_component.c
===================================================================
--- ompi/mca/btl/openib/btl_openib_component.c (revision 30175)
+++ ompi/mca/btl/openib/btl_openib_component.c (working copy)
@@ -2202,10 +2202,10 @@
if (NULL != my_obj) {
/* Distance may be asymetrical, so calculate both of them
and take the max */
- a = hwloc_distances->latency[my_obj->logical_index *
+ a = hwloc_distances->latency[my_obj->logical_index +
(ibv_obj->logical_index *
hwloc_distances->nbobjs)];
- b = hwloc_distances->latency[ibv_obj->logical_index *
+ b = hwloc_distances->latency[ibv_obj->logical_index +
(my_obj->logical_index *
hwloc_distances->nbobjs)];
distance = (a > b) ? a : b;
@@ -2224,10 +2224,10 @@
ibv_obj->cpuset,
HWLOC_OBJ_NODE,
++i)) {
- a = hwloc_distances->latency[node_obj->logical_index *
+ a = hwloc_distances->latency[node_obj->logical_index +
(ibv_obj->logical_index *
hwloc_distances->nbobjs)];
- b = hwloc_distances->latency[ibv_obj->logical_index *
+ b = hwloc_distances->latency[ibv_obj->logical_index +
(node_obj->logical_index *
hwloc_distances->nbobjs)];
a = (a > b) ? a : b;