Yevgeny Kliteynik wrote:
Hi Nicolas,

Nicolas Morey Chaisemartin wrote:
Hello,

While doing some routing analysis on fat tree using ibsim we found a "bug" in the fat-tree algorithm.
Problem happens with a 4 level Fat tree as below:


                         L3  L3
       ___________________|__|____________________
/ / \ \ <= All the L2 are connected on 2 L3 switches
   L2-1         L2-2            L2-1           L2-2
/ / \ \ <== The Nth L1 of a set leads only to the Nth L2 (L2-N). With some pruning.
  L1           L1                 L1             L1
  /|\         /|\                 /|\           /|\
==Fully mixed to L1== ==Fully mixed to L1== <=== We have multiple set. In each set, all L0 lead to all L1 of their set.

   L0           L0                 L0           L0
 /   \        /    \             /    \       /     \
CN    CN  .. CN    CN    ....   CN    CN  .. CN    CN


To detail:
We have a bunch of sets. Each set contains compute node, L0 and L1 switches.
Plus a common top of L2 and L3 switches.

In each set, there are groups of compute nodes. Each group is connected to a single L0 switch.
In a given set, all L0 are connected to all L1.

The Nth L1 of a set is connected to the Nth L2 and only to this one. (so through a L2, the Nth L1 can only see the Nth L1 of the other sets)
All the L2 are connected to a couple of L3.


If we dont put the L3. We have a perfectly equilibrated fat tree and well equilibrated routes. But when we add the L3, it introduce a huge difference. As it is not necessary, no route is going through L3 (which is fine). However 1/4 of L2->L1 routes is not used at all, 1/2 is half used and 1/4 is twice overused (compared to the equilibrate state).

This comes from the down_port_groups_idx which is incremented each time the algorithm goes down through a node whether it creates routes to HCA (port != switch) or not. As route coming up from a L1 reaches only one L2, the algorithm goes through all the other L2 while going down, incrementing their index. Our case here is a bit specific but in a case where your L1 doesn't have full connectivity to all your L2, and another switch rank above, the problem may appear.

To avoid this problem, I've changed the __osm_ftree_fabric_route_upgoing_by_going_down function so it returns a value to indicate if routes to HCA (in fact to leaf switch) were created. With this information, we only increase the index when the algorithm has created routes to HCA. After applying this patch and measuring the link usage, we are at perfect equilibrium (L2<->L3 links are still not used but that is to be expected).

Great! I've actually seen this problem on a real clusters, but
couldn't understand what's cusing the lack of equilibrity.

See couple of questions below.

Signed-off-by: Nicolas Morey-Chaisemartin <[email protected]>
---
 opensm/opensm/osm_ucast_ftree.c |   23 ++++++++++++++---------
 1 files changed, 14 insertions(+), 9 deletions(-)

------------------------------------------------------------------------

diff --git a/opensm/opensm/osm_ucast_ftree.c b/opensm/opensm/osm_ucast_ftree.c
index ebe6612..3474876 100644
--- a/opensm/opensm/osm_ucast_ftree.c
+++ b/opensm/opensm/osm_ucast_ftree.c
@@ -1914,7 +1914,7 @@ static void __osm_ftree_set_sw_fwd_table(IN cl_map_item_t * const p_map_item, * assign-up-going-port-by-descending-down to r-port node (recursion)
  */
-static void
+static int
__osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree,
                            IN ftree_sw_t * p_sw,
                            IN ftree_sw_t * p_prev_sw,
@@ -1932,21 +1932,23 @@ __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree,
     uint16_t i;
     uint16_t j;
     uint16_t k;
+    uint8_t created_route=0;
/* we shouldn't enter here if both real_lid and main_path are false */
     CL_ASSERT(is_real_lid || is_main_path);
/* if there is no down-going ports */
     if (p_sw->down_port_groups_num == 0)
-        return;
+        return 1;

Shouldn't it return 0?
Probably yes. I was thinking to the case where (taking notations from my scheme above) a L0 wouldn't have any CN (beaucse they are shutdown, broken, or for future extension). In this case, I think it'll smooth things a bit and not desequilibrate the network.
-    /* promote the index that indicates which group should we
-       start with when going through all the downgoing groups */
-    p_sw->down_port_groups_idx =
-        (p_sw->down_port_groups_idx + 1) % p_sw->down_port_groups_num;
+ /* If we are on a leaf switch we should be creating routes for real HCA */ + /* This flag will be returned so upper layers will incrementent shift index */
+    if(p_sw->is_leaf == TRUE){
+        created_route=1;
+    }

The "is_leaf" flag will be TRUE only on leaf switches that have CNs connected to them. If we want to solve the problem for all routes (CNs, IO nodes, management nodes),
the "created_route" flag should be updated elsewhere (see below).
I was not aware of this, so yes it should be done somewhere else (though it can also be done here).

     /* foreach down-going port group (in indexing order) */
-    i = p_sw->down_port_groups_idx;
+    i = (p_sw->down_port_groups_idx + 1) %  p_sw->down_port_groups_num;
     for (k = 0; k < p_sw->down_port_groups_num; k++) {

I think that since p_sw->down_port_groups_idx is promoted below,
there is no need to increase the starting value of i.

I tried to but I had some problem (segfault probably due to the %p_sw->down_port_groups_num).
If it works without incrementing, it's fine with me.
+    if(created_route)
+ p_sw->down_port_groups_idx = + (p_sw->down_port_groups_idx + 1) % p_sw->down_port_groups_num;
...
+    return created_route;
 }                /* __osm_ftree_fabric_route_upgoing_by_going_down() */
/***************************************************/

How about something like this:

@@ -1914,7 +1914,7 @@ static void __osm_ftree_set_sw_fwd_table(IN cl_map_item_t * const p_map_item, * assign-up-going-port-by-descending-down to r-port node (recursion)
  */

-static void
+static boolean_t
__osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree,
                            IN ftree_sw_t * p_sw,
                            IN ftree_sw_t * p_prev_sw,
@@ -1932,18 +1932,14 @@ __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree,
     uint16_t i;
     uint16_t j;
     uint16_t k;
+    boolean_t created_route = FALSE;

/* we shouldn't enter here if both real_lid and main_path are false */
     CL_ASSERT(is_real_lid || is_main_path);

     /* if there is no down-going ports */
     if (p_sw->down_port_groups_num == 0)
-        return;
-
-    /* promote the index that indicates which group should we
-       start with when going through all the downgoing groups */
-    p_sw->down_port_groups_idx =
-        (p_sw->down_port_groups_idx + 1) % p_sw->down_port_groups_num;
+        return FALSE;

     /* foreach down-going port group (in indexing order) */
     i = p_sw->down_port_groups_idx;
@@ -1952,9 +1948,12 @@ __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree,
         p_group = p_sw->down_port_groups[i];
         i = (i + 1) % p_sw->down_port_groups_num;

-        /* Skip this port group unless it points to a switch */
-        if (p_group->remote_node_type != IB_NODE_TYPE_SWITCH)
+        /* If this port group doesn't point to a switch, mark
+           that the route was created and skip to the next group */
+        if (p_group->remote_node_type != IB_NODE_TYPE_SWITCH) {
+            created_route = TRUE;
             continue;
+        }

         if (p_prev_sw
             && (p_group->remote_base_lid == p_prev_sw->base_lid)) {
@@ -2073,16 +2072,25 @@ __osm_ftree_fabric_route_upgoing_by_going_down(IN ftree_fabric_t * p_ftree,

         /* Recursion step:
Assign upgoing ports by stepping down, starting on REMOTE switch */ - __osm_ftree_fabric_route_upgoing_by_going_down(p_ftree, p_remote_sw, /* remote switch - used as a route-upgoing alg. start point */ - NULL, /* prev. position - NULL to mark that we went down and not up */ - target_lid, /* LID that we're routing to */ - target_rank, /* rank of the LID that we're routing to */ - is_real_lid, /* whether the target LID is real or dummy */ - is_main_path, /* whether this is path to HCA that should by tracked by counters */ - highest_rank_in_route); /* highest visited point in the tree before going down */ + created_route |= __osm_ftree_fabric_route_upgoing_by_going_down(p_ftree, + p_remote_sw, /* remote switch - used as a route-upgoing alg. start point */ + NULL, /* prev. position - NULL to mark that we went down and not up */
+            target_lid,    /* LID that we're routing to */
+            target_rank,    /* rank of the LID that we're routing to */
+ is_real_lid, /* whether the target LID is real or dummy */ + is_main_path, /* whether this is path to HCA that should by tracked by counters */ + highest_rank_in_route); /* highest visited point in the tree before going down */
     }
     /* done scanning all the down-going port groups */

+    /* if the route was created, promote the index that
+       indicates which group should we start with when
+       going through all the downgoing groups */
+    if (created_route)
+        p_sw->down_port_groups_idx =
+ (p_sw->down_port_groups_idx + 1) % p_sw->down_port_groups_num;
+
+    return created_route;
 }                /* __osm_ftree_fabric_route_upgoing_by_going_down() */

 /***************************************************/







That seems good.
I'm going to think a bit more about the case where there are no downports.

Best regards

Nicolas

_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

Reply via email to