Hey Sasha,

While doing some other development, I noticed that some switch ports were
not used in routing even though they were up/healthy.  I wrote a script
(will try to submit to infiniband-diags when I clean it up) that analyzes
dump_lfts to see what ports are used in routing.  Here's an output chunk:

Unbalanced Switch Port Usage: MT47396 Infiniscale-III Mellanox
Technologies, 0x000b8cffff004662, 40
Port 013: 12
Port 014: 12
Port 015: 12
Port 016: 12
Port 017: 12
Port 018: 12
Port 019: 12
Port 020: 12
Port 021: 12
Port 022: 0
Port 023: 11
Port 024: 11

In the above example, Port 022 is not used for routing at all on this
switch.  Naturally, we think this is bad.

After some investigation, I found out that after the initial heavy sweep
is done, some of the ports on some switches are down (I assume hardware
racing during bringup), and thus opensm does not route through those
ports.  When opensm does a heavy resweep later on (I assume b/c some traps
are received when those down ports come up), opensm keeps the same old
forwarding tables from before b/c ignore_existing_lfts is FALSE and b/c
the least hops are the same (other ports on the switch go to the same
parent).  Thus, we get healthy ports not forwarding to a parent switch.

There are multiple ways to deal with this.  I made the attached patch
which solved the problem on one of our test clusters.  It's pretty simple.
 Store all of the "bad ports" that were found during a switch
configuration.  During the next heavy resweep, if some of those "bad
ports" are now up, I set ignore_existing_lfts to TRUE for just that
switch, leading to a completely new forwarding table of the switch.

During my performance testing on this patch, performance with a few
mpibench tests is actually worse by a few percent with this patch.  I am
only using 120 of 144 nodes on this cluster.  It's not a big cluster, has
two levels worth of switches (24 port switches going up to a 288 port
switch.  Yup, the cluster is not "filled out" yet :-).  So there is some
randomness on which specific nodes run the job and if the lid routing
layout is better/worse for that specific set of nodes.

Intuitively, we think this will be better as a whole even though my
current testing can't show it.  Can you think of anything that would make
this patch worse for performance as a whole?  Could you see some side
effect leading to a lot more traffic on the network?

Al

-- 
Albert Chu
[EMAIL PROTECTED]
925-422-5311
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory




Hey Sasha,

I'm working on some (unrelated to this patch) updn performance patches, when I noticed that sometimes switch ports are not used in routing.  I wrote a script (will try to submit to infiniband-diags when I clean it up) that analyzes dump_lfts to see what ports are used in routing.  Here's an output chunk:

Unbalanced Switch Port Usage: MT47396 Infiniscale-III Mellanox Technologies, 0x000b8cffff004662, 40
Port 013: 12
Port 014: 12
Port 015: 12
Port 016: 12
Port 017: 12
Port 018: 12
Port 019: 12
Port 020: 12
Port 021: 12
Port 022: 0
Port 023: 11
Port 024: 11

In the above, example, Port 022 is simply not used for routing at all on this switch.  Naturally, we think this is bad. 

After some investigation, I found out that sometimes after a heavy sweep is done, some of the ports on the switches are down/not-healthy (I assume hardware racing during bringup), and thus opensm does not route through those ports.  When opensm does a later heavy resweep (I assume b/c some traps are received when those ports come up), the fact that ignore_existing_lfts is not set causes opensm to still not use the "new port".  They keep the same old forwarding tables from before ignore_existing_lfts is FALSE and b/c the least hops are the same.

There are multiple ways to deal with this.  I made the attached patch which solved the problem on one of our test clusters.  It's pretty simple.  Store all of the "bad ports" that were found during a switch configuration.  During the next heavy resweep, if some of those "bad ports" are now up, I set ignore_existing_lfts to TRUE for just that switch and reconfigure the entire switch.

Now, I will admit, that during my performance testing on this patch, performance with a few MPI tests is actually is worse.  My tests are on a system with other users and I am only using 120 of 144 nodes on this cluster.  So there is some potential for slowdown outside of my control.
There is some randomness that I cannot control on which specific nodes run the job, and is the lid routing layout better/worse for that specific set of nodes (unless I run on all 144 nodes, I can't be 100% sure).

Intuitively, we think will be better as a whole.  Can you think of anything that would make this patch bad for performance?

Al

-- 
Albert Chu
[EMAIL PROTECTED]
925-422-5311
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory
>From 123a2709ee2d4befb02e1a785b690b8c72cf6d6b Mon Sep 17 00:00:00 2001
From: Albert L. Chu <[EMAIL PROTECTED](none)>
Date: Wed, 27 Feb 2008 15:40:58 -0800
Subject: [PATCH] Do not ignore existing lfts when new ports exist


Signed-off-by: Albert L. Chu <[EMAIL PROTECTED](none)>
---
 opensm/include/opensm/osm_switch.h |   11 +++++++++
 opensm/opensm/osm_switch.c         |   27 ++++++++++++++++++++++-
 opensm/opensm/osm_ucast_mgr.c      |   41 ++++++++++++++++++++++++++++++++---
 3 files changed, 74 insertions(+), 5 deletions(-)

diff --git a/opensm/include/opensm/osm_switch.h b/opensm/include/opensm/osm_switch.h
index e2fe86d..6682e6f 100644
--- a/opensm/include/opensm/osm_switch.h
+++ b/opensm/include/opensm/osm_switch.h
@@ -110,6 +110,8 @@ typedef struct _osm_switch {
 	osm_mcast_tbl_t mcast_tbl;
 	uint32_t discovery_count;
 	unsigned need_update;
+	uint8_t *last_routing_bad_ports;
+	unsigned int last_routing_bad_ports_count;
 	void *priv;
 } osm_switch_t;
 /*
@@ -154,6 +156,15 @@ typedef struct _osm_switch {
 *		When set indicates that switch was probably reset, so
 *		fwd tables and rest cached data should be flushed
 *
+*	last_routing_bad_ports
+*		Stores ports that were indicated as bad (port is
+*		down/unhealthy/has a bad remote side) during routing.
+*		Will be used to flag if a switch should ignore
+*		existing forwarding tables or not.
+*
+*	last_routing_bad_ports_count
+*		Number of bad ports stored in last_routing_bad_ports.
+*
 * SEE ALSO
 *	Switch object
 *********/
diff --git a/opensm/opensm/osm_switch.c b/opensm/opensm/osm_switch.c
index 23429c7..3761850 100644
--- a/opensm/opensm/osm_switch.c
+++ b/opensm/opensm/osm_switch.c
@@ -123,6 +123,13 @@ osm_switch_init(IN osm_switch_t * const p_sw,
 	for (port_num = 0; port_num < num_ports; port_num++)
 		osm_port_prof_construct(&p_sw->p_prof[port_num]);
 
+	p_sw->last_routing_bad_ports = malloc(sizeof(uint8_t) * num_ports);
+	if (p_sw->last_routing_bad_ports == NULL) {
+		status = IB_INSUFFICIENT_MEMORY;
+		goto Exit;
+	}
+	p_sw->last_routing_bad_ports_count = 0;
+
       Exit:
 	return (status);
 }
@@ -143,6 +150,8 @@ void osm_switch_delete(IN OUT osm_switch_t ** const pp_sw)
 				free(p_sw->hops[i]);
 		free(p_sw->hops);
 	}
+	if (p_sw->last_routing_bad_ports)
+		free(p_sw->last_routing_bad_ports);
 	free(*pp_sw);
 	*pp_sw = NULL;
 }
@@ -265,6 +274,8 @@ osm_switch_recommend_path(IN const osm_switch_t * const p_sw,
 	uint8_t best_port_other_sys = 0;
 	uint8_t best_port_other_node = 0;
 	boolean_t port_found = FALSE;
+	unsigned int *p_indx;
+	boolean_t found = FALSE;
 	osm_physp_t *p_physp;
 	osm_physp_t *p_rem_physp;
 	osm_node_t *p_rem_node;
@@ -363,8 +374,22 @@ osm_switch_recommend_path(IN const osm_switch_t * const p_sw,
 		       we require all - non sma ports to be linked
 		       to be routed through
 		     */
-		    !osm_physp_get_remote(p_physp))
+		    !osm_physp_get_remote(p_physp)) {
+			/* Use pointer to remove 'const' warnings */
+			p_indx = (unsigned int *)&(p_sw->last_routing_bad_ports_count);
+			found = FALSE;
+			for (i = 0; i < (*p_indx); i++) {
+				if (p_sw->last_routing_bad_ports[i] == port_num) {
+					found = TRUE;
+					break;
+				}
+			}
+			if (!found) {
+				p_sw->last_routing_bad_ports[*p_indx] = port_num;
+				(*p_indx)++;
+			}
 			continue;
+		}
 
 		/*
 		   We located a least-hop port, possibly one of many.
diff --git a/opensm/opensm/osm_ucast_mgr.c b/opensm/opensm/osm_ucast_mgr.c
index 88e29e9..e77e6bb 100644
--- a/opensm/opensm/osm_ucast_mgr.c
+++ b/opensm/opensm/osm_ucast_mgr.c
@@ -197,7 +197,8 @@ __osm_ucast_mgr_process_neighbor(IN osm_ucast_mgr_t * const p_mgr,
 static void
 __osm_ucast_mgr_process_port(IN osm_ucast_mgr_t * const p_mgr,
 			     IN osm_switch_t * const p_sw,
-			     IN osm_port_t * const p_port)
+			     IN osm_port_t * const p_port,
+			     IN boolean_t ignore_existing_lfts)
 {
 	uint16_t min_lid_ho;
 	uint16_t max_lid_ho;
@@ -281,7 +282,6 @@ __osm_ucast_mgr_process_port(IN osm_ucast_mgr_t * const p_mgr,
 		/* Use the enhanced algorithm only for LMC > 0 */
 		if (lids_per_port > 1)
 			port = osm_switch_recommend_path(p_sw, p_port, lid_ho,
-							 p_mgr->p_subn->
 							 ignore_existing_lfts,
 							 p_mgr->is_dor,
 							 remote_sys_guids,
@@ -290,7 +290,6 @@ __osm_ucast_mgr_process_port(IN osm_ucast_mgr_t * const p_mgr,
 							 &num_used_nodes);
 		else
 			port = osm_switch_recommend_path(p_sw, p_port, lid_ho,
-							 p_mgr->p_subn->
 							 ignore_existing_lfts,
 							 p_mgr->is_dor,
 							 NULL, NULL, NULL,
@@ -497,6 +496,7 @@ __osm_ucast_mgr_process_tbl(IN cl_map_item_t * const p_map_item,
 	osm_node_t *p_node;
 	osm_port_t *p_port;
 	const cl_qmap_t *p_port_tbl;
+	boolean_t ignore_existing_lfts;
 
 	OSM_LOG_ENTER(p_mgr->p_log, __osm_ucast_mgr_process_tbl);
 
@@ -515,6 +515,38 @@ __osm_ucast_mgr_process_tbl(IN cl_map_item_t * const p_map_item,
 	/* Initialize LIDs in buffer to invalid port number. */
 	memset(p_mgr->lft_buf, 0xff, IB_LID_UCAST_END_HO + 1);
 
+	/* Determine ignore_existing_lfts
+	 *
+	 * We don't want to ignore existing lfts if previous routing
+	 * did not use a switch port and that port is now available. 
+	 * Without this, routing can become very unbalanced, as some 
+	 * ports will be used far more heavily than other ports.  For 
+	 * example, during previous routing, port X was down and was 
+	 * never used.  During a later routing, port X is now up.  Port X 
+	 * may still never be used if we don't ignore existing lfts.
+	 * This can occur if other ports on the switch offered
+	 * equivalent hops to a destination lid.
+	 */
+	ignore_existing_lfts = p_mgr->p_subn->ignore_existing_lfts;
+	if (p_sw->last_routing_bad_ports_count) {
+		osm_physp_t *p_physp;
+		uint8_t port_num;
+		int i;
+		for (i = 0; i < p_sw->last_routing_bad_ports_count; i++) {
+			port_num = p_sw->last_routing_bad_ports[i];
+			p_physp = osm_node_get_physp_ptr(p_sw->p_node, port_num);
+			if (p_physp
+			    && osm_physp_is_healthy(p_physp)
+			    && osm_physp_get_remote(p_physp)) {
+				ignore_existing_lfts = TRUE;
+				break;
+			}
+		}
+	}
+
+	/* Reset last_routing_bad_ports */
+	p_sw->last_routing_bad_ports_count = 0;
+
 	p_port_tbl = &p_mgr->p_subn->port_guid_tbl;
 
 	/*
@@ -525,7 +557,8 @@ __osm_ucast_mgr_process_tbl(IN cl_map_item_t * const p_map_item,
 	for (p_port = (osm_port_t *) cl_qmap_head(p_port_tbl);
 	     p_port != (osm_port_t *) cl_qmap_end(p_port_tbl);
 	     p_port = (osm_port_t *) cl_qmap_next(&p_port->map_item)) {
-		__osm_ucast_mgr_process_port(p_mgr, p_sw, p_port);
+		__osm_ucast_mgr_process_port(p_mgr, p_sw, p_port,
+					     ignore_existing_lfts);
 	}
 
 	osm_ucast_mgr_set_fwd_table(p_mgr, p_sw);
-- 
1.5.1
_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

Reply via email to