Hi Al,

On 20:17 Thu 28 Feb, Albert Chu wrote:
> 
> After some investigation, I found out that after the initial heavy sweep
> is done, some of the ports on some switches are down (I assume hardware
> racing during bringup), and thus opensm does not route through those
> ports.  When opensm does a heavy resweep later on (I assume b/c some traps
> are received when those down ports come up), opensm keeps the same old
> forwarding tables from before b/c ignore_existing_lfts is FALSE and b/c
> the least hops are the same (other ports on the switch go to the same
> parent).  Thus, we get healthy ports not forwarding to a parent switch.

I see the problem. Actually I think it is even worse: for example, if a
new switch (or switches) is connected to the fabric, routing will not be
rebalanced on the existing ones.
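For reference, the reuse decision Al describes can be modeled roughly as below. This is a simplified, hypothetical sketch, not the real osm_switch_recommend_path() code: the point is that the previously programmed port wins whenever it is up and still minimal-hop, so a recovered sibling port never attracts traffic.

```c
#include <stdint.h>

#define OSM_NO_PATH 0xFF

/* Simplified model (hypothetical, not the real OpenSM code) of the
 * path-reuse decision: keep the previously programmed egress port if
 * it is still up and still offers the minimal hop count; only pick a
 * fresh port when forced to. */
uint8_t recommend_path(uint8_t old_port, int old_link_up,
                       const uint8_t *hops, uint8_t num_ports,
                       int ignore_existing)
{
	uint8_t best = OSM_NO_PATH, min_hops = 0xFF, i;

	/* Find the first port with the minimal hop count (port 0 is
	 * the switch management port, so start from 1). */
	for (i = 1; i < num_ports; i++)
		if (hops[i] < min_hops) {
			min_hops = hops[i];
			best = i;
		}

	/* The old entry wins whenever it is up and still minimal-hop,
	 * so a port that came back up after the initial sweep never
	 * gets traffic. */
	if (!ignore_existing && old_port != OSM_NO_PATH && old_link_up &&
	    old_port < num_ports && hops[old_port] == min_hops)
		return old_port;

	return best;
}
```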

> There are multiple ways to deal with this.  I made the attached patch
> which solved the problem on one of our test clusters.  It's pretty simple.
>  Store all of the "bad ports" that were found during a switch
> configuration.  During the next heavy resweep, if some of those "bad
> ports" are now up, I set ignore_existing_lfts to TRUE for just that
> switch, leading to a completely new forwarding table of the switch.
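A minimal sketch of the per-switch bookkeeping Al describes might look like the following. All names here are illustrative assumptions, not taken from the actual patch: remember which ports were down during the last configuration pass, and force a full LFT rebuild for that switch when one of them comes back up.

```c
#include <stdint.h>

/* Hypothetical per-switch state; names are illustrative only. */
struct sw_state {
	uint32_t bad_port_mask;   /* ports found down last sweep */
	int ignore_existing_lfts; /* per-switch rebuild flag */
};

/* Record a port found down during switch configuration. */
void note_bad_port(struct sw_state *sw, uint8_t port)
{
	sw->bad_port_mask |= (uint32_t)1 << port;
}

/* Called for each port during the next heavy resweep. */
void resweep_port(struct sw_state *sw, uint8_t port, int link_up)
{
	if (link_up && (sw->bad_port_mask & ((uint32_t)1 << port))) {
		/* A formerly bad port recovered: rebuild this
		 * switch's forwarding table from scratch. */
		sw->ignore_existing_lfts = 1;
		sw->bad_port_mask &= ~((uint32_t)1 << port);
	}
}
```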

Why not keep an is_bad flag on osm_physp_t itself? It would save some
comparison loops.

Hmm, thinking more about this: currently we track port state
migrations to INIT during subnet discovery, in order to keep the port
tables up to date. I think this could be used to update
'ignore_existing_lfts' as well. Something like this (not tested):

diff --git a/opensm/include/opensm/osm_switch.h b/opensm/include/opensm/osm_switch.h
index e2fe86d..567ff6f 100644
--- a/opensm/include/opensm/osm_switch.h
+++ b/opensm/include/opensm/osm_switch.h
@@ -110,6 +110,7 @@ typedef struct _osm_switch {
        osm_mcast_tbl_t mcast_tbl;
        uint32_t discovery_count;
        unsigned need_update;
+       unsigned ignore_existing_lfts;
        void *priv;
 } osm_switch_t;
 /*
diff --git a/opensm/opensm/osm_port_info_rcv.c b/opensm/opensm/osm_port_info_rcv.c
index ecac2a8..a1b547e 100644
--- a/opensm/opensm/osm_port_info_rcv.c
+++ b/opensm/opensm/osm_port_info_rcv.c
@@ -316,6 +316,9 @@ __osm_pi_rcv_process_switch_port(IN osm_sm_t * sm,
 
        if (ib_port_info_get_port_state(p_pi) > IB_LINK_INIT && p_node->sw)
                p_node->sw->need_update = 0;
+
+       if (p_physp->need_update)
+               p_node->sw->ignore_existing_lfts = 1;
 
        if (port_num == 0)
                pi_rcv_check_and_fix_lid(sm->p_log, p_pi, p_physp);
diff --git a/opensm/opensm/osm_state_mgr.c b/opensm/opensm/osm_state_mgr.c
index 38b2c4e..dec1d0a 100644
--- a/opensm/opensm/osm_state_mgr.c
+++ b/opensm/opensm/osm_state_mgr.c
@@ -148,6 +148,7 @@ __osm_state_mgr_reset_switch_count(IN cl_map_item_t * const p_map_item,
 
        p_sw->discovery_count = 0;
        p_sw->need_update = 1;
+       p_sw->ignore_existing_lfts = 0;
 }
 
 /**********************************************************************
diff --git a/opensm/opensm/osm_switch.c b/opensm/opensm/osm_switch.c
index d74cb6c..67223e5 100644
--- a/opensm/opensm/osm_switch.c
+++ b/opensm/opensm/osm_switch.c
@@ -101,6 +101,7 @@ osm_switch_init(IN osm_switch_t * const p_sw,
        p_sw->switch_info = *p_si;
        p_sw->num_ports = num_ports;
        p_sw->need_update = 1;
+       p_sw->ignore_existing_lfts = 1;
 
        status = osm_fwd_tbl_init(&p_sw->fwd_tbl, p_si);
        if (status != IB_SUCCESS)
@@ -303,7 +304,7 @@ osm_switch_recommend_path(IN const osm_switch_t * const p_sw,
           3. the physical port has a remote port (the link is up)
           4. the port has min-hops to the target (avoid loops)
         */
-       if (!ignore_existing) {
+       if (!ignore_existing && !p_sw->ignore_existing_lfts) {
                port_num = osm_fwd_tbl_get(&p_sw->fwd_tbl, lid_ho);
 
                if (port_num != OSM_NO_PATH) {


Here I added an 'ignore_existing_lfts' flag per switch too. What do you
think?

Apart from this, it could also be useful to add a console command to
set p_subn->ignore_existing_lfts manually.
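Such a command could be sketched as below. This is only an illustration under assumed names; the command vocabulary and the subnet struct here are not the actual OpenSM console API.

```c
#include <string.h>

/* Hypothetical stand-in for the subnet object; illustrative only. */
struct subn {
	int ignore_existing_lfts;
};

/* Parse a hypothetical "ignore_existing_lfts on|off" console argument.
 * Returns 0 if the argument was recognized, -1 otherwise. */
int console_ignore_lfts_cmd(struct subn *p_subn, const char *arg)
{
	if (!strcmp(arg, "on")) {
		p_subn->ignore_existing_lfts = 1;
		return 0;
	}
	if (!strcmp(arg, "off")) {
		p_subn->ignore_existing_lfts = 0;
		return 0;
	}
	return -1; /* unknown argument */
}
```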

> During my performance testing on this patch, performance with a few
> mpibench tests is actually worse by a few percent with this patch.  I am
> only using 120 of 144 nodes on this cluster.  It's not a big cluster, has
> two levels' worth of switches (24-port switches going up to a 288-port
> switch).  Yup, the cluster is not "filled out" yet :-).  So there is some
> randomness on which specific nodes run the job and if the lid routing
> layout is better/worse for that specific set of nodes.
> 
> Intuitively, we think this will be better as a whole even though my
> current testing can't show it.  Can you think of anything that would make
> this patch worse for performance as a whole?  Could you see some side
> effect leading to a lot more traffic on the network?

Hmm, interesting... Are you running mpibench during the heavy sweep? If
so, could the degradation be due to path migration and the resulting
potential packet drops?

Sasha
_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
