Re: [openib-general] Dump and load routes with opensm?

Eitan Zahavi Sat, 06 May 2006 22:28:33 -0700

Hi Greg,

A plan for OpenSM to support loading unicast routes already exists.
To do it we need to develop a scheme where the routes file also holds
the topology using GUIDs
such that the discovered topology is compared to the saved one. I have
outlined the algorithm in
the attached writeup on OpenSM routing algorithms section 6:
"Incremental Algorithms".

The right place to implement this will be to have the osm_db* enhanced
to support this new DB domain.
Then the osm_ucast_mgr.c will need to initialize the DB and use it while
routing.

Regarding dump of existing tables:

If you run opensm -V or -D 0x43 you should get the file /tmp/osm.fdbs
with that dump.
You can also use a more SM independent method for obtaining the tables :
if you run ibdiagnet you should get the file /tmp/ibdiagnet.fdbs with
similar format.

Eitan

Greg Johnson wrote:
> On Thu, May 04, 2006 at 12:39:38PM -0400, Hal Rosenstock wrote:
> 
>>Hi Greg,
>>
>>On Thu, 2006-05-04 at 12:34, Greg Johnson wrote:
>>
>>>Is there currently a way to dump and load routes with opensm?  If
not,
>>>how would I go about writing one?
>>
>>Is it really routes or stable LIDs you want ?
> 
> 
> I actually want routes.  I have queried them with ibtraceroute and
> ibroute, but we need routes for the whole fabric. 
> 
> BTW, if you call ibtraceroute thousands of times it stops working.
> Maybe a problem in the MAD driver?
> 
> 
> 
>>LIDs are stored in /var/cache/osm/guid2lid and restored from there
when
>>OpenSM is started assuming the reassign LIDs option (-r or
>>--reassign_lids) is not used when invoking OpenSM.
> 
> 
> Thanks, that's good to know.
> 
> Greg
> _______________________________________________
> openib-general mailing list
> [email protected]
> http://openib.org/mailman/listinfo/openib-general
> 
> To unsubscribe, please visit
http://openib.org/mailman/listinfo/openib-general

              OpenSM Unicast Routing Enhancements for Scalability
             =====================================================

Authors:Eitan Zahavi <[EMAIL PROTECTED]>, Yael Kalka <[EMAIL PROTECTED]>
Date:   Aug 2005.

Table of contents:
1. Overview
2. Notation
3. Current Algorithms
4. Proposed Routing Algorithms
5. Min Hop Tables Implementation
6. Incremental Routing
7. Routing Persistancy

1. Overview:
------------

OpenSM currently uses a two stage routing algorithm for unicast
forwarding tables calculation. As shown later these algorithm are
O(N^3). Inspected run time of OpenSM routing stage was ~2.5min on 1100
nodes cluster. The purpose of this memo is to present the community
the proposed work for enhancing OpenSM routing engine.

2. Notation:
------------
The following notations are used throughout this document:
S = Number of switch devices in the system
P = Number of ports each switch node has
H = Number of HCA ports connected to the fabric
L = Number of HCAs connected to each Leaf switch device. 
    Normal values are 1/2P to 3/4P
D = Fat Tree depth

3. Current Algorithms:
----------------------
OpenSM provide two routing algorithms: Minimal Hop and Up/Down. Both
of them do not scale with cluster size and can consume both large run
time (minutes) and memory (GB). This section provides meta code for
these algorithms and order calculation.

3.1 Min Hop algorithm analysis:
The Min Hop algorithm is divided into two stages: computation of
min-hop tables on every switch and LFT output port assignment. 

Step 1: Computation of Min-Hop Tables on each switch

The memory consumed is S*(S+H)*(P+2)*Byte. On 10K nodes cluster with
2500 switch devices this ends up as 812M-Byte (using LMC=0).

Meta algorithm:
For each HCA mark its remote neighbor switch port with hop 1.
For each switch mark itself port 0 as hop 0
While changes 
  For each switch  
     For each port 
       For each LID 
         Propagate remote port hop as hop +1 if smaller or undefined

The order of this step: O(S*P*(S+H)*(D+1))

Step 2: Assigning output port:
For each switch 
  For each LID 
     For each Port 
        Is it the one with min count

The order of this step: O(S*(S+H)*P)

3.2 Up/Down algorithm analysis
The Up/Down algorithm depends on the ability to rank the fabric nodes
from root to leaf of the tree. To get that ranking it runs a
heuristics that is based on the Min Hop tables. So the memory and 
complexity are identical to the Min-Hop first step to start with.
 
Once ranking is performed the algorithm is BFS from every HCA and fill
in the Min Hop tables again. Up/Down traversal rule is enforced during
the BFS such that only valid turns are allowed.

Meta algorithm:
For each HCA 
  Get connected Switch 
     For each Switch in NextSwitches 
        For each Port 
           Check if direction is OK. Check if not visited

The order is O(H*S*P)

To finalize output port assignment the second step of the Min Hop
algorithm is invoked. 


4. Proposed Algorithm:
----------------------

Inspecting the routing problem we have noticed the following
attributes:
a. Using Min-Hop tables for keeping  intermediate routing information
   has a disadvantage in terms of memory consumption. However, any
   incremental routing algorithm (for handling fabric changes after
   first setup), or routing persistence solution could use this
   information and gain speed.
b. Since we need to fill in LFT tables that are of the order S*(S+H)
   the algorithm is lower bounded by O(H^2).
c. A persistence based solution which uses previously routed fabric
   data and is able to handle simple incremental changes will provide
   a much faster runtime as topology match will require O(S*P)
   (traversing all links once)
d. Since the minimum hops information is identical for a switch and
   all the HCAs connected to it - there is no point in building "min
   hop" tables for HCAs. During the "output port" assignment stage,
   the HCAs connected to each switch are considered and routed.

The result of "a" is that several algorithms that are superior from
memory footprint and skip any "hop table" stage are not considered for
implementation. 

To support "d" we needed to provide an index to each switch such that
the "min hop" tables are dense (previously they were indexed by LID).
The new index is stored on the switch object and thus allow lookup of
a switch "min hop" by its index. An array of switch pointer by index
supports the reverse lookup.

The suggested algorithm is broken into the following 3 stages:
* Root nodes identification heuristics
* Min Hop tables computation
* Output port assignment

4.1 Root nodes identification heuristics:
This step is only required under the AND of the following two
conditions: 
* Up/Down routing is required
* The user does not provide a file with guids of the tree "root nodes" 

This heuristics for recognizing the tree roots is based on histograms
of the HCAs distance from every switch.
I.e. How many HCAs are 1 hop, 2 hops from the switch. In order to fill
in these histograms on all switches we need to BFS from every leaf
switch and propagate the number of HCAs connected to it:

Meta algorithm:
For each switch 
  For each Port 
    If connected to HCAs count them 
    If any HCA 
     Init BFS to start with current switch 
     set hop count to 0 
     While there are switches in BFS list 
       increment hop count 
       For each switch in BFS List 
         Add the number of HCAs to the histogram at the current hop count 
        For each port 
         If remote port switch not visited 
          Add the switch to the BFS Next Step List 
      Once finished all this step list use next steps

The order of this step is: O(S*P + H/L*S*P) = O(½*H*S)

4.2 Min Hop tables computation:
This step is mandatory and has a slightly different flavor for the
case of Up/Down routing.

The algorithm starts from every leaf switch and traverses BFS wise
through the fabric. 

Meta algorithm:
foreach switch in the fabric
 clear the "Rank" vector for all switches.
 start BFS with the given switch.
 set rank to 0
 while any switch in BFS list
 | foreach switch in BFS switch list
 | |foreach port (valid, active, not unhealthy)
 | | if remote side is a switch: 
 | |  if rank of remote side 0 or = rank + 1
 | |    set the remote port entry MinHopPort for this switch
 | |  if rank of remote side 0 i.e. never visited
 | |    set the remote switch rank to rank + 1
 | |    add the remote switch to next BFS switches
 | |------
 | switch between the current and next switches list
 | increment rank
 |------

The order of this algorithm: O(P*S^2)

Algorithm that merges Up/Down step criteria not yet written for this
stage. But the idea is to make each step keep track of any previous
step down. Such that a step up will be prohibited in this case.

4.3 Output port assignment:
This step provide actual LFT values assignment to each switch.
To do that we access the built "min hop" tables and track port usage.

Meta algorithm:
foreach switch in the fabric
 clear the port subscription vector (track number of paths subscribed)
 foreach target switch in the "min hop" table
  get the list of min-hop ports
  foreach end-node attached (HCA connected to it and itself)
   if lmc > 0 init tracking of remote system and node for selecting
      disjoint paths for same end-node different LID LSBs
   get min-subscribed (and disjoint) port marked min-hop target switch
   track port usage in port-subscription (opt. if target LID is not a switch)

Order of this step: 
Currently the selection of the output port by min-subscription is
trivial and requires O(P) so the overall order is 
O(S*S*(1+L)*P) <= O(16*P*S^2)

One could obtain the list of marked min-hop ports and then use a
modified cyclic list for avoiding the search for min subscription in
the case of LMC > 0. In that case the order could be reduced to:
O(S*S*(1+L)) ~= O(S*H)

5. Min Hop Table Implementation:
--------------------------------
The proposed algorithm does not require storing the number of hops
arriving at the switch port - but only the fact a port is on the min
hop path. This allowed for another memory usage reduction if the min
hop table would be of boolean values. 

The issue then is in an efficiant iterator on the boolean (bits)
array. The tradeoff is thus the common memory versus runtime.

(Anybody knowns off a fast boolean array lookup implementation ?)

6. Incremental Routing:
-----------------------
Once the fabric is routed we can define an algorithm for performing
incremental routing changes. An obvious case is when a link is
declared un-healthy or one of the ports is dropped. Assuming the
recognition of the change is done by some other algorithm. The
following cases apply: 
*       If the link connects HCA and a switch the HCA is unreachable. No
   routing change required. 
*       If the link is between switches:
   * If there at least one another link between these switches:  
     o Spread all routes going through the failing port to the other
       ports connecting to the same switch.
   * If there is no other link to these switches
     o Go back to all switches that feed into each one of the switches
       (feed in means they route some target lids through the switch)
       but only those that route lids that go through the failing port. Check
       to see if there is another port that goes to a different switch
       to route that lid to. If there is no other way do nothing.

How do we support topology changes line moving an HCA from one Switch
to another? 

7. Routing Persistancy:
-----------------------
To make the subnet initialization faster, one could store the existing
routing solution and use it without any calculation. 

The issue is of course what conditions makes the stored routing
obsolete. To maximize the usefullnes of the stored information we
propose to store the Min Hop tables rather then the final port
assignment. It is assumed that after restart there might be a need for
modifications to LMC and routing which will invalidate the LFT
anyways. To enable "cache invalidation criteria" the persistent
database should include information that could be used to easily check
if the fabric was not altered in a way that invalidates the MinHop tables. 

The stored information should hold for each switch in the fabric (by guid)
the list of ports and the guids and port numbers on the remote side. 
To validate there are no significant changes, the discovered set
of switches is checked to match the stored information. Table 1
describes the possible changes and their effect on the  validity of
the MinHop tables. 

Table 1 - Connectivity Changes effect on Routing Info Validity

Change           | Effect on MinHop Tables       | Effect on LFT and MFT   |
-------------------------------------------------------------------------
New Switch found | Invalidates (might connect | Invalidates             |
                 | more HCAs and carries more |                         |
                 | routing resources)         |                         |
-------------------------------------------------------------------------
Missing Switch   | Invalidates (MinHops might | Invalidates             |
                 | be broken a few steps away)|                         |
-------------------------------------------------------------------------
New cable found  | Does not invalidate        | Does not invalidate     |
-------------------------------------------------------------------------
Missing Cable    | Invalidates only if there  | Invalidates all LIDs    |
(SW to SW)       | is no other cable          | going through that port |
                 | connecting the switches    |                         |
-------------------------------------------------------------------------
Missing Cable    | Does no invalidate         | Does not invalidate     |
(SW to HCA)      |                            |                         |
-------------------------------------------------------------------------
New HCA          | Does no invalidate         | Does not invalidate     |
-------------------------------------------------------------------------
Missing HCA      | Does no invalidate         | Does not invalidate     |
-------------------------------------------------------------------------
LID Changes      | Does no invalidate         | Invalidates the modified|
                 |                            | LIDs                    |
-------------------------------------------------------------------------

Special marking for "root nodes" shold cache the results of the first
step for Up/Dpwn routing. These nodes should be invalidated on any
missing or additional switch conditions.

_______________________________________________
openib-general mailing list
[email protected]
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

Re: [openib-general] Dump and load routes with opensm?

Reply via email to