Hi Greg,
A plan for OpenSM to support loading unicast routes already exists.
To do it we need to develop a scheme where the routes file also holds
the topology using GUIDs
such that the discovered topology is compared to the saved one. I have
outlined the algorithm in
the attached writeup on OpenSM routing algorithms section 6:
"Incremental Algorithms".
The right place to implement this will be to have the osm_db* enhanced
to support this new DB domain.
Then the osm_ucast_mgr.c will need to initialize the DB and use it while
routing.
Regarding dump of existing tables:
If you run opensm -V or -D 0x43 you should get the file /tmp/osm.fdbs
with that dump.
You can also use a more SM independent method for obtaining the tables :
if you run ibdiagnet you should get the file /tmp/ibdiagnet.fdbs with
similar format.
Eitan
Greg Johnson wrote:
> On Thu, May 04, 2006 at 12:39:38PM -0400, Hal Rosenstock wrote:
>
>>Hi Greg,
>>
>>On Thu, 2006-05-04 at 12:34, Greg Johnson wrote:
>>
>>>Is there currently a way to dump and load routes with opensm? If
not,
>>>how would I go about writing one?
>>
>>Is it really routes or stable LIDs you want ?
>
>
> I actually want routes. I have queried them with ibtraceroute and
> ibroute, but we need routes for the whole fabric.
>
> BTW, if you call ibtraceroute thousands of times it stops working.
> Maybe a problem in the MAD driver?
>
>
>
>>LIDs are stored in /var/cache/osm/guid2lid and restored from there
when
>>OpenSM is started assuming the reassign LIDs option (-r or
>>--reassign_lids) is not used when invoking OpenSM.
>
>
> Thanks, that's good to know.
>
> Greg
> _______________________________________________
> openib-general mailing list
> [email protected]
> http://openib.org/mailman/listinfo/openib-general
>
> To unsubscribe, please visit
http://openib.org/mailman/listinfo/openib-general
OpenSM Unicast Routing Enhancements for Scalability
=====================================================
Authors:Eitan Zahavi <[EMAIL PROTECTED]>, Yael Kalka <[EMAIL PROTECTED]>
Date: Aug 2005.
Table of contents:
1. Overview
2. Notation
3. Current Algorithms
4. Proposed Routing Algorithms
5. Min Hop Tables Implementation
6. Incremental Routing
7. Routing Persistancy
1. Overview:
------------
OpenSM currently uses a two stage routing algorithm for unicast
forwarding tables calculation. As shown later these algorithm are
O(N^3). Inspected run time of OpenSM routing stage was ~2.5min on 1100
nodes cluster. The purpose of this memo is to present the community
the proposed work for enhancing OpenSM routing engine.
2. Notation:
------------
The following notations are used throughout this document:
S = Number of switch devices in the system
P = Number of ports each switch node has
H = Number of HCA ports connected to the fabric
L = Number of HCAs connected to each Leaf switch device.
Normal values are 1/2P to 3/4P
D = Fat Tree depth
3. Current Algorithms:
----------------------
OpenSM provide two routing algorithms: Minimal Hop and Up/Down. Both
of them do not scale with cluster size and can consume both large run
time (minutes) and memory (GB). This section provides meta code for
these algorithms and order calculation.
3.1 Min Hop algorithm analysis:
The Min Hop algorithm is divided into two stages: computation of
min-hop tables on every switch and LFT output port assignment.
Step 1: Computation of Min-Hop Tables on each switch
The memory consumed is S*(S+H)*(P+2)*Byte. On 10K nodes cluster with
2500 switch devices this ends up as 812M-Byte (using LMC=0).
Meta algorithm:
For each HCA mark its remote neighbor switch port with hop 1.
For each switch mark itself port 0 as hop 0
While changes
For each switch
For each port
For each LID
Propagate remote port hop as hop +1 if smaller or undefined
The order of this step: O(S*P*(S+H)*(D+1))
Step 2: Assigning output port:
For each switch
For each LID
For each Port
Is it the one with min count
The order of this step: O(S*(S+H)*P)
3.2 Up/Down algorithm analysis
The Up/Down algorithm depends on the ability to rank the fabric nodes
from root to leaf of the tree. To get that ranking it runs a
heuristics that is based on the Min Hop tables. So the memory and
complexity are identical to the Min-Hop first step to start with.
Once ranking is performed the algorithm is BFS from every HCA and fill
in the Min Hop tables again. Up/Down traversal rule is enforced during
the BFS such that only valid turns are allowed.
Meta algorithm:
For each HCA
Get connected Switch
For each Switch in NextSwitches
For each Port
Check if direction is OK. Check if not visited
The order is O(H*S*P)
To finalize output port assignment the second step of the Min Hop
algorithm is invoked.
4. Proposed Algorithm:
----------------------
Inspecting the routing problem we have noticed the following
attributes:
a. Using Min-Hop tables for keeping intermediate routing information
has a disadvantage in terms of memory consumption. However, any
incremental routing algorithm (for handling fabric changes after
first setup), or routing persistence solution could use this
information and gain speed.
b. Since we need to fill in LFT tables that are of the order S*(S+H)
the algorithm is lower bounded by O(H^2).
c. A persistence based solution which uses previously routed fabric
data and is able to handle simple incremental changes will provide
a much faster runtime as topology match will require O(S*P)
(traversing all links once)
d. Since the minimum hops information is identical for a switch and
all the HCAs connected to it - there is no point in building "min
hop" tables for HCAs. During the "output port" assignment stage,
the HCAs connected to each switch are considered and routed.
The result of "a" is that several algorithms that are superior from
memory footprint and skip any "hop table" stage are not considered for
implementation.
To support "d" we needed to provide an index to each switch such that
the "min hop" tables are dense (previously they were indexed by LID).
The new index is stored on the switch object and thus allow lookup of
a switch "min hop" by its index. An array of switch pointer by index
supports the reverse lookup.
The suggested algorithm is broken into the following 3 stages:
* Root nodes identification heuristics
* Min Hop tables computation
* Output port assignment
4.1 Root nodes identification heuristics:
This step is only required under the AND of the following two
conditions:
* Up/Down routing is required
* The user does not provide a file with guids of the tree "root nodes"
This heuristics for recognizing the tree roots is based on histograms
of the HCAs distance from every switch.
I.e. How many HCAs are 1 hop, 2 hops from the switch. In order to fill
in these histograms on all switches we need to BFS from every leaf
switch and propagate the number of HCAs connected to it:
Meta algorithm:
For each switch
For each Port
If connected to HCAs count them
If any HCA
Init BFS to start with current switch
set hop count to 0
While there are switches in BFS list
increment hop count
For each switch in BFS List
Add the number of HCAs to the histogram at the current hop count
For each port
If remote port switch not visited
Add the switch to the BFS Next Step List
Once finished all this step list use next steps
The order of this step is: O(S*P + H/L*S*P) = O(½*H*S)
4.2 Min Hop tables computation:
This step is mandatory and has a slightly different flavor for the
case of Up/Down routing.
The algorithm starts from every leaf switch and traverses BFS wise
through the fabric.
Meta algorithm:
foreach switch in the fabric
clear the "Rank" vector for all switches.
start BFS with the given switch.
set rank to 0
while any switch in BFS list
| foreach switch in BFS switch list
| |foreach port (valid, active, not unhealthy)
| | if remote side is a switch:
| | if rank of remote side 0 or = rank + 1
| | set the remote port entry MinHopPort for this switch
| | if rank of remote side 0 i.e. never visited
| | set the remote switch rank to rank + 1
| | add the remote switch to next BFS switches
| |------
| switch between the current and next switches list
| increment rank
|------
The order of this algorithm: O(P*S^2)
Algorithm that merges Up/Down step criteria not yet written for this
stage. But the idea is to make each step keep track of any previous
step down. Such that a step up will be prohibited in this case.
4.3 Output port assignment:
This step provide actual LFT values assignment to each switch.
To do that we access the built "min hop" tables and track port usage.
Meta algorithm:
foreach switch in the fabric
clear the port subscription vector (track number of paths subscribed)
foreach target switch in the "min hop" table
get the list of min-hop ports
foreach end-node attached (HCA connected to it and itself)
if lmc > 0 init tracking of remote system and node for selecting
disjoint paths for same end-node different LID LSBs
get min-subscribed (and disjoint) port marked min-hop target switch
track port usage in port-subscription (opt. if target LID is not a switch)
Order of this step:
Currently the selection of the output port by min-subscription is
trivial and requires O(P) so the overall order is
O(S*S*(1+L)*P) <= O(16*P*S^2)
One could obtain the list of marked min-hop ports and then use a
modified cyclic list for avoiding the search for min subscription in
the case of LMC > 0. In that case the order could be reduced to:
O(S*S*(1+L)) ~= O(S*H)
5. Min Hop Table Implementation:
--------------------------------
The proposed algorithm does not require storing the number of hops
arriving at the switch port - but only the fact a port is on the min
hop path. This allowed for another memory usage reduction if the min
hop table would be of boolean values.
The issue then is in an efficiant iterator on the boolean (bits)
array. The tradeoff is thus the common memory versus runtime.
(Anybody knowns off a fast boolean array lookup implementation ?)
6. Incremental Routing:
-----------------------
Once the fabric is routed we can define an algorithm for performing
incremental routing changes. An obvious case is when a link is
declared un-healthy or one of the ports is dropped. Assuming the
recognition of the change is done by some other algorithm. The
following cases apply:
* If the link connects HCA and a switch the HCA is unreachable. No
routing change required.
* If the link is between switches:
* If there at least one another link between these switches:
o Spread all routes going through the failing port to the other
ports connecting to the same switch.
* If there is no other link to these switches
o Go back to all switches that feed into each one of the switches
(feed in means they route some target lids through the switch)
but only those that route lids that go through the failing port. Check
to see if there is another port that goes to a different switch
to route that lid to. If there is no other way do nothing.
How do we support topology changes line moving an HCA from one Switch
to another?
7. Routing Persistancy:
-----------------------
To make the subnet initialization faster, one could store the existing
routing solution and use it without any calculation.
The issue is of course what conditions makes the stored routing
obsolete. To maximize the usefullnes of the stored information we
propose to store the Min Hop tables rather then the final port
assignment. It is assumed that after restart there might be a need for
modifications to LMC and routing which will invalidate the LFT
anyways. To enable "cache invalidation criteria" the persistent
database should include information that could be used to easily check
if the fabric was not altered in a way that invalidates the MinHop tables.
The stored information should hold for each switch in the fabric (by guid)
the list of ports and the guids and port numbers on the remote side.
To validate there are no significant changes, the discovered set
of switches is checked to match the stored information. Table 1
describes the possible changes and their effect on the validity of
the MinHop tables.
Table 1 - Connectivity Changes effect on Routing Info Validity
Change | Effect on MinHop Tables | Effect on LFT and MFT |
-------------------------------------------------------------------------
New Switch found | Invalidates (might connect | Invalidates |
| more HCAs and carries more | |
| routing resources) | |
-------------------------------------------------------------------------
Missing Switch | Invalidates (MinHops might | Invalidates |
| be broken a few steps away)| |
-------------------------------------------------------------------------
New cable found | Does not invalidate | Does not invalidate |
-------------------------------------------------------------------------
Missing Cable | Invalidates only if there | Invalidates all LIDs |
(SW to SW) | is no other cable | going through that port |
| connecting the switches | |
-------------------------------------------------------------------------
Missing Cable | Does no invalidate | Does not invalidate |
(SW to HCA) | | |
-------------------------------------------------------------------------
New HCA | Does no invalidate | Does not invalidate |
-------------------------------------------------------------------------
Missing HCA | Does no invalidate | Does not invalidate |
-------------------------------------------------------------------------
LID Changes | Does no invalidate | Invalidates the modified|
| | LIDs |
-------------------------------------------------------------------------
Special marking for "root nodes" shold cache the results of the first
step for Up/Dpwn routing. These nodes should be invalidated on any
missing or additional switch conditions.
_______________________________________________
openib-general mailing list
[email protected]
http://openib.org/mailman/listinfo/openib-general
To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general