Hi Eitan, On Sun, 2006-05-07 at 01:31, Eitan Zahavi wrote: > Hi Greg, > > A plan for OpenSM to support loading unicast routes already exists.
What's the timeframe for implementing this ? Has the implementation already started ? Is it what you write below and had posted to the list last summer ? > To do it we need to develop a scheme where the routes file also holds > the topology using GUIDs > such that the discovered topology is compared to the saved one. I have > outlined the algorithm in > the attached writeup on OpenSM routing algorithms section 6: > "Incremental Algorithms". Wouldn't it be better to separate out the persistency and incremental issues ? Also, I sent numerous questions and comments on this writeup some time ago. > The right place to implement this will be to have the osm_db* enhanced > to support this new DB domain. > Then the osm_ucast_mgr.c will need to initialize the DB and use it while > routing. Also, changes versus this either need disabling or handling. -- Hal > Regarding dump of existing tables: > > If you run opensm -V or -D 0x43 you should get the file /tmp/osm.fdbs > with that dump. > You can also use a more SM independent method for obtaining the tables : > if you run ibdiagnet you should get the file /tmp/ibdiagnet.fdbs with > similar format. > > Eitan > > Greg Johnson wrote: > > On Thu, May 04, 2006 at 12:39:38PM -0400, Hal Rosenstock wrote: > > > >>Hi Greg, > >> > >>On Thu, 2006-05-04 at 12:34, Greg Johnson wrote: > >> > >>>Is there currently a way to dump and load routes with opensm? If > not, > >>>how would I go about writing one? > >> > >>Is it really routes or stable LIDs you want ? > > > > > > I actually want routes. I have queried them with ibtraceroute and > > ibroute, but we need routes for the whole fabric. > > > > BTW, if you call ibtraceroute thousands of times it stops working. > > Maybe a problem in the MAD driver? > > > > > > > >>LIDs are stored in /var/cache/osm/guid2lid and restored from there > when > >>OpenSM is started assuming the reassign LIDs option (-r or > >>--reassign_lids) is not used when invoking OpenSM. > > > > > > Thanks, that's good to know. > > > > Greg > > _______________________________________________ > > openib-general mailing list > > [email protected] > > http://openib.org/mailman/listinfo/openib-general > > > > To unsubscribe, please visit > http://openib.org/mailman/listinfo/openib-general > > > > > ______________________________________________________________________ > > OpenSM Unicast Routing Enhancements for Scalability > ===================================================== > > Authors:Eitan Zahavi <[EMAIL PROTECTED]>, Yael Kalka <[EMAIL PROTECTED]> > Date: Aug 2005. > > Table of contents: > 1. Overview > 2. Notation > 3. Current Algorithms > 4. Proposed Routing Algorithms > 5. Min Hop Tables Implementation > 6. Incremental Routing > 7. Routing Persistancy > > 1. Overview: > ------------ > > OpenSM currently uses a two stage routing algorithm for unicast > forwarding tables calculation. As shown later these algorithm are > O(N^3). Inspected run time of OpenSM routing stage was ~2.5min on 1100 > nodes cluster. The purpose of this memo is to present the community > the proposed work for enhancing OpenSM routing engine. > > 2. Notation: > ------------ > The following notations are used throughout this document: > S = Number of switch devices in the system > P = Number of ports each switch node has > H = Number of HCA ports connected to the fabric > L = Number of HCAs connected to each Leaf switch device. > Normal values are 1/2P to 3/4P > D = Fat Tree depth > > 3. Current Algorithms: > ---------------------- > OpenSM provide two routing algorithms: Minimal Hop and Up/Down. Both > of them do not scale with cluster size and can consume both large run > time (minutes) and memory (GB). This section provides meta code for > these algorithms and order calculation. > > 3.1 Min Hop algorithm analysis: > The Min Hop algorithm is divided into two stages: computation of > min-hop tables on every switch and LFT output port assignment. > > Step 1: Computation of Min-Hop Tables on each switch > > The memory consumed is S*(S+H)*(P+2)*Byte. On 10K nodes cluster with > 2500 switch devices this ends up as 812M-Byte (using LMC=0). > > Meta algorithm: > For each HCA mark its remote neighbor switch port with hop 1. > For each switch mark itself port 0 as hop 0 > While changes > For each switch > For each port > For each LID > Propagate remote port hop as hop +1 if smaller or undefined > > The order of this step: O(S*P*(S+H)*(D+1)) > > Step 2: Assigning output port: > For each switch > For each LID > For each Port > Is it the one with min count > > The order of this step: O(S*(S+H)*P) > > 3.2 Up/Down algorithm analysis > The Up/Down algorithm depends on the ability to rank the fabric nodes > from root to leaf of the tree. To get that ranking it runs a > heuristics that is based on the Min Hop tables. So the memory and > complexity are identical to the Min-Hop first step to start with. > > Once ranking is performed the algorithm is BFS from every HCA and fill > in the Min Hop tables again. Up/Down traversal rule is enforced during > the BFS such that only valid turns are allowed. > > Meta algorithm: > For each HCA > Get connected Switch > For each Switch in NextSwitches > For each Port > Check if direction is OK. Check if not visited > > The order is O(H*S*P) > > To finalize output port assignment the second step of the Min Hop > algorithm is invoked. > > > 4. Proposed Algorithm: > ---------------------- > > Inspecting the routing problem we have noticed the following > attributes: > a. Using Min-Hop tables for keeping intermediate routing information > has a disadvantage in terms of memory consumption. However, any > incremental routing algorithm (for handling fabric changes after > first setup), or routing persistence solution could use this > information and gain speed. > b. Since we need to fill in LFT tables that are of the order S*(S+H) > the algorithm is lower bounded by O(H^2). > c. A persistence based solution which uses previously routed fabric > data and is able to handle simple incremental changes will provide > a much faster runtime as topology match will require O(S*P) > (traversing all links once) > d. Since the minimum hops information is identical for a switch and > all the HCAs connected to it - there is no point in building "min > hop" tables for HCAs. During the "output port" assignment stage, > the HCAs connected to each switch are considered and routed. > > The result of "a" is that several algorithms that are superior from > memory footprint and skip any "hop table" stage are not considered for > implementation. > > To support "d" we needed to provide an index to each switch such that > the "min hop" tables are dense (previously they were indexed by LID). > The new index is stored on the switch object and thus allow lookup of > a switch "min hop" by its index. An array of switch pointer by index > supports the reverse lookup. > > The suggested algorithm is broken into the following 3 stages: > * Root nodes identification heuristics > * Min Hop tables computation > * Output port assignment > > 4.1 Root nodes identification heuristics: > This step is only required under the AND of the following two > conditions: > * Up/Down routing is required > * The user does not provide a file with guids of the tree "root nodes" > > This heuristics for recognizing the tree roots is based on histograms > of the HCAs distance from every switch. > I.e. How many HCAs are 1 hop, 2 hops from the switch. In order to fill > in these histograms on all switches we need to BFS from every leaf > switch and propagate the number of HCAs connected to it: > > Meta algorithm: > For each switch > For each Port > If connected to HCAs count them > If any HCA > Init BFS to start with current switch > set hop count to 0 > While there are switches in BFS list > increment hop count > For each switch in BFS List > Add the number of HCAs to the histogram at the current hop count > For each port > If remote port switch not visited > Add the switch to the BFS Next Step List > Once finished all this step list use next steps > > The order of this step is: O(S*P + H/L*S*P) = O(*H*S) > > 4.2 Min Hop tables computation: > This step is mandatory and has a slightly different flavor for the > case of Up/Down routing. > > The algorithm starts from every leaf switch and traverses BFS wise > through the fabric. > > Meta algorithm: > foreach switch in the fabric > clear the "Rank" vector for all switches. > start BFS with the given switch. > set rank to 0 > while any switch in BFS list > | foreach switch in BFS switch list > | |foreach port (valid, active, not unhealthy) > | | if remote side is a switch: > | | if rank of remote side 0 or = rank + 1 > | | set the remote port entry MinHopPort for this switch > | | if rank of remote side 0 i.e. never visited > | | set the remote switch rank to rank + 1 > | | add the remote switch to next BFS switches > | |------ > | switch between the current and next switches list > | increment rank > |------ > > The order of this algorithm: O(P*S^2) > > Algorithm that merges Up/Down step criteria not yet written for this > stage. But the idea is to make each step keep track of any previous > step down. Such that a step up will be prohibited in this case. > > 4.3 Output port assignment: > This step provide actual LFT values assignment to each switch. > To do that we access the built "min hop" tables and track port usage. > > Meta algorithm: > foreach switch in the fabric > clear the port subscription vector (track number of paths subscribed) > foreach target switch in the "min hop" table > get the list of min-hop ports > foreach end-node attached (HCA connected to it and itself) > if lmc > 0 init tracking of remote system and node for selecting > disjoint paths for same end-node different LID LSBs > get min-subscribed (and disjoint) port marked min-hop target switch > track port usage in port-subscription (opt. if target LID is not a switch) > > Order of this step: > Currently the selection of the output port by min-subscription is > trivial and requires O(P) so the overall order is > O(S*S*(1+L)*P) <= O(16*P*S^2) > > One could obtain the list of marked min-hop ports and then use a > modified cyclic list for avoiding the search for min subscription in > the case of LMC > 0. In that case the order could be reduced to: > O(S*S*(1+L)) ~= O(S*H) > > 5. Min Hop Table Implementation: > -------------------------------- > The proposed algorithm does not require storing the number of hops > arriving at the switch port - but only the fact a port is on the min > hop path. This allowed for another memory usage reduction if the min > hop table would be of boolean values. > > The issue then is in an efficiant iterator on the boolean (bits) > array. The tradeoff is thus the common memory versus runtime. > > (Anybody knowns off a fast boolean array lookup implementation ?) > > 6. Incremental Routing: > ----------------------- > Once the fabric is routed we can define an algorithm for performing > incremental routing changes. An obvious case is when a link is > declared un-healthy or one of the ports is dropped. Assuming the > recognition of the change is done by some other algorithm. The > following cases apply: > * If the link connects HCA and a switch the HCA is unreachable. No > routing change required. > * If the link is between switches: > * If there at least one another link between these switches: > o Spread all routes going through the failing port to the other > ports connecting to the same switch. > * If there is no other link to these switches > o Go back to all switches that feed into each one of the switches > (feed in means they route some target lids through the switch) > but only those that route lids that go through the failing port. Check > to see if there is another port that goes to a different switch > to route that lid to. If there is no other way do nothing. > > How do we support topology changes line moving an HCA from one Switch > to another? > > 7. Routing Persistancy: > ----------------------- > To make the subnet initialization faster, one could store the existing > routing solution and use it without any calculation. > > The issue is of course what conditions makes the stored routing > obsolete. To maximize the usefullnes of the stored information we > propose to store the Min Hop tables rather then the final port > assignment. It is assumed that after restart there might be a need for > modifications to LMC and routing which will invalidate the LFT > anyways. To enable "cache invalidation criteria" the persistent > database should include information that could be used to easily check > if the fabric was not altered in a way that invalidates the MinHop tables. > > The stored information should hold for each switch in the fabric (by guid) > the list of ports and the guids and port numbers on the remote side. > To validate there are no significant changes, the discovered set > of switches is checked to match the stored information. Table 1 > describes the possible changes and their effect on the validity of > the MinHop tables. > > Table 1 - Connectivity Changes effect on Routing Info Validity > > Change | Effect on MinHop Tables | Effect on LFT and MFT | > ------------------------------------------------------------------------- > New Switch found | Invalidates (might connect | Invalidates | > | more HCAs and carries more | | > | routing resources) | | > ------------------------------------------------------------------------- > Missing Switch | Invalidates (MinHops might | Invalidates | > | be broken a few steps away)| | > ------------------------------------------------------------------------- > New cable found | Does not invalidate | Does not invalidate | > ------------------------------------------------------------------------- > Missing Cable | Invalidates only if there | Invalidates all LIDs | > (SW to SW) | is no other cable | going through that port | > | connecting the switches | | > ------------------------------------------------------------------------- > Missing Cable | Does no invalidate | Does not invalidate | > (SW to HCA) | | | > ------------------------------------------------------------------------- > New HCA | Does no invalidate | Does not invalidate | > ------------------------------------------------------------------------- > Missing HCA | Does no invalidate | Does not invalidate | > ------------------------------------------------------------------------- > LID Changes | Does no invalidate | Invalidates the modified| > | | LIDs | > ------------------------------------------------------------------------- > > Special marking for "root nodes" shold cache the results of the first > step for Up/Dpwn routing. These nodes should be invalidated on any > missing or additional switch conditions. > > _______________________________________________ openib-general mailing list [email protected] http://openib.org/mailman/listinfo/openib-general To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
