> > > > > > > > A plan for OpenSM to support loading unicast routes already exists. > > > > > > What's the timeframe for implementing this ? Has the implementation > > > already started ? Is it what you write below and had posted to the > > list > > > last summer ? > > [EZ] Yes: it is the same proposal. No: the work has not started. > > > > > > > To do it we need to develop a scheme where the routes file also > > holds > > > > the topology using GUIDs > > > > such that the discovered topology is compared to the saved one. I > > have > > > > outlined the algorithm in > > > > the attached writeup on OpenSM routing algorithms section 6: > > > > "Incremental Algorithms". > > > > > > Wouldn't it be better to separate out the persistency and incremental > > > issues ? > > [EZ] I think that once you load the routes in the file you probably need > > to compare the topology used for generating them with the current one. > > Absolutely. > > > The complexity level of "incremental" support - i.e. the ability to > > gracefully handle some topology change - might be built with time. So > > first implementation might be limited to compare with reference topology > > and reroute on any change. > > The reroute on change might ultimately be the default but there might be > an option to disable this. [EZ] OK I see. But this means you intentionally agree to have partial connectivity (i.e. no route from every host to every other host). > > > But it is a bit crude in my mind. > > Yes. It would allow for the experimentation I think they desire though. > > > > Also, I sent numerous questions and comments on this writeup some time > > > ago. > > [EZ] I thought I did answer them. > > I don't think so but I need to dig this out of my email. It's been quite > a while... [EZ] I'll try my archive too > > > > > The right place to implement this will be to have the osm_db* > > enhanced > > > > to support this new DB domain. > > > > Then the osm_ucast_mgr.c will need to initialize the DB and use it > > while > > > > routing. > > > > > > Also, changes versus this either need disabling or handling. > > [EZ] Not following you here. If you mean we need to be able to "shut > > off" this new capability - sure. > > I was referring to more than this (in terms of disabling the reroute on > change). [EZ] OK. > > -- Hal > > > > -- Hal > > > > > > > Regarding dump of existing tables: > > > > > > > > If you run opensm -V or -D 0x43 you should get the file > > /tmp/osm.fdbs > > > > with that dump. > > > > You can also use a more SM independent method for obtaining the > > tables : > > > > if you run ibdiagnet you should get the file /tmp/ibdiagnet.fdbs > > with > > > > similar format. > > > > > > > > Eitan > > > > > > > > Greg Johnson wrote: > > > > > On Thu, May 04, 2006 at 12:39:38PM -0400, Hal Rosenstock wrote: > > > > > > > > > >>Hi Greg, > > > > >> > > > > >>On Thu, 2006-05-04 at 12:34, Greg Johnson wrote: > > > > >> > > > > >>>Is there currently a way to dump and load routes with opensm? If > > > > not, > > > > >>>how would I go about writing one? > > > > >> > > > > >>Is it really routes or stable LIDs you want ? > > > > > > > > > > > > > > > I actually want routes. I have queried them with ibtraceroute and > > > > > ibroute, but we need routes for the whole fabric. > > > > > > > > > > BTW, if you call ibtraceroute thousands of times it stops working. > > > > > Maybe a problem in the MAD driver? > > > > > > > > > > > > > > > > > > > >>LIDs are stored in /var/cache/osm/guid2lid and restored from there > > > > when > > > > >>OpenSM is started assuming the reassign LIDs option (-r or > > > > >>--reassign_lids) is not used when invoking OpenSM. > > > > > > > > > > > > > > > Thanks, that's good to know. > > > > > > > > > > Greg > > > > > _______________________________________________ > > > > > openib-general mailing list > > > > > [email protected] > > > > > http://openib.org/mailman/listinfo/openib-general > > > > > > > > > > To unsubscribe, please visit > > > > http://openib.org/mailman/listinfo/openib-general > > > > > > > > > > > > > > > > > > > > > > ______________________________________________________________________ > > > > > > > > OpenSM Unicast Routing Enhancements for Scalability > > > > ===================================================== > > > > > > > > Authors:Eitan Zahavi <[EMAIL PROTECTED]>, Yael Kalka > > <[EMAIL PROTECTED]> > > > > Date: Aug 2005. > > > > > > > > Table of contents: > > > > 1. Overview > > > > 2. Notation > > > > 3. Current Algorithms > > > > 4. Proposed Routing Algorithms > > > > 5. Min Hop Tables Implementation > > > > 6. Incremental Routing > > > > 7. Routing Persistancy > > > > > > > > 1. Overview: > > > > ------------ > > > > > > > > OpenSM currently uses a two stage routing algorithm for unicast > > > > forwarding tables calculation. As shown later these algorithm are > > > > O(N^3). Inspected run time of OpenSM routing stage was ~2.5min on > > 1100 > > > > nodes cluster. The purpose of this memo is to present the community > > > > the proposed work for enhancing OpenSM routing engine. > > > > > > > > 2. Notation: > > > > ------------ > > > > The following notations are used throughout this document: > > > > S = Number of switch devices in the system > > > > P = Number of ports each switch node has > > > > H = Number of HCA ports connected to the fabric > > > > L = Number of HCAs connected to each Leaf switch device. > > > > Normal values are 1/2P to 3/4P > > > > D = Fat Tree depth > > > > > > > > 3. Current Algorithms: > > > > ---------------------- > > > > OpenSM provide two routing algorithms: Minimal Hop and Up/Down. Both > > > > of them do not scale with cluster size and can consume both large > > run > > > > time (minutes) and memory (GB). This section provides meta code for > > > > these algorithms and order calculation. > > > > > > > > 3.1 Min Hop algorithm analysis: > > > > The Min Hop algorithm is divided into two stages: computation of > > > > min-hop tables on every switch and LFT output port assignment. > > > > > > > > Step 1: Computation of Min-Hop Tables on each switch > > > > > > > > The memory consumed is S*(S+H)*(P+2)*Byte. On 10K nodes cluster with > > > > 2500 switch devices this ends up as 812M-Byte (using LMC=0). > > > > > > > > Meta algorithm: > > > > For each HCA mark its remote neighbor switch port with hop 1. > > > > For each switch mark itself port 0 as hop 0 > > > > While changes > > > > For each switch > > > > For each port > > > > For each LID > > > > Propagate remote port hop as hop +1 if smaller or undefined > > > > > > > > The order of this step: O(S*P*(S+H)*(D+1)) > > > > > > > > Step 2: Assigning output port: > > > > For each switch > > > > For each LID > > > > For each Port > > > > Is it the one with min count > > > > > > > > The order of this step: O(S*(S+H)*P) > > > > > > > > 3.2 Up/Down algorithm analysis > > > > The Up/Down algorithm depends on the ability to rank the fabric > > nodes > > > > from root to leaf of the tree. To get that ranking it runs a > > > > heuristics that is based on the Min Hop tables. So the memory and > > > > complexity are identical to the Min-Hop first step to start with. > > > > > > > > Once ranking is performed the algorithm is BFS from every HCA and > > fill > > > > in the Min Hop tables again. Up/Down traversal rule is enforced > > during > > > > the BFS such that only valid turns are allowed. > > > > > > > > Meta algorithm: > > > > For each HCA > > > > Get connected Switch > > > > For each Switch in NextSwitches > > > > For each Port > > > > Check if direction is OK. Check if not visited > > > > > > > > The order is O(H*S*P) > > > > > > > > To finalize output port assignment the second step of the Min Hop > > > > algorithm is invoked. > > > > > > > > > > > > 4. Proposed Algorithm: > > > > ---------------------- > > > > > > > > Inspecting the routing problem we have noticed the following > > > > attributes: > > > > a. Using Min-Hop tables for keeping intermediate routing > > information > > > > has a disadvantage in terms of memory consumption. However, any > > > > incremental routing algorithm (for handling fabric changes after > > > > first setup), or routing persistence solution could use this > > > > information and gain speed. > > > > b. Since we need to fill in LFT tables that are of the order S*(S+H) > > > > the algorithm is lower bounded by O(H^2). > > > > c. A persistence based solution which uses previously routed fabric > > > > data and is able to handle simple incremental changes will > > provide > > > > a much faster runtime as topology match will require O(S*P) > > > > (traversing all links once) > > > > d. Since the minimum hops information is identical for a switch and > > > > all the HCAs connected to it - there is no point in building "min > > > > hop" tables for HCAs. During the "output port" assignment stage, > > > > the HCAs connected to each switch are considered and routed. > > > > > > > > The result of "a" is that several algorithms that are superior from > > > > memory footprint and skip any "hop table" stage are not considered > > for > > > > implementation. > > > > > > > > To support "d" we needed to provide an index to each switch such > > that > > > > the "min hop" tables are dense (previously they were indexed by > > LID). > > > > The new index is stored on the switch object and thus allow lookup > > of > > > > a switch "min hop" by its index. An array of switch pointer by index > > > > supports the reverse lookup. > > > > > > > > The suggested algorithm is broken into the following 3 stages: > > > > * Root nodes identification heuristics > > > > * Min Hop tables computation > > > > * Output port assignment > > > > > > > > 4.1 Root nodes identification heuristics: > > > > This step is only required under the AND of the following two > > > > conditions: > > > > * Up/Down routing is required > > > > * The user does not provide a file with guids of the tree "root > > nodes" > > > > > > > > This heuristics for recognizing the tree roots is based on > > histograms > > > > of the HCAs distance from every switch. > > > > I.e. How many HCAs are 1 hop, 2 hops from the switch. In order to > > fill > > > > in these histograms on all switches we need to BFS from every leaf > > > > switch and propagate the number of HCAs connected to it: > > > > > > > > Meta algorithm: > > > > For each switch > > > > For each Port > > > > If connected to HCAs count them > > > > If any HCA > > > > Init BFS to start with current switch > > > > set hop count to 0 > > > > While there are switches in BFS list > > > > increment hop count > > > > For each switch in BFS List > > > > Add the number of HCAs to the histogram at the current hop > > count > > > > For each port > > > > If remote port switch not visited > > > > Add the switch to the BFS Next Step List > > > > Once finished all this step list use next steps > > > > > > > > The order of this step is: O(S*P + H/L*S*P) = O(*H*S) > > > > > > > > 4.2 Min Hop tables computation: > > > > This step is mandatory and has a slightly different flavor for the > > > > case of Up/Down routing. > > > > > > > > The algorithm starts from every leaf switch and traverses BFS wise > > > > through the fabric. > > > > > > > > Meta algorithm: > > > > foreach switch in the fabric > > > > clear the "Rank" vector for all switches. > > > > start BFS with the given switch. > > > > set rank to 0 > > > > while any switch in BFS list > > > > | foreach switch in BFS switch list > > > > | |foreach port (valid, active, not unhealthy) > > > > | | if remote side is a switch: > > > > | | if rank of remote side 0 or = rank + 1 > > > > | | set the remote port entry MinHopPort for this switch > > > > | | if rank of remote side 0 i.e. never visited > > > > | | set the remote switch rank to rank + 1 > > > > | | add the remote switch to next BFS switches > > > > | |------ > > > > | switch between the current and next switches list > > > > | increment rank > > > > |------ > > > > > > > > The order of this algorithm: O(P*S^2) > > > > > > > > Algorithm that merges Up/Down step criteria not yet written for this > > > > stage. But the idea is to make each step keep track of any previous > > > > step down. Such that a step up will be prohibited in this case. > > > > > > > > 4.3 Output port assignment: > > > > This step provide actual LFT values assignment to each switch. > > > > To do that we access the built "min hop" tables and track port > > usage. > > > > > > > > Meta algorithm: > > > > foreach switch in the fabric > > > > clear the port subscription vector (track number of paths > > subscribed) > > > > foreach target switch in the "min hop" table > > > > get the list of min-hop ports > > > > foreach end-node attached (HCA connected to it and itself) > > > > if lmc > 0 init tracking of remote system and node for selecting > > > > disjoint paths for same end-node different LID LSBs > > > > get min-subscribed (and disjoint) port marked min-hop target > > switch > > > > track port usage in port-subscription (opt. if target LID is not > > a switch) > > > > > > > > Order of this step: > > > > Currently the selection of the output port by min-subscription is > > > > trivial and requires O(P) so the overall order is > > > > O(S*S*(1+L)*P) <= O(16*P*S^2) > > > > > > > > One could obtain the list of marked min-hop ports and then use a > > > > modified cyclic list for avoiding the search for min subscription in > > > > the case of LMC > 0. In that case the order could be reduced to: > > > > O(S*S*(1+L)) ~= O(S*H) > > > > > > > > 5. Min Hop Table Implementation: > > > > -------------------------------- > > > > The proposed algorithm does not require storing the number of hops > > > > arriving at the switch port - but only the fact a port is on the min > > > > hop path. This allowed for another memory usage reduction if the min > > > > hop table would be of boolean values. > > > > > > > > The issue then is in an efficiant iterator on the boolean (bits) > > > > array. The tradeoff is thus the common memory versus runtime. > > > > > > > > (Anybody knowns off a fast boolean array lookup implementation ?) > > > > > > > > 6. Incremental Routing: > > > > ----------------------- > > > > Once the fabric is routed we can define an algorithm for performing > > > > incremental routing changes. An obvious case is when a link is > > > > declared un-healthy or one of the ports is dropped. Assuming the > > > > recognition of the change is done by some other algorithm. The > > > > following cases apply: > > > > * If the link connects HCA and a switch the HCA is unreachable. No > > > > routing change required. > > > > * If the link is between switches: > > > > * If there at least one another link between these switches: > > > > o Spread all routes going through the failing port to the other > > > > ports connecting to the same switch. > > > > * If there is no other link to these switches > > > > o Go back to all switches that feed into each one of the > > switches > > > > (feed in means they route some target lids through the > > switch) > > > > but only those that route lids that go through the failing > > port. Check > > > > to see if there is another port that goes to a different > > switch > > > > to route that lid to. If there is no other way do nothing. > > > > > > > > How do we support topology changes line moving an HCA from one > > Switch > > > > to another? > > > > > > > > 7. Routing Persistancy: > > > > ----------------------- > > > > To make the subnet initialization faster, one could store the > > existing > > > > routing solution and use it without any calculation. > > > > > > > > The issue is of course what conditions makes the stored routing > > > > obsolete. To maximize the usefullnes of the stored information we > > > > propose to store the Min Hop tables rather then the final port > > > > assignment. It is assumed that after restart there might be a need > > for > > > > modifications to LMC and routing which will invalidate the LFT > > > > anyways. To enable "cache invalidation criteria" the persistent > > > > database should include information that could be used to easily > > check > > > > if the fabric was not altered in a way that invalidates the MinHop > > tables. > > > > > > > > The stored information should hold for each switch in the fabric (by > > guid) > > > > the list of ports and the guids and port numbers on the remote side. > > > > To validate there are no significant changes, the discovered set > > > > of switches is checked to match the stored information. Table 1 > > > > describes the possible changes and their effect on the validity of > > > > the MinHop tables. > > > > > > > > Table 1 - Connectivity Changes effect on Routing Info Validity > > > > > > > > Change | Effect on MinHop Tables | Effect on LFT and MFT > > | > > > > > > ------------------------------------------------------------------------ > > - > > > > New Switch found | Invalidates (might connect | Invalidates > > | > > > > | more HCAs and carries more | > > | > > > > | routing resources) | > > | > > > > > > ------------------------------------------------------------------------ > > - > > > > Missing Switch | Invalidates (MinHops might | Invalidates > > | > > > > | be broken a few steps away)| > > | > > > > > > ------------------------------------------------------------------------ > > - > > > > New cable found | Does not invalidate | Does not invalidate > > | > > > > > > ------------------------------------------------------------------------ > > - > > > > Missing Cable | Invalidates only if there | Invalidates all LIDs > > | > > > > (SW to SW) | is no other cable | going through that > > port | > > > > | connecting the switches | > > | > > > > > > ------------------------------------------------------------------------ > > - > > > > Missing Cable | Does no invalidate | Does not invalidate > > | > > > > (SW to HCA) | | > > | > > > > > > ------------------------------------------------------------------------ > > - > > > > New HCA | Does no invalidate | Does not invalidate > > | > > > > > > ------------------------------------------------------------------------ > > - > > > > Missing HCA | Does no invalidate | Does not invalidate > > | > > > > > > ------------------------------------------------------------------------ > > - > > > > LID Changes | Does no invalidate | Invalidates the > > modified| > > > > | | LIDs > > | > > > > > > ------------------------------------------------------------------------ > > - > > > > > > > > Special marking for "root nodes" shold cache the results of the > > first > > > > step for Up/Dpwn routing. These nodes should be invalidated on any > > > > missing or additional switch conditions. > > > > > > > > _______________________________________________ openib-general mailing list [email protected] http://openib.org/mailman/listinfo/openib-general
To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
