Hey Yevgeny,

This looks like a great idea. But is there a reason it's only supported for LMC=0? Since the caching is handled at the ucast-mgr level (rather than in the routing algorithm code), I don't quite see why LMC=0 matters.
Maybe it is because of the future incremental routing on your to-do list? If that's the case, then instead of caching only when LMC=0, perhaps initial incremental routing should work only under LMC=0, and incremental routing for LMC > 0 could be added later on.

Al

On Sun, 2008-05-04 at 13:08 +0300, Yevgeny Kliteynik wrote:
> One thing I need to add here: ucast cache is currently supported
> for LMC=0 only.
>
> -- Yevgeny
>
> Yevgeny Kliteynik wrote:
> > Hi Sasha,
> >
> > The following series of 4 patches implements a unicast routing cache
> > in OpenSM.
> >
> > None of the current routing engines is scalable when we're talking
> > about big clusters. On a ~5K cluster with ~1.3K switches, it takes
> > about two minutes to calculate the routing. The problem is that each
> > time, the routing is calculated from scratch.
> >
> > Incremental routing (which is on my to-do list) aims to address this
> > problem when there is some "local" change in the fabric (e.g. a single
> > switch failure, a single link failure, a link added, etc.).
> > In such cases we can use the routing that was already calculated in
> > the previous heavy sweep, and then we just have to modify it according
> > to the change.
> >
> > For instance, if some switch has disappeared from the fabric, we can
> > use the routing that existed with this switch, take a step back from
> > this switch, and see whether it is possible to route all the LIDs that
> > were routed through this switch some other way (which is usually the case).
> >
> > To implement incremental routing, we need to create some kind of unicast
> > routing cache, which is what these patches implement. In addition to
> > being a step toward incremental routing, the routing cache is useful
> > by itself.
> >
> > This cache can save us a routing calculation in case of a change in the
> > leaf switches or in the hosts.
> > For instance, if some node is rebooted, OpenSM would start a heavy
> > sweep with a full routing recalculation when the HCA goes down, and
> > another one when the HCA comes back up, when in fact both of these
> > routing calculations could be replaced by using the unicast routing
> > cache.
> >
> > The unicast routing cache comprises the following:
> >  - Topology: a data structure with all the switches and CAs of the fabric
> >  - LFTs: each switch has its LFT cached
> >  - LID matrices: each switch has its LID matrices cached, which are
> >    needed for multicast routing (which is not cached)
> >
> > There is a topology matching function that compares the current topology
> > with the cached one to find out whether the cache is usable (valid) or not.
> >
> > The cache is used the following way:
> >  - SM is executed - it starts the first routing calculation
> >  - the calculated routing is stored in the cache
> >  - at some point a new heavy sweep is triggered
> >  - the unicast manager checks whether the cache can be used instead
> >    of a new routing calculation.
> >    In one of the following cases we can use the cached routing:
> >     + there is no topology change
> >     + one or more CAs disappeared (they exist in the cached topology
> >       model, but are missing in the newly discovered fabric)
> >     + one or more leaf switches disappeared
> >    In these cases the cached routing is written to the switches as is
> >    (unless the switch no longer exists).
> >    If there is any other topology change:
> >     - the existing cache is invalidated
> >     - the topology is cached
> >     - the routing is calculated as usual
> >     - the routing is cached
> >
> > My simulations show that where the usual routing phase of the heavy
> > sweep on the topology mentioned above takes ~2 minutes, cached routing
> > reduces this time to 6 seconds (which is nice, if you ask me...).
> >
> > Of all the cases when the cache is valid, the most painful and most
> > complained-about case is when a compute node reboot (which happens
> > pretty often) causes two heavy sweeps with two full routing
> > calculations. The unicast routing cache is aimed at solving this
> > problem (again, in addition to being a step toward incremental routing).
> >
> > -- Yevgeny

--
Albert Chu
[EMAIL PROTECTED]
925-422-5311
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory

_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
