On Thu, 2005-04-07 at 16:02, Eitan Zahavi wrote:
> Hi Hal,
>
> Please see my comments below.
>
> Eitan Zahavi
>
> > Problem Statement:
> >
> > Currently, OpenSM issues (directed route) SubnGet for NodeInfo and
> > NodeDescription to any node it finds. It then requests PortInfo for
> > each port which is physically up.
> >
> > There are scenarios where the port is physically up, but there is no
> > response to the SM get requests. In this case, the OpenSM keeps
> > retrying, never gives up, and doesn't service anything else in the
> > subnet (I'm not 100% positive on this last point).
> [EZ] I have never seen this! Are you sure about it? Are you sure we
> are talking about gen1 ported to gen2?
>
> What will happen in a case of non responding port is that OpenSM will
> retry the send (actually the lower level does it) for the number of
> retries OpenSM is configured to use (actually 4 times) and then ignore
> the port and everything behind it. The reported topology (on stdout)
> will have the word UNKNOWN on the remote side of the link this port
> connects to.
>
> I will be happy to see a log file that shows what you claim happens.
> Or even if you can explain to me how and where in the code causes
> that.

This was reported by Ron a while ago on this list. He sent log extracts
of what was going on. It was around when I asked about the Anafa
firmware issue with LFTTop.

> I have been checking the way OpenSM handles irresponsive ports during
> the last two weeks, and did not see such case.

Is this in both Gold 1.6.1 (OpenSM 1.7/1.7.1?) and Gold 1.7 (OpenSM 1.8)?

> > Assumption:
> >
> > The proposed solution assumes that the ignore GUIDs file option of
> > OpenSM only impacts the routing algorithm (path counting) and should
> > not be extended for bad port handling.
> [EZ] This is correct.
> >
> > Proposed Solution:
> >
> > The OpenSM will implement a configurable policy (some number of
> > consecutive lack of responses to SM requests). At the point of
> > exhaustion of the timeout/retry strategy, that port will be marked as
> > "bad" by OpenSM.
> [EZ] This is already the current behavior. Nothing should be done.
> >
> > At this point, should it attempt to revive the port by bringing the
> > physical link down and back up ? Should it try this several times
> > before declaring the port as "bad" ? In any case, this is a
> > refinement on the basic strategy for dealing with this scenario.
> >
> > Also, there could also be a periodic "ping" at a slower rate to check
> > if the "bad" ports revive.
> [EZ] This will be released in gen1 within 2 weeks or so.

What OpenSM release will this be?

> The enhancement to light sweep will include the irresponsive ports in
> the light sweep. Once they respond a new heavy sweep will be
> generated.
> >
> > A "bad" port per this scenario still maintains its LID and other
> > state. OpenSM will indicate a "bad" port detected via an internal
> > port physical state which it will set to down. The "real" port
> > physical state will be reflected accurately inside OpenSM.
> [EZ] It is better to use the "un-healthy" bit of the physical port -
> which OpenSM is already maintaining.
> >
> > Once a "bad" port is detected, it will no longer be polled and the
> > routing algorithm should be invoked to route around this.
> >
> > Is there a need to store these "bad" ports persistently (and ignore
> > them on startup) ?
> [EZ] No I do not think so.

Thanks.

-- Hal
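
For illustration only, the policy discussed in this thread (retry a fixed number of times, mark the port un-healthy, keep probing un-healthy ports during light sweeps, and request a heavy sweep once one answers) could be sketched roughly as below. This is Python rather than OpenSM's C, and every name in it (PortHealth, SweepPolicy, probe, max_retries) is hypothetical and not taken from the OpenSM sources. The default of 4 retries simply mirrors the figure Eitan mentions above.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Hashable


@dataclass
class PortHealth:
    """Hypothetical per-port bookkeeping; not an OpenSM structure."""
    healthy: bool = True
    consecutive_timeouts: int = 0


@dataclass
class SweepPolicy:
    # probe(port) is assumed to return True if the port answered the SM request.
    probe: Callable[[Hashable], bool]
    max_retries: int = 4  # assumed default, mirroring the "4 times" in the thread
    ports: Dict[Hashable, PortHealth] = field(default_factory=dict)

    def query(self, port) -> bool:
        """Try a port up to max_retries times; mark it un-healthy on exhaustion."""
        state = self.ports.setdefault(port, PortHealth())
        for _ in range(self.max_retries):
            if self.probe(port):
                state.consecutive_timeouts = 0
                state.healthy = True
                return True
            state.consecutive_timeouts += 1
        state.healthy = False
        return False

    def light_sweep(self) -> bool:
        """Probe each un-healthy port once; True means a heavy sweep is needed."""
        revived = False
        for port, state in self.ports.items():
            if not state.healthy and self.probe(port):
                state.healthy = True
                state.consecutive_timeouts = 0
                revived = True
        return revived
```

With a probe that always times out, query() gives up after max_retries attempts and leaves the port marked un-healthy. A later light_sweep() returns True only once that port answers again, which is the point at which a heavy sweep would be scheduled.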
