Hi,

Below is a writeup on bad port handling by the SM. I would appreciate any comments on this before I move on to the implementation.
Thanks.

-- Hal

Problem Statement:
Currently, OpenSM issues a (directed route) SubnGet for NodeInfo and NodeDescription to any node it finds. It then requests PortInfo for each port which is physically up. There are scenarios where the port is physically up, but there is no response to the SM's Get requests. In this case, OpenSM keeps retrying, never gives up, and doesn't service anything else in the subnet (I'm not 100% positive on this last point).

Assumption:
The proposed solution assumes that the ignore GUIDs file option of OpenSM only impacts the routing algorithm (path counting) and should not be extended for bad port handling.

Proposed Solution:
OpenSM will implement a configurable policy (some number of consecutive missing responses to SM requests). Once the timeout/retry strategy is exhausted, that port will be marked as "bad" by OpenSM.

At this point, should it attempt to revive the port by bringing the physical link down and back up? Should it try this several times before declaring the port as "bad"? In any case, this is a refinement on the basic strategy for dealing with this scenario. There could also be a periodic "ping" at a slower rate to check whether the "bad" ports revive.

A "bad" port per this scenario still maintains its LID and other state. OpenSM will indicate a detected "bad" port via an internal port physical state, which it will set to down. The "real" port physical state will still be reflected accurately inside OpenSM. Once a "bad" port is detected, it will no longer be polled and the routing algorithm should be invoked to route around it.

Is there a need to store these "bad" ports persistently (and ignore them on startup)?
