On Wed, 2005-04-13 at 01:28, Eitan Zahavi wrote:
> [EZ] This is true. Currently there is only one cause for the
> un-healthy bits to be set - which are exactly as you point - these
> traps. The point I was trying to make was that this bit is the
> mechanism for flagging a port status is bad.
>
> What I did recommend was to write a "statistical" analysis of Directed
> Route packet drop - such that we can find the ports with a high drop
> rate and mark them as un-healthy. If you mark every port that does not
> respond to a MAD as un-healthy you can suffer from flaky links
> somewhere on the route to that port. Only analysis of the number of
> good packets vs. dropped packets can lead you to the right bad port.

The original proposal on this said the following:

"The OpenSM will implement a configurable policy (some number of
consecutive lack of responses to SM requests). At the point of
exhaustion of the timeout/retry strategy, that port will be marked as
"bad" by OpenSM."

Any idea what might make a good default threshold (for consecutive
retries)? Or do you think no default is sufficient? If a link is flaky
and MADs can't get through, should it be used for non-MAD traffic?

Note that the proposal also said:

"Also, there could also be a periodic "ping" at a slower rate to check
if the "bad" ports revive."

In terms of analysis of good vs. errored and dropped packets (along the
path to that node), there are OpenIB diagnostic tools to help with this.

-- Hal
_______________________________________________
openib-general mailing list
http://openib.org/mailman/listinfo/openib-general
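
For illustration only, the policy quoted from the proposal (a configurable number of consecutive missed responses, then a slower periodic "ping" to see whether a "bad" port revives) might be sketched roughly as below. This is not OpenSM code: the class name PortHealth, its fields, and the default values 3 and 10 are assumptions made up for the sketch, and the thread itself leaves the right default threshold open.

```python
from dataclasses import dataclass


@dataclass
class PortHealth:
    """Hypothetical per-port state; not an OpenSM data structure."""
    max_consecutive_misses: int = 3  # assumed default, undecided in the thread
    probe_interval: int = 10         # assumed: sweeps between pings of a bad port
    misses: int = 0
    bad: bool = False
    sweeps_since_probe: int = 0

    def record(self, responded: bool) -> None:
        """Call once per SM request, after its timeout/retry strategy is exhausted."""
        if responded:
            # Any response clears the streak and revives a bad port.
            self.misses = 0
            self.bad = False
            self.sweeps_since_probe = 0
            return
        self.misses += 1
        if not self.bad and self.misses >= self.max_consecutive_misses:
            self.bad = True
            self.sweeps_since_probe = 0

    def should_query(self) -> bool:
        """Healthy ports are queried every sweep, bad ports only at the slower rate."""
        if not self.bad:
            return True
        self.sweeps_since_probe += 1
        if self.sweeps_since_probe >= self.probe_interval:
            self.sweeps_since_probe = 0
            return True
        return False
```

Counting only consecutive misses means a single lost MAD never marks a port bad, but it does not by itself tell a dead port apart from a flaky link on the route to it, which is the concern raised in the quoted text.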

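
As a rough sketch of the "statistical" analysis suggested in the quoted text, one could count good and dropped directed-route packets against every hop on the route and flag only hops whose drop rate is high. Everything here is hypothetical: the DropStats class, the hop labels such as "sw2:p4", and the min_rate and min_samples thresholds are assumptions for illustration, not part of OpenSM or the OpenIB diagnostic tools.

```python
from collections import defaultdict


class DropStats:
    """Hypothetical per-hop counters of good vs. dropped directed-route packets."""

    def __init__(self):
        self.good = defaultdict(int)
        self.dropped = defaultdict(int)

    def record(self, path, delivered):
        """Charge one packet outcome to every hop on its directed route."""
        counter = self.good if delivered else self.dropped
        for hop in path:
            counter[hop] += 1

    def drop_rate(self, hop):
        """Fraction of packets through this hop that were dropped."""
        total = self.good[hop] + self.dropped[hop]
        return self.dropped[hop] / total if total else 0.0

    def suspects(self, min_rate=0.5, min_samples=4):
        """Hops with enough samples and a drop rate at or above min_rate."""
        hops = set(self.good) | set(self.dropped)
        return sorted(
            h for h in hops
            if self.good[h] + self.dropped[h] >= min_samples
            and self.drop_rate(h) >= min_rate
        )


stats = DropStats()
for _ in range(6):
    stats.record(["sw1:p1", "sw2:p3"], delivered=True)
for _ in range(4):
    stats.record(["sw1:p1", "sw2:p4"], delivered=False)
print(stats.suspects())  # ['sw2:p4']
```

In this made-up example the shared hop "sw1:p1" also carries six good packets, so its drop rate stays at 0.4 and it is not flagged, while "sw2:p4" sees only drops. That is the sense in which comparing good and dropped counts can point at the right port rather than at everything behind a flaky link.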