I probably did not make my point very clear:
It is bad (not to say wrong) to disqualify a port and mark it as a bad port just because it did not respond to queries.
The cause of the issue might be a flaky link on the directed route to the port.
If the SM were able to find the port with the flaky link, it would avoid marking the wrong ports. Moreover, the port that the simplistic algorithm you propose would have marked as bad will still be discovered and be operational, as there are many other paths to reach it that walk around the real bad port!
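
To make the idea concrete, here is a minimal sketch of such a statistical analysis. It is written in Python for brevity and is not OpenSM code. The function names, the link labels and the thresholds are all hypothetical. It assumes the SM records, for every directed route it probes, how many MADs were sent and how many were answered, and that the prefix of each route (the route ending one hop earlier) was probed as well. The drop rate of a hop is then estimated from the success rate of the route relative to the success rate of its prefix, so a port that merely sits behind a flaky link is not blamed.

```python
from collections import defaultdict

def hop_drop_rates(probes, min_samples=3):
    """Estimate a drop rate per link from directed-route probe results.

    probes: iterable of (route, responded); route is a sequence of link
    labels ordered from the SM outward (labels are hypothetical).
    """
    sent = defaultdict(int)
    good = defaultdict(int)
    for route, responded in probes:
        path = tuple(route)
        sent[path] += 1
        good[path] += bool(responded)

    def success(path):
        if not path:                       # empty route: the SM itself
            return 1.0
        if sent.get(path, 0) < min_samples:
            return None                    # too few samples to judge
        return good[path] / sent[path]

    samples = defaultdict(list)
    for path in sent:
        s_here = success(path)
        s_prev = success(path[:-1])
        if s_here is None or not s_prev:   # prefix unknown or fully dead
            continue
        # Share of MADs that survived the prefix but were lost on the last hop.
        samples[path[-1]].append(max(0.0, 1.0 - s_here / s_prev))
    return {link: sum(v) / len(v) for link, v in samples.items()}

def suspect_links(rates, threshold=0.5):
    """Links whose estimated drop rate is at or above the threshold."""
    return sorted(link for link, rate in rates.items() if rate >= threshold)

# Hypothetical fabric: the HCA is reachable through sw2 and through sw3,
# and only the sw1-sw2 link is flaky.
via_sw2 = ("sm-sw1", "sw1-sw2")
via_sw3 = ("sm-sw1", "sw1-sw3")
probes = (
    [(("sm-sw1",), True)] * 4
    + [(via_sw2, True)] * 2 + [(via_sw2, False)] * 6
    + [(via_sw3, True)] * 4
    + [(via_sw2 + ("sw2-hca",), True)] * 2
    + [(via_sw2 + ("sw2-hca",), False)] * 6
    + [(via_sw3 + ("sw3-hca",), True)] * 4
)
print(suspect_links(hop_drop_rates(probes)))  # ['sw1-sw2']
```

In this made-up data the HCA fails to answer 6 of 8 MADs sent through sw2, yet only the sw1-sw2 link is reported, and the HCA stays reachable through sw3. A real implementation would also need to age the counters and pick thresholds from experience.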
Eitan Zahavi
Design Technology Director
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL
> -----Original Message-----
> From: Hal Rosenstock [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, April 13, 2005 12:00 PM
> To: Eitan Zahavi
> Cc: [email protected]
> Subject: RE: [openib-general] SM Bad Port Handling
>
> On Wed, 2005-04-13 at 01:28, Eitan Zahavi wrote:
> > [EZ] This is true. Currently there is only one cause for the
> > un-healthy bits to be set - which are exactly as you point - these
> > traps. The point I was trying to make was that this bit is the
> > mechanism for flagging a port status is bad.
> >
> > What I did recommend was to write a "statistical" analysis of Directed
> > Route packet drop - such that we can find the ports with a high drop
> > rate and mark them as un-healthy. If you mark every port that does not
> > respond to a MAD as un-healthy you can suffer from flaky links
> > somewhere on the route to that port. Only analysis of the number of
> > good packets vs. dropped packets can lead you to the right bad port.
>
> The original proposal on this said the following:
>
> "The OpenSM will implement a configurable policy (some number of
> consecutive lack of responses to SM requests). At the point of
> exhaustion of the timeout/retry strategy, that port will be marked as
> "bad" by OpenSM."
>
> Any idea on what might make a good default threshold (for consecutive
> retries) ? Do you think there is no sufficient default ?
>
> If a link is flaky and MADs can't get through, should it be used for non
> MAD traffic ?
>
> Also note that the proposal also said:
>
> "Also, there could also be a periodic "ping" at a slower rate to check
> if the "bad" ports revive."
>
> In terms of analysis of good v. errored and dropped packets (along the
> path to that node), there are OpenIB diagnostic tools to help with this.
>
> -- Hal
