Hal said: Should a busy be retried at all at the MAD layer? Is a "special" (longer) timeout policy for busy needed?
Also, should this be done for all MADs classified by ib_response_mad (e.g. trap represses)?

Hal,

The idea of processing BUSY responses in the MAD layer is to treat BUSY responses like timeouts, which the MAD layer already handles. Right now, various apps and ULPs treat BUSY either as a cause to retry immediately or as a permanent error. This doesn't seem to affect users of OpenSM so much because (as I understand it) OpenSM seems to discard requests when it gets too busy. For other SA/SMs, though, it can cause a major packet storm or, worse, a simple loss of connectivity where MPI jobs or kernel ULPs assume the SA is broken because they got a BUSY reply. By treating the BUSY reply as a timeout, we're actually simplifying matters by fitting into existing practice.

As for needing a longer timeout: in our old proprietary stack, QLogic did have a longer timeout for retrying busy replies than for normal timeouts. But we should try to get this in now so we can get some relief before we begin the long-term discussion of the best way to handle this issue overall.
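To make the intent concrete, here is a rough, standalone sketch of one way "treat BUSY like a timeout" could behave. It is not the actual patch or the kernel's MAD code: every name below (the sketch_ prefix, the enum, the struct) is hypothetical, and it assumes the status word has already been converted to host byte order with the busy flag in the low bit. A BUSY reply is dropped and the request stays pending, so the normal timeout path decides whether to resend or give up.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical constant for this sketch: busy flag in the low bit of the
 * MAD status word, assumed to be in host byte order already. */
#define SKETCH_MAD_STATUS_BUSY 0x0001

/* What the (hypothetical) MAD layer does with a pending request. */
enum sketch_action {
    SKETCH_COMPLETE,     /* pass the response up to the client */
    SKETCH_WAIT_RETRY,   /* keep the request pending; it is or will be resent */
    SKETCH_FAIL_TIMEOUT  /* retries exhausted; report a timeout to the client */
};

/* Hypothetical per-request state: only the retry budget matters here. */
struct sketch_send {
    int retries_left;
};

static bool sketch_is_busy(uint16_t status)
{
    return (status & SKETCH_MAD_STATUS_BUSY) != 0;
}

/* A response arrived for a pending request. A BUSY reply is discarded and
 * the request is left on the wait list, exactly as if nothing had arrived,
 * so the client never sees BUSY. Anything else completes the request. */
static enum sketch_action sketch_on_response(uint16_t status)
{
    if (sketch_is_busy(status))
        return SKETCH_WAIT_RETRY;
    return SKETCH_COMPLETE;
}

/* The request's timer expired: resend while retries remain, otherwise
 * fail with a timeout. This is the existing-style timeout handling that
 * a discarded BUSY reply falls through to. */
static enum sketch_action sketch_on_timeout(struct sketch_send *send)
{
    if (send->retries_left > 0) {
        send->retries_left--;
        return SKETCH_WAIT_RETRY;
    }
    return SKETCH_FAIL_TIMEOUT;
}
```

With this shape, a client that asked for N retries sees at most a timeout after N resends, whether the SA was silent or busy. A separate, longer backoff for BUSY would be a change inside sketch_on_response and is deliberately left out here.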
