Hal said:
Should a busy be retried at all at the mad layer ? Is a "special" (longer) 
timeout policy for busy needed ?

Also, should this be done for all MADs classified by ib_response_mad (e.g. trap 
represses) ?

Hal, 

The idea of processing BUSY responses in the MAD layer is to treat BUSY 
responses like timeouts - which are currently handled by the MAD layer. Right now there 
is an issue where various apps and ULPs either treat BUSY as a cause to 
immediately retry or as a permanent error. This doesn't seem to affect users of 
the OpenSM so much because (as I understand it) the OpenSM seems to discard 
requests when it gets too busy - but for other SA/SMs, it can cause a major 
packet storm or, worse, a simple loss of connectivity where MPI jobs or kernel 
ULPs simply assume the SA is broken because they got a BUSY reply.

By treating the BUSY reply as a timeout, we're actually simplifying matters by 
fitting into existing practice.
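
To make the idea concrete, here is a rough sketch in C of what "treat BUSY 
like a timeout" could look like. It is not the actual MAD layer code: every 
name in it (sketch_send, sketch_on_response, the SKETCH_* constants) is made 
up for illustration, and it assumes the status word has already been 
converted to host byte order. The only thing taken from the IB spec is that 
the low bit of the MAD status field means busy.

```c
#include <stdint.h>

/* Hypothetical names for illustration; not the kernel's MAD API. */
#define SKETCH_MAD_STATUS_BUSY 0x0001u /* busy bit of the MAD status field */

enum sketch_action {
    SKETCH_DELIVER,      /* hand the response to the client as usual */
    SKETCH_RETRY_LATER,  /* drop the response, let the timeout path resend */
    SKETCH_FAIL_TIMEOUT  /* retries exhausted, report a timeout to the client */
};

/* Assumed per-request state: only the remaining retry budget matters here. */
struct sketch_send {
    int retries_left;
};

/*
 * Decide what to do with a response whose status word is 'status'
 * (host byte order). A BUSY response is handled exactly like a request
 * that timed out: it consumes one retry, or fails once none are left.
 */
static enum sketch_action sketch_on_response(struct sketch_send *s,
                                             uint16_t status)
{
    if (!(status & SKETCH_MAD_STATUS_BUSY))
        return SKETCH_DELIVER;

    if (s->retries_left <= 0)
        return SKETCH_FAIL_TIMEOUT;

    s->retries_left--;
    return SKETCH_RETRY_LATER;
}
```

With this shape the client never sees the BUSY status at all; it sees either 
a normal response or the same timeout error it already has to handle.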

As for needing a longer timeout - in our old proprietary stack, QLogic did have 
a longer timeout for retrying busy replies than for normal timeouts - but we 
should try to get this in now so we can get some relief before we begin the 
long-term discussion of the best way to handle this issue overall.
