> RDMA_CM_EVENT_UNREACHABLE is indicated when there are timeouts in
> underlying CM protocol exchange. I suspect that the server is really
> busy and doesn't respond to the low level CM MADs in a timely manner.
> RDMA CM (and other kernel ULPs like IPoIB and SRP) use hard coded local
> and remote response timeouts of 20 which is ~4.3 sec. This was discussed
> back in 2006 in
> http://comments.gmane.org/gmane.linux.drivers.openib/27664. In this
> scenario, the response took more than 30 seconds. More recently, there
> was a proposal to base RDMA CM response timeout on subnet timeout
> (http://permalink.gmane.org/gmane.linux.drivers.rdma/19969).

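
For reference, the "20 which is ~4.3 sec" figure follows from the IB timeout encoding, where a timeout value t stands for 4.096 usec * 2^t. Below is a minimal sketch of that arithmetic. The function and constant names are illustrative only and are not taken from the kernel or librdmacm, and only a single response timeout is modeled (no retries, no MRA extension).

```python
# Illustrative arithmetic only; these names are hypothetical helpers,
# not kernel or librdmacm identifiers.

IB_TIMEOUT_BASE_SEC = 4.096e-6  # base unit of the IB timeout encoding


def ib_timeout_seconds(exponent):
    """Convert an encoded IB timeout value to seconds: 4.096 usec * 2**exponent."""
    return IB_TIMEOUT_BASE_SEC * (2 ** exponent)


def min_exponent_for(seconds):
    """Return the smallest encoded value whose timeout is at least `seconds`."""
    exponent = 0
    while ib_timeout_seconds(exponent) < seconds:
        exponent += 1
    return exponent


if __name__ == "__main__":
    # The hard coded value of 20 quoted above.
    print(round(ib_timeout_seconds(20), 3))  # 4.295
    # Smallest encoded value that would cover a 30 second response delay.
    print(min_exponent_for(30))  # 23
```

Under this encoding a single response timeout of 20 cannot cover a 30 second delay on its own; an encoded value of 23 (about 34.4 sec) would be the first to do so.
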
Hal's assessment seems likely. Error code -110 is ETIMEDOUT.

However, the IB CM timeout when used through the RDMA CM should be much larger, as it makes use of the CM MRA protocol. Unless a lot of MADs are being lost, or I'm not remembering the RDMA CM code correctly, there's still an issue here that I'm not understanding.

- Sean
