Krishnamoorthy, Sriram wrote:

Can someone please explain what can cause IBV_WC_RETRY_EXC_ERR? I am using a combination of send-receive and RDMA. I have the reliable connection queue pairs initialized as:

IBV_WC_RETRY_EXC_ERR means that the receiver did not send any ack within 4.096 * 2^18 * 7 usec (roughly 7.5 seconds with your settings).
It can happen for several reasons:
1) bad QP attributes
2) the remote side doesn't exist or is in a bad state
3) rarely, congestion in the network can cause this too
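
To make the arithmetic above concrete, here is a minimal sketch in plain C. The helper names ack_timeout_usec() and retry_exceeded_usec() are made up for illustration and are not part of libibverbs. It assumes the usual encoding of qp_attr.timeout (4.096 usec * 2^timeout, with 0 meaning "wait forever") and uses the same "timeout times retry_cnt" estimate as above; the hardware may wait longer, so treat the result as approximate.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: per-attempt ACK timeout in microseconds for a given
 * qp_attr.timeout value, i.e. 4.096 usec * 2^timeout. The field is 5 bits
 * wide (0..31). The value 0 means "wait forever" and is reported as 0.0. */
static double ack_timeout_usec(uint8_t timeout)
{
    assert(timeout <= 31);
    if (timeout == 0)
        return 0.0;
    return 4.096 * (double)(1ULL << timeout);
}

/* Hypothetical helper: approximate time before the retry-exceeded error,
 * estimated as the per-attempt timeout multiplied by retry_cnt (0..7). */
static double retry_exceeded_usec(uint8_t timeout, uint8_t retry_cnt)
{
    assert(retry_cnt <= 7);
    return ack_timeout_usec(timeout) * retry_cnt;
}
```

With timeout = 18 and retry_cnt = 7, ack_timeout_usec(18) is a bit over one second per attempt and retry_exceeded_usec(18, 7) is about 7.5 seconds.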

    qp_attr.timeout             = 18;
    qp_attr.retry_cnt           = 7;
    qp_attr.rnr_retry           = 7;

From the documentation, I assumed a value of 7 meant infinite retry. Can lack of receive buffers cause this error? I understand IBV_WC_RNR_RETRY_EXC_ERR to be the error caused by lack of receive buffers. Could it be congestion in the network?

7 means infinite retry only for the RNR flow (rnr_retry); for the retry flow (retry_cnt), 7 is the number of retransmissions.
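
The difference between the two fields can be written down as a small sketch. The function names below are hypothetical, not a verbs API, and -1 is just a convention chosen here to mean "unlimited". Both fields are 3 bits wide, so valid values are 0..7.

```c
#include <assert.h>

/* Hypothetical helper: number of retransmissions allowed by qp_attr.retry_cnt.
 * Here 7 is an ordinary count, not "infinite". */
static int max_transport_retries(unsigned retry_cnt)
{
    assert(retry_cnt <= 7);
    return (int)retry_cnt;
}

/* Hypothetical helper: number of retries allowed by qp_attr.rnr_retry.
 * Here 7 is the special value meaning "retry forever", reported as -1. */
static int max_rnr_retries(unsigned rnr_retry)
{
    assert(rnr_retry <= 7);
    return rnr_retry == 7 ? -1 : (int)rnr_retry;
}
```

So with the settings quoted above, a missing receive buffer is retried indefinitely, while a missing ack gives up after 7 retransmissions.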

I could not find much from earlier queries related to this error. It often occurs in the middle of the computation on large (>=1024) processor counts, when I try to have multiple outstanding send-recvs between a pair of processes (each pair of processes has an RC queue pair initialized). I do not have a small test case yet that can reproduce this error.


How do you connect both sides?
Maybe the sender is sending messages to a QP that hasn't been moved to (at least) the RTR state?

Dotan
