Sorry for the late reply.

2010/6/12 Dotan Barak <[email protected]>:
> On 12/06/2010 03:22, Ding Dinghua wrote:
>>
>> 2010/6/11 Dotan Barak<[email protected]>:
>>
>>> Hi.
>>>
>>> On 11/06/2010 10:51, Ding Dinghua wrote:
>>>
>>>> Hi all:
>>>> I'm using RDMA to do fs-metadata mirror between nodes. I
>>>> encountered a strange problem when the program was running:
>>>> Complete queue handler reported that the RDMA-Write operation failed,
>>>> the status of corresponding "struct ib_wc" is "IB_WC_RETRY_EXC_ERR".
>>>> The problem is encountered randomly. I don't know the meaning of this
>>>> error code as well as what to do next. Would anyone give me some tips?
>>>> thanks a lot.
>>>>
>>>
>>> Do you sync between the sides before closing the QPs?
>>>
>>
>> Can you say it more detail? thanks.
>>
>
> If you try to send a message from local QP to a remote QP before the remote
> QP is in RTR state (or after it was closed/transferred to the ERROR state),
> you may get RETRY EXCEEDED, because there isn't any QP in the remote side
> that can accept your message (and send a response).
>
> How do you connect the QPs? (And how do you close the connection between
> them)
>

On the connecting side I call rdma_create_id to create an RDMA CM ID,
resolve the remote address, resolve the route, then set up the QP and
call rdma_connect to establish the connection. The thread then sleeps on
a wait queue until either the accept reply or an error reply arrives.

On the remote node, the listening CM ID receives the connect request,
sets up the QP, allocates and maps pages to build the RDMA-Write space,
and calls rdma_accept to reply to the request.
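To make the ordering rule from Dotan's reply concrete, here is a minimal C sketch of the two checks it implies: only post an RDMA Write while the remote QP can acknowledge it, and only tear the QP down once both sides have agreed they are finished. Everything in it (the enum, the struct and both helper names) is hypothetical and invented for illustration; none of it is part of the kernel verbs or RDMA CM API, and it is not the code used in the setup described above.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the remote QP states that matter here.
 * These are NOT the kernel's enum ib_qp_state values. */
enum peer_qp_state {
    PEER_QP_NOT_READY,  /* remote QP exists but has not reached RTR yet */
    PEER_QP_READY,      /* remote QP is in RTR or RTS and can send ACKs */
    PEER_QP_GONE        /* remote QP was destroyed or moved to ERROR    */
};

/* Hypothetical per-connection bookkeeping kept by the local side. */
struct link_state {
    enum peer_qp_state peer;  /* what we currently know about the remote QP */
    unsigned outstanding;     /* work requests posted but not yet completed */
    bool peer_done;           /* peer has told us it will send nothing more */
};

/* Posting is only safe while the remote QP can answer; otherwise the
 * local retries run out and the completion reports a retry-exceeded error. */
static bool may_post_write(const struct link_state *l)
{
    return l->peer == PEER_QP_READY;
}

/* "Sync before closing": destroy the QP only when nothing is in flight
 * locally and the peer has confirmed that it is finished as well. */
static bool may_destroy_qp(const struct link_state *l)
{
    return l->outstanding == 0 && l->peer_done;
}
```

How the peer_done flag gets set is left open here; it could be a final message on the same connection or an out-of-band exchange.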
Some other information that may be useful:

1. All the "RETRY EXCEEDED" failures happened when there were two
   connections using RDMA-Write to transfer data, and the later of the
   two connections was very likely to run into the problem.
2. All the "RETRY EXCEEDED" failures happened when the RDMA-Write space
   was 256MB per connection (that is, 512MB of memory for the two
   connections). When the RDMA-Write space was 64MB, the problem never
   happened in our tests. The remote node has 2GB of memory in total.

Thanks a lot.

> Dotan
>

--
Ding Dinghua
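For scale, the sizes in point 2 can be turned into page counts and a share of the remote node's memory. The sketch below assumes 4 KiB pages and reads MB and GB as binary units; neither is stated in the message, and the helper names are made up for illustration. It only shows the arithmetic and says nothing about whether memory pressure is the cause.

```c
#include <stdint.h>

#define MIB (1024ULL * 1024ULL)

/* Number of pages needed to back a region, rounded up.
 * page_bytes is an assumption supplied by the caller (e.g. 4096). */
static uint64_t pages_for_region(uint64_t region_bytes, uint64_t page_bytes)
{
    return (region_bytes + page_bytes - 1) / page_bytes;
}

/* Share of total memory used by n regions of equal size,
 * in whole percent (rounded down). */
static unsigned percent_of_memory(uint64_t region_bytes, unsigned n,
                                  uint64_t total_bytes)
{
    return (unsigned)(region_bytes * n * 100 / total_bytes);
}
```

With these assumptions, pages_for_region(256 * MIB, 4096) is 65536 pages per connection and pages_for_region(64 * MIB, 4096) is 16384. Two 256MB regions take 25% of a 2GB node, while two 64MB regions take about 6%.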
