legionxiong commented on PR #3145:
URL: https://github.com/apache/brpc/pull/3145#issuecomment-3544671212

   > We encountered an error: requests were not being sent, causing a large 
number of client timeouts.
   > 
   > ```shell
   >  [E1008]Reached timeout=60000ms @Socket{id=13 fd=1160 addr=xxx:xx} 
(0x0x7f957c964ec0) rdma info={rdma_state=ON, handshake_state=ESTABLISHED, 
rdma_remote_rq_window_size=63, rdma_sq_window_size=0, 
rdma_local_window_capacity=125, rdma_remote_window_capacity=125, 
rdma_sbuf_head=57, rdma_sbuf_tail=120, rdma_rbuf_head=36, rdma_unacked_rq_wr=0, 
rdma_received_ack=0, rdma_unsolicited_sent=0, rdma_unsignaled_sq_wr=1, 
rdma_new_rq_wrs=0, }
   > ```
   > 
   > From the RDMA connection information, we found that because 
`ibv_req_notify_cq` was only solicited, send WCs did not generate CQEs. 
Without recv CQEs, send WCs could not be polled, so `_sq_window_size` remained 
at 0. This is likely the reason why neither the client nor the server was able 
to send messages.
   > 
   > Using `ibv_req_notify_cq` with `solicited_only=0` could solve this 
problem, but it would generate too many events. Therefore, we split the CQ into 
`send_cq`(`solicited_only=0`) and `recv_cq`(`solicited_only=1`).
   
   It is unnecessary to split the CQ into send_cq and recv_cq. The sliding 
window goes wrong because the precondition it relies on is not guaranteed in a 
RoCE environment. The sliding window mechanism presumes that when the local app 
receives an ack from the remote app, the underlying SQ entry has already been 
released. In a lossless IB environment this holds, but in a RoCE environment it 
is not guaranteed. The local side releases an SQ entry only after it receives 
the device-level ack for the message from the remote IB device. That ack can be 
lost even though the remote side has already processed the message, which means 
the remote app has received the message and sent an application-layer ack back 
to the local side. As a result, the local app layer has received the remote 
app's ack for the message, while the device layer is still waiting for the 
remote device's ack before it can release the SQ entry.
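
To make the race concrete, here is a minimal toy model. It is a sketch under 
stated assumptions, not brpc code: `ToySender`, `PostResult`, `on_app_ack`, 
`on_device_ack` and `post_after_app_ack_only` are hypothetical names invented 
for illustration, and no verbs calls are made. It only shows that a window 
credit returned by an application-layer ack does not imply a free SQ slot.

```cpp
#include <cstddef>

// Toy model (hypothetical, not brpc code): application-level window credits
// and device SQ slots are tracked separately so their divergence is visible.
struct ToySender {
    enum class PostResult { kPosted, kNoCredit, kSqFull };

    std::size_t sq_capacity;       // total slots in the device send queue
    std::size_t sq_in_flight = 0;  // WRs posted, device-level ack not yet seen
    std::size_t window_credits;    // sends the application believes it may post

    explicit ToySender(std::size_t capacity)
        : sq_capacity(capacity), window_credits(capacity) {}

    // Try to post one send. The sliding window only checks credits; the SQ
    // check stands in for the device refusing a WR when its queue is full.
    PostResult post_send() {
        if (window_credits == 0) return PostResult::kNoCredit;
        if (sq_in_flight == sq_capacity) return PostResult::kSqFull;
        --window_credits;
        ++sq_in_flight;
        return PostResult::kPosted;
    }

    // Application-layer ack from the remote app: returns one window credit.
    void on_app_ack() { ++window_credits; }

    // Device-level ack from the remote device: frees one SQ slot.
    void on_device_ack() {
        if (sq_in_flight > 0) --sq_in_flight;
    }
};

// Fill the SQ, then deliver an application-layer ack while the device-level
// ack is still outstanding (for example, lost and not yet retransmitted).
inline ToySender::PostResult post_after_app_ack_only() {
    ToySender sender(2);
    sender.post_send();
    sender.post_send();
    sender.on_app_ack();        // the window now has one credit again
    return sender.post_send();  // but the SQ still holds two unreleased WRs
}
```

In this model `post_after_app_ack_only()` returns `PostResult::kSqFull`: the 
window has credit, yet the SQ has no free slot until `on_device_ack()` runs.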


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

