legionxiong commented on PR #3145:
URL: https://github.com/apache/brpc/pull/3145#issuecomment-3544671212
> We encountered an error: requests were not being sent, causing a large
number of client timeouts.
>
> ```shell
> [E1008]Reached timeout=60000ms @Socket{id=13 fd=1160 addr=xxx:xx}
(0x0x7f957c964ec0) rdma info={rdma_state=ON, handshake_state=ESTABLISHED,
rdma_remote_rq_window_size=63, rdma_sq_window_size=0,
rdma_local_window_capacity=125, rdma_remote_window_capacity=125,
rdma_sbuf_head=57, rdma_sbuf_tail=120, rdma_rbuf_head=36, rdma_unacked_rq_wr=0,
rdma_received_ack=0, rdma_unsolicited_sent=0, rdma_unsignaled_sq_wr=1,
rdma_new_rq_wrs=0, }
> ```
>
> From the RDMA connection information, we found that because
`ibv_req_notify_cq` was only solicited, send WCs did not generate CQEs.
Without recv CQEs, send WCs could not be polled, so `_sq_window_size` remained
at 0. This is likely the reason why both the client and server are unable to
send messages.
>
> Using `ibv_req_notify_cq` with `solicited_only=0` could solve this
problem, but it would generate too many events. Therefore, we split the CQ into
`send_cq`(`solicited_only=0`) and `recv_cq`(`solicited_only=1`).
Splitting the CQ into send_cq and recv_cq is unnecessary. The sliding window
goes wrong because the precondition it relies on is not guaranteed in a RoCE
environment. The sliding window mechanism presumes that once the local app
receives an ack from the remote app, the underlying SQ entry has already been
released. In a lossless IB environment that holds, but in a RoCE environment it
is not guaranteed. The local side releases an SQ entry only after it receives
the ack for that message from the remote IB device, and that ack can be lost
even though the remote side has already processed the message, that is, the
remote app has received the message and sent an application-layer ack back to
the local side. The local app layer then sees the ack from the remote app
while the device layer is still waiting for the ack from the remote device
before it can release the SQ entry.
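
To make the mismatch concrete, here is a minimal sketch of the two views
described above. It is a hypothetical model, not brpc's `RdmaEndpoint` code:
`SendWindowModel` and all of its members are made-up names for illustration.
It keeps one counter for app-layer acks and one for device-level send
completions, and shows that a window credited only by app-layer acks can
report free slots while the SQ is still full.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical model (not brpc code): two independent views of how many
// sends are still outstanding on one connection.
struct SendWindowModel {
    uint32_t sq_capacity;      // number of send queue entries on the device
    uint32_t app_unacked = 0;  // messages sent, app-layer ack not yet seen
    uint32_t sq_in_use = 0;    // SQ entries posted, send completion not yet polled

    explicit SendWindowModel(uint32_t capacity) : sq_capacity(capacity) {}

    // Window as seen by a sender that trusts app-layer acks only.
    uint32_t AppWindow() const { return sq_capacity - app_unacked; }

    // SQ entries the device can actually accept right now.
    uint32_t FreeSqEntries() const { return sq_capacity - sq_in_use; }

    // Conservative window: bounded by both views.
    uint32_t SafeWindow() const {
        return std::min(AppWindow(), FreeSqEntries());
    }

    // Post one message if the conservative window allows it.
    bool PostSend() {
        if (SafeWindow() == 0) return false;
        ++app_unacked;
        ++sq_in_use;
        return true;
    }

    // The remote app acknowledged one message.
    void OnAppAck() {
        assert(app_unacked > 0);
        --app_unacked;
    }

    // The local device reported one send completion.
    void OnSendCompletion() {
        assert(sq_in_use > 0);
        --sq_in_use;
    }
};
```

With a capacity of 2, posting two messages and then receiving both app-layer
acks leaves `AppWindow()` at 2 while `FreeSqEntries()` is still 0, which is the
situation described above. Bounding the window by both counters is only one
possible way to handle it; the actual fix in brpc may differ.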
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]