I have a test program that does RDMA reads and writes as follows:
node A: server listens and handles connection requests
        sets up a piece of memory initialized to "0"
node B: two processes, parent & child
  child:
    1. set up a new channel with server, including a CQ with 1024 entries
       (ibv_create_cq(ctx, 1024, NULL, channel, 0);)
    2. RDMA sequential write (8192 bytes at a time) to server memory
    4. sync with parent
  parent:
    1. set up the new channel with server, including a CQ with 1024 entries
       (ibv_create_cq(ctx, 1024, NULL, channel, 0);)
    3. RDMA sequential read (8192 bytes at a time) of the same piece of
       memory from server
       - check the buffer contents.
       - if memory content is still zero, re-read
    4. sync with child
The parent hangs (but child finishes its write) after the following
pops up in /var/log/messages:
mlx4_core 0000:06:00.0: CQ overrun on CQN 000087
I have my own counters that restrict the read (and write) to 512 max.
Both write and read are blocking (i.e. the CQ is polled after each
read/write). I suspect I do not have the CQ poll logic correct. The
question here is: is there any diagnostic tool available to check the
internal counters (and/or states) of the ibverbs library and/or kernel
drivers (to help debug RDMA applications)? In my case, it hangs
around block 14546 (i.e. after 14546*8192 bytes).
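
To make the CQ accounting question concrete, here is a toy model in
plain C. It makes no verbs calls, it is not the code from the test
program, and it is not how mlx4 implements its CQ; all names (toy_cq,
toy_run, skip_every) are made up for illustration. It only assumes that
each completed operation adds one CQ entry and each successful poll
removes one, and it shows that a queue can fill up slowly if some
completions are never reaped:

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy completion queue: counts entries only, no real hardware or verbs. */
struct toy_cq {
    size_t capacity; /* e.g. 1024, the cqe argument of ibv_create_cq() */
    size_t pending;  /* completions generated but not yet polled */
    bool overrun;    /* set when a completion arrives with the queue full */
};

static void toy_cq_init(struct toy_cq *cq, size_t capacity)
{
    cq->capacity = capacity;
    cq->pending = 0;
    cq->overrun = false;
}

/* "Hardware" side: one operation completes and needs one CQ entry. */
static void toy_cq_complete(struct toy_cq *cq)
{
    if (cq->pending == cq->capacity) {
        cq->overrun = true; /* no free entry left */
        return;
    }
    cq->pending++;
}

/* Application side: reap up to max entries; returns how many were reaped. */
static size_t toy_cq_poll(struct toy_cq *cq, size_t max)
{
    size_t n = cq->pending < max ? cq->pending : max;
    cq->pending -= n;
    return n;
}

/*
 * Run ops operations one at a time. Every skip_every-th operation is
 * left unpolled (skip_every == 0 means every completion is polled).
 * Returns the 1-based index of the operation that hits the overrun,
 * or 0 if no overrun occurs.
 */
static size_t toy_run(size_t capacity, size_t ops, size_t skip_every)
{
    struct toy_cq cq;
    toy_cq_init(&cq, capacity);

    for (size_t i = 1; i <= ops; i++) {
        toy_cq_complete(&cq);
        if (cq.overrun)
            return i;
        if (skip_every == 0 || i % skip_every != 0)
            toy_cq_poll(&cq, 1);
    }
    return 0;
}
```

In this model, toy_run(1024, 20000, 0) returns 0 (every completion is
polled, so the queue never holds more than one entry), while
toy_run(4, 100, 2) returns 9: leaked entries accumulate until the queue
is full. Whether anything like this happens in the real program is
exactly what I would like a diagnostic tool to tell me.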
Thanks,
Wendy