I know there is a bug with "ib_send_bw -b" (bi-directional):
it doesn't create a CQ large enough to hold completions for all the
posted sends *and* receives. I have tried several times to get the
following patch applied, but I never got a reply and nothing was
done.
diff --git a/send_bw.c b/send_bw.c
index ddd2b73..e3f644a 100644
--- a/send_bw.c
+++ b/send_bw.c
@@ -746,6 +746,8 @@ static struct pingpong_context *pp_init_ctx(struct ibv_device *ib_dev,
 	if (user_parm->use_mcg && !user_parm->servername) {
 		cq_rx_depth *= user_parm->num_of_clients_mcg;
 	}
+	if (user_parm->duplex)
+		cq_rx_depth += ctx->tx_depth;
 	ctx->cq = ibv_create_cq(ctx->context, cq_rx_depth, NULL, ctx->channel, 0);
 	if (!ctx->cq) {
 		fprintf(stderr, "Couldn't create CQ\n");
There should be enough CQEs in the normal (uni-directional) case, though.
On Fri, 2010-08-13 at 11:44 -0700, Sumeet Lahorani wrote:
> Hi,
>
> If I run ib_send_bw with the -a option, we seem to be getting CQ overrun
> errors.
>
> Server :
> [r...@dscbad01 ~]# ib_send_bw
> ------------------------------------------------------------------
> Send BW Test
> Connection type : RC
> Inline data is used up to 1 bytes message
> local address: LID 0x24, QPN 0x1c004c, PSN 0x85c292
> remote address: LID 0x2a, QPN 0x14004a, PSN 0x858358
> Mtu : 2048
> ------------------------------------------------------------------
> #bytes #iterations BW peak[MB/sec] BW average[MB/sec]
> ------------------------------------------------------------------
>
> Client :
> [r...@dscbad03 ~]# ib_send_bw -a dscbad01
> ------------------------------------------------------------------
> Send BW Test
> Connection type : RC
> Inline data is used up to 1 bytes message
> local address: LID 0x2a, QPN 0x14004a, PSN 0x858358
> remote address: LID 0x24, QPN 0x1c004c, PSN 0x85c292
> Mtu : 2048
> ------------------------------------------------------------------
> #bytes #iterations BW peak[MB/sec] BW average[MB/sec]
> 2 1000 5.99 5.45
> Completion wth error at client:
> Failed status 12: wr_id 1 syndrom 0x81
> scnt=600, ccnt=300
>
> and on the client console
>
> mlx4_core 0000:13:00.0: CQ overrun on CQN 000086
> mlx4_core 0000:13:00.0: Internal error detected:
> mlx4_core 0000:13:00.0: buf[00]: 00328f6f
> mlx4_core 0000:13:00.0: buf[01]: 00000000
> mlx4_core 0000:13:00.0: buf[02]: 20070000
> mlx4_core 0000:13:00.0: buf[03]: 00000000
> mlx4_core 0000:13:00.0: buf[04]: 00328f3c
> mlx4_core 0000:13:00.0: buf[05]: 0014004a
> mlx4_core 0000:13:00.0: buf[06]: 00340000
> mlx4_core 0000:13:00.0: buf[07]: 00000044
> mlx4_core 0000:13:00.0: buf[08]: 00000804
> mlx4_core 0000:13:00.0: buf[09]: 00000804
> mlx4_core 0000:13:00.0: buf[0a]: 00000000
> mlx4_core 0000:13:00.0: buf[0b]: 00000000
> mlx4_core 0000:13:00.0: buf[0c]: 00000000
> mlx4_core 0000:13:00.0: buf[0d]: 00000000
> mlx4_core 0000:13:00.0: buf[0e]: 00000000
> mlx4_core 0000:13:00.0: buf[0f]: 00000000
>
> This is with OFED 1.5.1 but it also happens with OFED 1.4.2. Sometimes,
> the node crashes because it runs out of memory but most of the time, I
> see just the above errors. What could be wrong?
>
> - Sumeet
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>