Hi,

When I run ib_send_bw with the -a option, I get CQ overrun errors.

Server :
[r...@dscbad01 ~]# ib_send_bw
------------------------------------------------------------------
                   Send BW Test
Connection type : RC
Inline data is used up to 1 bytes message
 local address:  LID 0x24, QPN 0x1c004c, PSN 0x85c292
 remote address: LID 0x2a, QPN 0x14004a, PSN 0x858358
Mtu : 2048
------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec]
------------------------------------------------------------------

Client :
[r...@dscbad03 ~]# ib_send_bw -a dscbad01
------------------------------------------------------------------
                   Send BW Test
Connection type : RC
Inline data is used up to 1 bytes message
 local address:  LID 0x2a, QPN 0x14004a, PSN 0x858358
 remote address: LID 0x24, QPN 0x1c004c, PSN 0x85c292
Mtu : 2048
------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec]
2 1000 5.99 5.45
Completion wth error at client:
Failed status 12: wr_id 1 syndrom 0x81
scnt=600, ccnt=300

And on the client console:

mlx4_core 0000:13:00.0: CQ overrun on CQN 000086
mlx4_core 0000:13:00.0: Internal error detected:
mlx4_core 0000:13:00.0:   buf[00]: 00328f6f
mlx4_core 0000:13:00.0:   buf[01]: 00000000
mlx4_core 0000:13:00.0:   buf[02]: 20070000
mlx4_core 0000:13:00.0:   buf[03]: 00000000
mlx4_core 0000:13:00.0:   buf[04]: 00328f3c
mlx4_core 0000:13:00.0:   buf[05]: 0014004a
mlx4_core 0000:13:00.0:   buf[06]: 00340000
mlx4_core 0000:13:00.0:   buf[07]: 00000044
mlx4_core 0000:13:00.0:   buf[08]: 00000804
mlx4_core 0000:13:00.0:   buf[09]: 00000804
mlx4_core 0000:13:00.0:   buf[0a]: 00000000
mlx4_core 0000:13:00.0:   buf[0b]: 00000000
mlx4_core 0000:13:00.0:   buf[0c]: 00000000
mlx4_core 0000:13:00.0:   buf[0d]: 00000000
mlx4_core 0000:13:00.0:   buf[0e]: 00000000
mlx4_core 0000:13:00.0:   buf[0f]: 00000000

This is with OFED 1.5.1, but it also happens with OFED 1.4.2. Sometimes the node crashes because it runs out of memory, but most of the time I see only the errors above. What could be wrong?
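
In case it helps with triage, here is a small sketch that decodes the "Failed status" line printed by the client. The name table is my reading of the order of enum ibv_wc_status in libibverbs' <infiniband/verbs.h>, so treat it as an assumption and check it against the installed headers. The helper name decode_failed_status and the regular expression are only illustrative; they are not part of perftest.

```python
import re

# Names in the order of enum ibv_wc_status from <infiniband/verbs.h>.
# Assumption: this matches the headers shipped with the OFED release in use.
IBV_WC_STATUS = [
    "IBV_WC_SUCCESS",
    "IBV_WC_LOC_LEN_ERR",
    "IBV_WC_LOC_QP_OP_ERR",
    "IBV_WC_LOC_EEC_OP_ERR",
    "IBV_WC_LOC_PROT_ERR",
    "IBV_WC_WR_FLUSH_ERR",
    "IBV_WC_MW_BIND_ERR",
    "IBV_WC_BAD_RESP_ERR",
    "IBV_WC_LOC_ACCESS_ERR",
    "IBV_WC_REM_INV_REQ_ERR",
    "IBV_WC_REM_ACCESS_ERR",
    "IBV_WC_REM_OP_ERR",
    "IBV_WC_RETRY_EXC_ERR",
    "IBV_WC_RNR_RETRY_EXC_ERR",
    "IBV_WC_LOC_RDD_VIOL_ERR",
    "IBV_WC_REM_INV_RD_REQ_ERR",
    "IBV_WC_REM_ABORT_ERR",
    "IBV_WC_INV_EECN_ERR",
    "IBV_WC_INV_EEC_STATE_ERR",
    "IBV_WC_FATAL_ERR",
    "IBV_WC_RESP_TIMEOUT_ERR",
    "IBV_WC_GENERAL_ERR",
]

# Matches the line exactly as the tool prints it (including "syndrom").
FAILED_RE = re.compile(
    r"Failed status (\d+): wr_id (\d+) syndrom (0x[0-9a-fA-F]+)"
)


def decode_failed_status(line):
    """Return the decoded fields of a 'Failed status' line, or None."""
    m = FAILED_RE.search(line)
    if m is None:
        return None
    status = int(m.group(1))
    # Fall back to a placeholder for values outside the table.
    name = IBV_WC_STATUS[status] if status < len(IBV_WC_STATUS) else "UNKNOWN"
    return {
        "status": status,
        "name": name,
        "wr_id": int(m.group(2)),
        "syndrome": int(m.group(3), 16),
    }


if __name__ == "__main__":
    print(decode_failed_status("Failed status 12: wr_id 1 syndrom 0x81"))
```

With that table, status 12 in the output above maps to IBV_WC_RETRY_EXC_ERR (transport retry counter exceeded). The vendor syndrome 0x81 is only parsed as a number here; I do not know its meaning.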

- Sumeet
