Jeff Carr wrote:
May 5 16:31:50 localhost kernel: ib_mthca 0000:09:00.0: 1a0084/0: error CQE -> QPN 1a0406, WQE @ 00000042
May 5 16:31:50 localhost kernel: [ 0] 001a0406
May 5 16:31:50 localhost kernel: [ 4] 00001aed
May 5 16:31:50 localhost kernel: [ 8] 00000004
May 5 16:31:50 localhost kernel: [ c] 00003800
May 5 16:31:50 localhost kernel: [10] 128a0000
May 5 16:31:50 localhost kernel: [14] 00000000
May 5 16:31:50 localhost kernel: [18] 00000042
May 5 16:31:50 localhost kernel: [1c] ff000000
if you up the message_count to 0x1000. I'm guessing this is just some normal overrun error though.
It's taken me a while to look at this, but I think that this is a real error.
There must also be some limit to how many cqe's you can allocate with ib_post_recv(). (?)
Cmpost is setting the CQ size too small, which can lead to the CQ overrun. The number of cqe's should have been message_count * 2, rather than just message_count. Message_count is fine on the client side, which receives all messages before sending. But on the server side, receives could begin coming in before all sends are done.
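To make the sizing argument concrete, here is a minimal sketch in C. It is not cmpost's code and does not use the verbs API: struct model_cq and all of the function names are hypothetical, and the model only counts completions that have been generated but not yet polled. It assumes the worst case described above, where every send and every receive on the server completes before anything is polled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a completion queue. It only tracks capacity and
 * the number of completions waiting to be polled; it is not struct ib_cq. */
struct model_cq {
    size_t capacity;  /* number of cqe's requested when the CQ was created */
    size_t pending;   /* completions generated but not yet polled */
    bool overrun;     /* set once a completion arrives with no free slot */
};

/* Worst case: every posted send and every posted receive completes
 * before the consumer polls anything. */
size_t cq_entries_needed(size_t send_count, size_t recv_count)
{
    return send_count + recv_count;
}

void model_cq_init(struct model_cq *cq, size_t capacity)
{
    assert(cq != NULL);
    cq->capacity = capacity;
    cq->pending = 0;
    cq->overrun = false;
}

/* One work request finishes and a cqe is written to the queue. */
void model_cq_complete(struct model_cq *cq)
{
    if (cq->pending == cq->capacity) {
        cq->overrun = true;  /* no free slot: this is the overrun case */
        return;
    }
    cq->pending++;
}

/* The consumer polls one completion, which frees its slot for reuse.
 * Returns false if there was nothing to poll. */
bool model_cq_poll(struct model_cq *cq)
{
    if (cq->pending == 0)
        return false;
    cq->pending--;
    return true;
}

/* Server-side worst case: message_count sends and message_count receives
 * all complete with no polling in between. */
bool server_overruns(size_t cq_capacity, size_t message_count)
{
    struct model_cq cq;
    model_cq_init(&cq, cq_capacity);
    for (size_t i = 0; i < 2 * message_count; i++)
        model_cq_complete(&cq);
    return cq.overrun;
}
```

In this model, server_overruns(0x1000, 0x1000) reports an overrun while server_overruns(0x2000, 0x1000) does not, which matches the message_count * 2 sizing. The real limit also depends on how promptly completions are polled, which the sketch deliberately ignores.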
OK. Wow. That makes cqe's and ib_post_recv() even more confusing then.
There must be some way to delete/free these? They don't get re-used I take it? Surely it wasn't intended that ib_post_recv() be initially run for each transfer expected in the lifetime of the connection. :)
There must also be some way to find out what each of these cqe's was for. How do we know whether one of them was used for a transfer from the server to the client, or for the client letting the server know the transfer was received?
I know that this isn't a CM question, but it is easiest to ask against this code because of its simplicity. (Simplicity is good)
Jeff
