On Mon, Nov 19, 2007 at 08:29:36PM -0800, Roland Dreier wrote:
>
> OFED 1.2 uses a separate CQ for send completions in connected mode.
> (I'm assuming you're using the OFED default of connected mode for
> IPoIB). I guess it would be useful to know which CQ is overrunning,
> ie whether it is the main IPoIB CQ or one of the CM send CQs. One way
> to check this would be to add a print to mthca to dump the CQN when a
> CQ is created, and also add prints to IPoIB just before each call to
> ib_create_cq() so that the CQNs can be correlated.
>
> Another thing you could try would be a 2.6.24-rc kernel (or an OFED
> 1.3 prerelease I guess), which has a change that moves all completions
> into one CQ in IPoIB. This may fix the bug by accident.
>
Yes, we're using CM. I dumped out the CQNs as they were created, and
generally the first non-reserved CQs get made by
ipoib_transport_dev_init() when ipoib is brought up on each port.
CQN 0x80 is used by port 0, 0x81 by port 1. The other CQs used by
IPoIB are the ones made by ipoib_cm_tx_init().

We see overruns on both types of CQ. Here's an overrun on the main
IPoIB CQ (CQN 0x80):

Dec 2 10:18:08 r6i1n8 kernel: ib0: Send unicast ARP to 0165
Dec 2 10:18:13 r6i1n8 kernel: ib1: Send unicast ARP to 016d
Dec 2 10:18:28 r6i1n8 kernel: ib0: Send unicast ARP to 0165
Dec 2 10:18:39 r6i1n8 kernel: ib0: Send unicast ARP to 010a
Dec 2 10:18:48 r6i1n8 kernel: ib0: Send unicast ARP to 0165
Dec 2 10:19:08 r6i1n8 kernel: ib0: Send unicast ARP to 0165
Dec 2 10:19:13 r6i1n8 kernel: ib1: Send unicast ARP to 016d
Dec 2 10:19:23 r6i1n8 kernel: ib0: Send unicast ARP to 016a
Dec 2 10:19:23 r6i1n8 kernel: ib_mthca 0000:06:00.0: CQ overrun on CQN 000080
Dec 2 10:19:23 r6i1n8 kernel: ib_mad: Fatal error (1) on MAD QP (1)
Dec 2 10:19:23 r6i1n8 kernel: cq_context = 0xffff8101b0ec1000
Dec 2 10:19:23 r6i1n8 kernel: flags = 0x90000900
Dec 2 10:19:23 r6i1n8 kernel: start_hi = 0x0
Dec 2 10:19:23 r6i1n8 kernel: start_lo = 0x0
Dec 2 10:19:23 r6i1n8 kernel: logsize_usrpage = 0xb000002
Dec 2 10:19:23 r6i1n8 kernel: comp_eqn = 0x1
Dec 2 10:19:23 r6i1n8 kernel: pd = 0x4
Dec 2 10:19:23 r6i1n8 kernel: lkey = 0x1300
Dec 2 10:19:23 r6i1n8 kernel: last_notified_index = 0x6972
Dec 2 10:19:23 r6i1n8 kernel: solicit_producer_index = 0x6173
Dec 2 10:19:23 r6i1n8 kernel: consumer_index = 0x0
Dec 2 10:19:23 r6i1n8 kernel: producer_index = 0x6973
Dec 2 10:19:23 r6i1n8 kernel: cqn = 0x80
Dec 2 10:19:23 r6i1n8 kernel: ci_db = 0x7fff
Dec 2 10:19:23 r6i1n8 kernel: state_db = 0x0
Dec 2 10:19:28 r6i1n8 kernel: ib0: Send unicast ARP to 0165
Dec 2 10:19:48 r6i1n8 kernel: ib0: Send unicast ARP to 0165
Dec 2 10:19:57 r6i1n8 kernel: ib_mad: Fatal error (1) on MAD QP (1)

(The CQ context table was dumped for debugging.)
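
For longer logs, a small script along these lines could tally the overrun messages per CQN and label the ones whose owner is known. This is only a sketch: the regular expression is modelled on the single ib_mthca message shown above, the function names (count_overruns, describe) and the labels are made up for illustration, and the CQN-to-port mapping is the one observed on this system and would differ elsewhere.

```python
import re
from collections import Counter

# Modelled on the message in the log above, e.g.
# "ib_mthca 0000:06:00.0: CQ overrun on CQN 000080".
# Assumption: other mthca versions may word this differently.
OVERRUN_RE = re.compile(r"ib_mthca\s+\S+:\s+CQ overrun on CQN\s+([0-9a-fA-F]+)")

# CQN -> owner, as observed on this system only (illustrative labels).
KNOWN_CQS = {
    0x80: "main IPoIB CQ, port 0",
    0x81: "main IPoIB CQ, port 1",
}


def count_overruns(lines):
    """Return a Counter mapping CQN (int) to number of overrun messages."""
    counts = Counter()
    for line in lines:
        match = OVERRUN_RE.search(line)
        if match:
            # The driver prints the CQN in hex without a 0x prefix.
            counts[int(match.group(1), 16)] += 1
    return counts


def describe(cqn):
    """Label a CQN; anything not in KNOWN_CQS is only a guess."""
    return KNOWN_CQS.get(cqn, "other CQ (possibly a CM send CQ)")
```

Feeding the lines of a saved syslog file to count_overruns() and passing each resulting CQN to describe() would show whether the overruns cluster on the main CQs or on the others.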
And there was an example of a CM send CQ overrun in the mail I just
sent to Eli (and ofa-general).

> Another thing you could try would be a 2.6.24-rc kernel (or an OFED
> 1.3 prerelease I guess), which has a change that moves all completions
> into one CQ in IPoIB. This may fix the bug by accident.

The system was upgraded to OFED 1.3-alpha2, and now it's much more
difficult to get the CQ overrun. (There are some overruns in the log
files, but I can't seem to figure out how to reproduce them - it was
much easier to get the CQ overruns with OFED 1.2 on the system.)

--
Arthur
