In an earlier email I mentioned that, with certain workloads, we are seeing an endless loop of timeouts on the IPoIB-UD send queue. Messages like "NETDEV WATCHDOG: ib0: transmit timed out" appear once a second until the driver is unloaded. That was with OFED 1.2.
Using OFED 1.3, we see what I believe is the same problem, but it looks a little different. We don't get "NETDEV WATCHDOG"; instead we get an endless string of "post_send failed". (I suspect, but haven't verified, that the difference is due to the sharing of ipoib_dev_priv's tx_outstanding member between the UD and CM IPoIB QPs; the value of tx_outstanding is used to determine when to call netif_stop_queue().)

The h/w is MT25204, with f/w version 1.2.0, on an x86_64.

I instrumented the mthca driver to maintain a circular buffer of the state of the IPoIB-UD send queue on each call to the "post_send" (mthca_arbel_post_send) and "poll_cq" (mthca_poll_one) routines, and also to dump the QP and CQ context when the full queue is detected.

At some point, we just stop getting completions on the send queue. Here are the last entries from the "poll_cq" log:

# jiffies     qpn    last  head   tail
#                    comp
  .....
0x100032cdc   0x404  0x49  0x24b  0x24a
0x100032cdc   0x404  0x4a  0x24b  0x24b
0x100033eed   0x404  0x4c  0x24e  0x24d
0x100033eed   0x404  0x4d  0x24e  0x24e
0x10003b594   0x404  0x4f  0x251  0x250
0x10003b594   0x404  0x50  0x251  0x251
0x10003c999   0x404  0x52  0x254  0x253
0x10003ca16   0x404  0x53  0x255  0x254
0x10003ca93   0x404  0x54  0x256  0x255
0x10003ca93   0x404  0x55  0x256  0x256

We keep calling the send routine (apparently via the periodic ipoib_ib_tx_timer_func()) and keep getting a "queue full" condition - the send queue length is 128. Here are some entries after the queue has filled (they keep going "forever"):

# jiffies     qpn    last  head   tail
#                    comp
  .....
0x1000760dd   0x404  0x55  0x2d6  0x256
0x1000761c6   0x404  0x55  0x2d6  0x256
0x1000761d7   0x404  0x55  0x2d6  0x256
0x1000762c0   0x404  0x55  0x2d6  0x256
0x1000762d1   0x404  0x55  0x2d6  0x256
0x1000763ba   0x404  0x55  0x2d6  0x256

And here's the QP and CQ context immediately after the first post_send failure:

QP context (including the two 32-bit "opt_param_mask" and reserved fields at the beginning):

[00] 0x00000000 0x00000000 0x30031900 0xef3e3f16
[10] 0x8b423b00 0x00000002 0x00000404 0x00000000
[20] 0x00000000 0x00000000 0x01000000 0x60000000
[30] 0x00000000 0x00000000 0x00000000 0x00000000
[40] 0x00000000 0x00000000 0x00000000 0x00000000
[50] 0x00000000 0x00000000 0x00000000 0x00000000
[60] 0x00000000 0x00000000 0x00000000 0x00000006
[70] 0x00000000 0x00002600 0xaf004000 0x00800088
[80] 0x00000256 0x00000082 0x00004000 0x00000005
[90] 0x00ffffff 0x00000257 0x00000008 0x003a277f
[a0] 0x25020200 0x00000081 0x00000000 0x00007ff9
[b0] 0x00000b1b 0x00000000 0x000003f8 0x03f80256
[c0] 0x00000000 0x00000000 0x00000000 0x00000000
[d0] 0x00000000 0x00000000 0x00000000 0x00000000
[e0] 0x00000000 0x00000000 0x00000000 0x00000000
[f0] 0x00000000 0x00000000 0x00000000 0x00000000

CQ context:

[00] 0x00000a00 0x00000000 0x00000000 0x08000002
[10] 0x00000000 0x00000001 0x00000004 0x00002500
[20] 0x000001fd 0x000001fd 0x00000000 0x00000238
[30] 0x00000082 0x00007ffa 0x00000004 0x00000000

I don't see anything obviously wrong here - anyone at Mellanox? Any idea why the card would stop generating TX completions?

-- Arthur
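
For anyone who wants to cross-check the numbers in the logs above, here is a minimal standalone sketch of the kind of bookkeeping described: a circular log of send-queue snapshots plus the "queue full" test. It is not the actual mthca instrumentation. The struct and function names (sq_snapshot, sq_log, sq_log_record, sq_outstanding, sq_is_full) and the log depth LOG_SLOTS are hypothetical, and it assumes head and tail are free-running counters, which is what the logged values (head 0x2d6 on a 128-entry queue) suggest.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SQ_SIZE   128u  /* send queue length quoted in the post */
#define LOG_SLOTS 64u   /* hypothetical log depth; must be a power of two */

/* One snapshot of send-queue state; fields mirror the log columns. */
struct sq_snapshot {
    uint64_t jiffies;
    uint32_t qpn;
    uint32_t last_comp;
    uint32_t head;
    uint32_t tail;
};

/* Circular log: the newest entries overwrite the oldest. */
struct sq_log {
    struct sq_snapshot slot[LOG_SLOTS];
    uint32_t next;      /* total snapshots recorded so far */
};

void sq_log_init(struct sq_log *log)
{
    memset(log, 0, sizeof(*log));
}

void sq_log_record(struct sq_log *log, struct sq_snapshot s)
{
    /* Masking only works when the depth is a power of two. */
    assert((LOG_SLOTS & (LOG_SLOTS - 1)) == 0);
    log->slot[log->next & (LOG_SLOTS - 1)] = s;
    log->next++;
}

/*
 * Outstanding work requests. With free-running unsigned counters
 * (an assumption here), the subtraction stays correct across wraparound.
 */
uint32_t sq_outstanding(uint32_t head, uint32_t tail)
{
    return head - tail;
}

/* Non-zero when no further work request fits in the queue. */
int sq_is_full(uint32_t head, uint32_t tail)
{
    return sq_outstanding(head, tail) >= SQ_SIZE;
}
```

With the stuck values from the second log, sq_outstanding(0x2d6, 0x256) is 0x80 = 128, so the queue is exactly full. With the last poll_cq entry (head 0x256, tail 0x256) it is 0, so the queue was fully drained just before completions stopped.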
