At a customer site running OFED 1.2, we are seeing the following: after tens of hours of stressing IPoIB, the card apparently stops generating TX completions. (These are MT25204 cards in x86_64 boxes, and we've seen this with a couple of f/w versions, including the latest.)
We get something like:

kernel: NETDEV WATCHDOG: ib0: transmit timed out
kernel: ib0: transmit timeout: latency 1972 msecs
kernel: ib0: queue stopped 1, tx_head 3271, tx_tail 3207

and that repeats "forever". To simplify things, we can reproduce this behavior in datagram mode.

As long as only datagram mode is in use, the TX code in the IPoIB driver seems quite straightforward. The only reason I can imagine that we'd fail to get a timely TX completion would be if link-level flow control were throttling us, and I'd expect that to be a transient condition...

Am I overlooking something? Has anyone seen something similar? Any suggestions for debugging?

-- 
Arthur
