On Tue, Apr 29, 2008 at 02:49:37PM -0700, Roland Dreier wrote:
> By the way, this isn't just theoretical -- I'm not smart enough to
> realize this except that I just saw:
>
> ib1: TX ring full, stopping kernel net queue
> NETDEV WATCHDOG: ib1: transmit timed out
> ib1: transmit timeout: latency 1240 msecs
> ib1: queue stopped 1, tx_head 5291313, tx_tail 5291255
>

It's very interesting to me that you mention this. I'm in the midst of
debugging a similar problem, but with IPoIB circa OFED 1.2. Found 2
problems:

1) In connected mode it's possible to get into a situation where one
   (or more) IPoIB-CM send queues fill up (no completions ever happen
   for them for some reason), while all the other CM send queues are
   empty. Of course the empty TX queues don't generate completions
   either, so nothing ever restarts the xmit queue and one bad
   connection kills IPoIB. We have had IPoIB stuck "forever" in this
   situation. Simple, brutal fix is to do ipoib_flush_paths() in
   ipoib_timeout().

2) We also see situations very similar to what you describe above.
   The IPoIB-UD send queue fills and never restarts. (Of course it's
   nothing to do with the patch that was being discussed in this
   thread, this is with OFED 1.2-rc2, and also OFED 1.2.)

I don't see how case (2) is possible with circa OFED 1.2 code. Can
anyone clue me in?

--
Arthur
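
For illustration only, here is a toy model of the stall described in
problem (1). It is not the IPoIB code: the classes, method names, and
the wake-up rule are all hypothetical simplifications. It assumes one
shared transmit queue that is stopped when any per-connection send ring
fills and is only woken from the completion path.

```python
# Toy model of problem (1): one shared transmit queue feeding several
# per-connection send rings. Not IPoIB code; all names are hypothetical.


class Conn:
    def __init__(self, ring_size, completes=True):
        self.ring_size = ring_size
        # completes=False models a connection whose sends never complete.
        self.completes = completes
        self.outstanding = 0

    def full(self):
        return self.outstanding >= self.ring_size


class Dev:
    def __init__(self, conns):
        self.conns = conns
        self.stopped = False

    def xmit(self, i):
        """Post one send on connection i; stop the shared queue if its ring fills."""
        if self.stopped:
            return False
        conn = self.conns[i]
        conn.outstanding += 1
        if conn.full():
            self.stopped = True
        return True

    def poll_completions(self):
        """Reap completions; in this model only a completion can wake the queue."""
        reaped = 0
        for conn in self.conns:
            if conn.completes and conn.outstanding:
                reaped += conn.outstanding
                conn.outstanding = 0
        if reaped and not any(conn.full() for conn in self.conns):
            self.stopped = False
        return reaped

    def timeout_flush(self):
        """Brutal fix: on a watchdog timeout, drop every connection's sends and restart."""
        for conn in self.conns:
            conn.outstanding = 0
        self.stopped = False


def demo():
    # Connection 0 never completes; connections 1 and 2 are healthy but idle.
    dev = Dev([Conn(2, completes=False), Conn(2), Conn(2)])
    dev.xmit(0)
    dev.xmit(0)  # ring 0 is now full, so the shared queue stops
    stuck = dev.stopped and dev.poll_completions() == 0 and dev.stopped
    dev.timeout_flush()
    return stuck, dev.stopped
```

In this model demo() returns (True, False): the queue stays stopped
because the idle connections have nothing to complete, and only the
flush from the timeout path restarts it. The model has nothing to say
about case (2).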
