Arthur,

I just checked in a fix for bugzilla 1004, which seems to be the same problem you are seeing. I have only now noticed your explanation in an earlier post in this thread:

"So a call to ipoib_cm_send() with tx_outstanding = (ipoib_sendq_size - 2), followed by a call to ipoib_send() would get to a situation where the queue was full, but not stopped."
This is correct, and this was the bug (in addition to a missing call to netif_stop_queue() in ipoib_ib_tx_timer_func()). The patch compares tx_outstanding against the same value in every test that decides whether to call netif_stop_queue(), so the kernel cannot keep handing TX packets to IPoIB once the queue becomes too full. (Using the same value in all tests creates a "barrier" with no holes.)

This patch will be part of OFED 1.3.1-rc2, and you should see no more mthca "queue full" messages.

- Jack

P.S. This fix is not needed in the upstream kernel, since the unsignalled UD send mechanism was not added upstream.

On Sunday 11 May 2008 13:23, [EMAIL PROTECTED] wrote:
> On Sun, May 11, 2008 at 11:18:19AM +0300, Eli Cohen wrote:
> > ....
> > The reason why the queue is stopped when there is one entry still left
> > is to allow ipoib_ib_tx_timer_func() to post a special send request that
> > will ensure a completion is reported for this operation thus freeing
> > entries at the tx ring. I don't think the scenario you describe here can
> > lead to a deadlock since if that happens, it will be released because of
> > either one of the following two reasons:
> > 1. If the tx queue contains not yet polled, more than one completion of
> > send WRs posted by ipoib_cm_send(), they will soon be polled since they
> > are posted to a signaled QP and sooner or later will generate
> > completions and interrupts. In this case, subsequent postings to
> > ipoib_send() will work as expected.
> >
> > 2. If there is only one outstanding ipoib_cm_send() at the tx queue, it
> > means that there are 126 outstanding ipoib_send() requests at the tx
> > queue and this means that a few of them are signaled and are expected to
> > be completed soon.
>
> Thanks for the explanation.
>
> The main problem that we're seeing is that we just stop getting
> completions for the send queue.
> (And we see this with OFED-1.2
> and 1.3, which makes me think that it's unlikely to be due to the
> IPoIB driver since that's changed so much.)
>
> > .....
> > And last, could you arrange a remote access to a machine in this
> > condition so we could check the state of the device/FW?
> >
>
> Yes, I think so. Let me see if I can arrange that.
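
For anyone following the thread, the "barrier with no holes" idea can be illustrated with a small user-space model. This is only a sketch, not the driver code or the actual patch: the ring size of 128, the struct, the function names, and the exact comparisons in the "buggy" paths are all assumptions chosen to reproduce the scenario Arthur described (tx_outstanding = ipoib_sendq_size - 2, one connected-mode send, then one UD send).

```c
#include <stdbool.h>

#define SENDQ_SIZE 128              /* assumed ring size, illustration only */
#define STOP_AT    (SENDQ_SIZE - 1) /* keep one entry free, as Eli describes */

/* Toy model of the TX ring state; not the real ipoib structures. */
struct txq {
    unsigned outstanding;  /* stands in for tx_outstanding */
    bool stopped;          /* stands in for the netif queue state */
};

/* Hypothetical mismatched tests: each send path compares against a
 * different value, so one path can step over the other's threshold. */
static void buggy_cm_send(struct txq *q)
{
    if (++q->outstanding == SENDQ_SIZE)
        q->stopped = true;
}

static void buggy_ud_send(struct txq *q)
{
    if (++q->outstanding == SENDQ_SIZE - 1)
        q->stopped = true;
}

/* Single shared test used by every send path. The '>=' is a defensive
 * choice made for this sketch. */
static void post_and_check(struct txq *q)
{
    if (++q->outstanding >= STOP_AT)
        q->stopped = true;
}

/* True when the ring is full but the stack was never told to stop. */
bool full_but_running(const struct txq *q)
{
    return q->outstanding >= SENDQ_SIZE && !q->stopped;
}

/* Scenario from the thread, run against the mismatched tests. */
struct txq run_buggy(void)
{
    struct txq q = { SENDQ_SIZE - 2, false };

    buggy_cm_send(&q);   /* 127: not equal to 128, queue keeps running */
    buggy_ud_send(&q);   /* 128: not equal to 127, queue keeps running */
    return q;
}

/* Same scenario with the shared test. */
struct txq run_fixed(void)
{
    struct txq q = { SENDQ_SIZE - 2, false };

    post_and_check(&q);          /* connected-mode send: 127, queue stops */
    if (!q.stopped)
        post_and_check(&q);      /* UD send is never attempted */
    return q;
}
```

In this model run_buggy() ends with the ring full and the queue still running, while run_fixed() stops the queue with one entry left over, which is the entry ipoib_ib_tx_timer_func() needs for its special send request.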
