Michael Buesch wrote:
> On Thursday 07 September 2006 20:17, Larry Finger wrote:
>> Hi all,
>>
>> I think I have a fix for the bcm43xx bug that leads to NETDEV WATCHDOG tx
>> timeouts and would like it
>> to get as much testing as possible as this bug affects V2.6.18-rcX. If the
>> problem is truly
>> fixed, I hope to get the fix into mainline before release of the bug into
>> the stable series.
>>
>> I got the idea for the fix when I discovered that the timeout interval used
>> for the watchdog is the
>> default value of 5 seconds. Obviously, the few milliseconds used in the
>> periodic work handler
>> weren't causing us to "just miss".
>>
>> To exacerbate the problem, I changed the repeat timer for periodic work from
>> 15 to 1 sec. I also set
>> BADNESS_LIMIT to 0. As a result, I was running the problem code once per
>> second instead of once per
>> minute. Now failures would occur in minutes instead of hours.
>>
>> Operating from the premise that the DMA needed some time to reach the idle
>> state after the MAC was
>> suspended, I tried various delays, but nothing worked.
>>
>> Then I decided to test the premise that the problem was associated with
>> shutting down and restarting
>> the network. That lead to the current patch, which has run for what is
>> effectively 100 times longer
>> than previous versions.
>>
>> Larry
>> -------------------------------------------------------------------
>>
>>
>> Index: wireless-2.6/drivers/net/wireless/bcm43xx/bcm43xx_main.c
>> ===================================================================
>> --- wireless-2.6.orig/drivers/net/wireless/bcm43xx/bcm43xx_main.c
>> +++ wireless-2.6/drivers/net/wireless/bcm43xx/bcm43xx_main.c
>> @@ -3244,8 +3244,6 @@ static void bcm43xx_periodic_work_handle
>> * be preemtible.
>> */
>> mutex_lock(&bcm->mutex);
>> - netif_stop_queue(bcm->net_dev);
>> - synchronize_net();
>> spin_lock_irqsave(&bcm->irq_lock, flags);
>> bcm43xx_mac_suspend(bcm);
>> if (bcm43xx_using_pio(bcm))
>> @@ -3270,7 +3268,6 @@ static void bcm43xx_periodic_work_handle
>> if (bcm43xx_using_pio(bcm))
>> bcm43xx_pio_thaw_txqueues(bcm);
>> bcm43xx_mac_enable(bcm);
>> - netif_wake_queue(bcm->net_dev);
>> }
>> mmiowb();
>> spin_unlock_irqrestore(&bcm->irq_lock, flags);
>
> The real question is: Why does this patch help?
>
> Let's explain it. We don't stop networking just for fun there.
> While executing long preemptible periodic work, we must ensure
> that the TX path into the driver is not entered. It's the same
> reason why we disable IRQs in the first place. We can't take the
> mutex in the TX path and the IRQ handler. (That are the only places
> where we can't take the mutex).
> Short: We must stop netif here.
> The question is: Why does stopping netif queue cause a watchdog
> trigger here? The maximum time it can take for the periodic
> work inside of the critical section is about 0.2sec. So the queue
> is stopped for about 0.2sec max. Why does the watchdog trigger?
> Any idea from some networking guru?
> Could synchronize_net() take over 5sec in some worst case? Why?
> Questions over questions :D
This may be a stupid question, but does the synchronize_net call belong?
The reason I ask is because the code for synchronize_net is
void synchronize_net(void)
{
might_sleep();
synchronize_rcu();
}
When I look at synchronize_rcu, I find the following comment:
* synchronize_rcu - wait until a grace period has elapsed.
*
* Control will return to the caller some time after a full grace
* period has elapsed, in other words after all currently executing RCU
* read-side critical sections have completed. RCU read-side critical
* sections are delimited by rcu_read_lock() and rcu_read_unlock(),
* and may be nested.
*
* If your read-side code is not protected by rcu_read_lock(), do -not-
* use synchronize_rcu().
When I look through the mutex_lock code, I don't find any rcu code. What did I
miss? BTW, I still
got NETDEV tx timeouts with a 30 second timeout.
I'm currently testing with the netif_stop_queue and netif_wake_queue statements
restored, but
without the synchronize_net. It has run for over 11 hours with the accelerated
testing, which would
correspond to almost 4 weeks at regular rates.
Larry
_______________________________________________
Bcm43xx-dev mailing list
[email protected]
https://lists.berlios.de/mailman/listinfo/bcm43xx-dev