Re: RFC/T: Possible fix for bcm43xx periodic work bug

Larry Finger Fri, 08 Sep 2006 06:26:07 -0700

Michael Buesch wrote:
> On Thursday 07 September 2006 20:17, Larry Finger wrote:
>> Hi all,
>>
>> I think I have a fix for the bcm43xx bug that leads to NETDEV WATCHDOG tx 
>> timeouts and would like it
>> to get as much testing as possible as this bug affects V2.6.18-rcX. If the 
>> problem is truly
>> fixed, I hope to get the fix into mainline before release of the bug into 
>> the stable series.
>>
>> I got the idea for the fix when I discovered that the timeout interval used 
>> for the watchdog is the 
>> default value of 5 seconds. Obviously, the few milliseconds used in the 
>> periodic work handler 
>> weren't causing us to "just miss".
>>
>> To exacerbate the problem, I changed the repeat timer for periodic work from 
>> 15 to 1 sec. I also set 
>> BADNESS_LIMIT to 0. As a result, I was running the problem code once per 
>> second instead of once per 
>> minute. Now failures would occur in minutes instead of hours.
>>
>> Operating from the premise that the DMA needed some time to reach the idle 
>> state after the MAC was 
>> suspended, I tried various delays, but nothing worked.
>>
>> Then I decided to test the premise that the problem was associated with 
>> shutting down and restarting 
>> the network. That lead to the current patch, which has run for what is 
>> effectively 100 times longer 
>> than previous versions.
>>
>> Larry
>> -------------------------------------------------------------------
>>
>>
>> Index: wireless-2.6/drivers/net/wireless/bcm43xx/bcm43xx_main.c
>> ===================================================================
>> --- wireless-2.6.orig/drivers/net/wireless/bcm43xx/bcm43xx_main.c
>> +++ wireless-2.6/drivers/net/wireless/bcm43xx/bcm43xx_main.c
>> @@ -3244,8 +3244,6 @@ static void bcm43xx_periodic_work_handle
>>               * be preemtible.
>>               */
>>              mutex_lock(&bcm->mutex);
>> -            netif_stop_queue(bcm->net_dev);
>> -            synchronize_net();
>>              spin_lock_irqsave(&bcm->irq_lock, flags);
>>              bcm43xx_mac_suspend(bcm);
>>              if (bcm43xx_using_pio(bcm))
>> @@ -3270,7 +3268,6 @@ static void bcm43xx_periodic_work_handle
>>              if (bcm43xx_using_pio(bcm))
>>                      bcm43xx_pio_thaw_txqueues(bcm);
>>              bcm43xx_mac_enable(bcm);
>> -            netif_wake_queue(bcm->net_dev);
>>      }
>>      mmiowb();
>>      spin_unlock_irqrestore(&bcm->irq_lock, flags);
> 
> The real question is: Why does this patch help?
> 
> Let's explain it. We don't stop networking just for fun there.
> While executing long preemptible periodic work, we must ensure
> that the TX path into the driver is not entered. It's the same
> reason why we disable IRQs in the first place. We can't take the
> mutex in the TX path and the IRQ handler. (That are the only places
> where we can't take the mutex).
> Short: We must stop netif here.
> The question is: Why does stopping netif queue cause a watchdog
> trigger here? The maximum time it can take for the periodic
> work inside of the critical section is about 0.2sec. So the queue
> is stopped for about 0.2sec max. Why does the watchdog trigger?
> Any idea from some networking guru?
> Could synchronize_net() take over 5sec in some worst case? Why?
> Questions over questions :D


This may be a stupid question, but does the synchronize_net call belong?

The reason I ask is because the code for synchronize_net is

void synchronize_net(void)
{
         might_sleep();
         synchronize_rcu();
}

When I look at synchronize_rcu, I find the following comment:

* synchronize_rcu - wait until a grace period has elapsed.
  *
  * Control will return to the caller some time after a full grace
  * period has elapsed, in other words after all currently executing RCU
  * read-side critical sections have completed.  RCU read-side critical
  * sections are delimited by rcu_read_lock() and rcu_read_unlock(),
  * and may be nested.
  *
  * If your read-side code is not protected by rcu_read_lock(), do -not-
  * use synchronize_rcu().

When I look through the mutex_lock code, I don't find any rcu code. What did I 
miss? BTW, I still 
got NETDEV tx timeouts with a 30 second timeout.

I'm currently testing with the netif_stop_queue and netif_wake_queue statements 
restored, but 
without the synchronize_net. It has run for over 11 hours with the accelerated 
testing, which would 
correspond to almost 4 weeks at regular rates.

Larry
_______________________________________________
Bcm43xx-dev mailing list
Bcm43xx-dev@lists.berlios.de
https://lists.berlios.de/mailman/listinfo/bcm43xx-dev

Re: RFC/T: Possible fix for bcm43xx periodic work bug

Reply via email to