Michael Buesch wrote: > On Thursday 07 September 2006 20:17, Larry Finger wrote: >> Hi all, >> >> I think I have a fix for the bcm43xx bug that leads to NETDEV WATCHDOG tx >> timeouts and would like it >> to get as much testing as possible as this bug affects V2.6.18-rcX. If the >> problem is truly >> fixed, I hope to get the fix into mainline before release of the bug into >> the stable series. >> >> I got the idea for the fix when I discovered that the timeout interval used >> for the watchdog is the >> default value of 5 seconds. Obviously, the few milliseconds used in the >> periodic work handler >> weren't causing us to "just miss". >> >> To exacerbate the problem, I changed the repeat timer for periodic work from >> 15 to 1 sec. I also set >> BADNESS_LIMIT to 0. As a result, I was running the problem code once per >> second instead of once per >> minute. Now failures would occur in minutes instead of hours. >> >> Operating from the premise that the DMA needed some time to reach the idle >> state after the MAC was >> suspended, I tried various delays, but nothing worked. >> >> Then I decided to test the premise that the problem was associated with >> shutting down and restarting >> the network. That lead to the current patch, which has run for what is >> effectively 100 times longer >> than previous versions. >> >> Larry >> ------------------------------------------------------------------- >> >> >> Index: wireless-2.6/drivers/net/wireless/bcm43xx/bcm43xx_main.c >> =================================================================== >> --- wireless-2.6.orig/drivers/net/wireless/bcm43xx/bcm43xx_main.c >> +++ wireless-2.6/drivers/net/wireless/bcm43xx/bcm43xx_main.c >> @@ -3244,8 +3244,6 @@ static void bcm43xx_periodic_work_handle >> * be preemtible. >> */ >> mutex_lock(&bcm->mutex); >> - netif_stop_queue(bcm->net_dev); >> - synchronize_net(); >> spin_lock_irqsave(&bcm->irq_lock, flags); >> bcm43xx_mac_suspend(bcm); >> if (bcm43xx_using_pio(bcm)) >> @@ -3270,7 +3268,6 @@ static void bcm43xx_periodic_work_handle >> if (bcm43xx_using_pio(bcm)) >> bcm43xx_pio_thaw_txqueues(bcm); >> bcm43xx_mac_enable(bcm); >> - netif_wake_queue(bcm->net_dev); >> } >> mmiowb(); >> spin_unlock_irqrestore(&bcm->irq_lock, flags); > > The real question is: Why does this patch help? > > Let's explain it. We don't stop networking just for fun there. > While executing long preemptible periodic work, we must ensure > that the TX path into the driver is not entered. It's the same > reason why we disable IRQs in the first place. We can't take the > mutex in the TX path and the IRQ handler. (That are the only places > where we can't take the mutex). > Short: We must stop netif here. > The question is: Why does stopping netif queue cause a watchdog > trigger here? The maximum time it can take for the periodic > work inside of the critical section is about 0.2sec. So the queue > is stopped for about 0.2sec max. Why does the watchdog trigger? > Any idea from some networking guru? > Could synchronize_net() take over 5sec in some worst case? Why? > Questions over questions :D
This may be a stupid question, but does the synchronize_net call belong? The reason I ask is because the code for synchronize_net is void synchronize_net(void) { might_sleep(); synchronize_rcu(); } When I look at synchronize_rcu, I find the following comment: * synchronize_rcu - wait until a grace period has elapsed. * * Control will return to the caller some time after a full grace * period has elapsed, in other words after all currently executing RCU * read-side critical sections have completed. RCU read-side critical * sections are delimited by rcu_read_lock() and rcu_read_unlock(), * and may be nested. * * If your read-side code is not protected by rcu_read_lock(), do -not- * use synchronize_rcu(). When I look through the mutex_lock code, I don't find any rcu code. What did I miss? BTW, I still got NETDEV tx timeouts with a 30 second timeout. I'm currently testing with the netif_stop_queue and netif_wake_queue statements restored, but without the synchronize_net. It has run for over 11 hours with the accelerated testing, which would correspond to almost 4 weeks at regular rates. Larry _______________________________________________ Bcm43xx-dev mailing list Bcm43xx-dev@lists.berlios.de https://lists.berlios.de/mailman/listinfo/bcm43xx-dev