Re: [RFC] A change in periodic work scheduling in bcm43xx
On Thursday 07 September 2006 10:43, Bin Zhang wrote:
> On 9/5/06, Hendrik Sattler [EMAIL PROTECTED] wrote:
> > On Tuesday 05 September 2006 19:58, Larry Finger wrote:
> > > Based on user reports and my own experiences, the current problems with NETDEV WATCHDOG tx timeouts, and the device just falling over, do not happen when periodic work is not preemptible. These problems seem to affect BCM4306 rev 2 and 3 chips.
> > >
> > > Since I changed BADNESS_LIMIT to 20 to disable preemption during periodic work, my device has stayed up continuously for more than 18 hours. Previously, the longest time between failures was less than 6 hours, and sometimes as short as 10 minutes.
> >
> > I have a BCM4306 rev 3, running with linux-2.6.17.x and have none of these problems.
>
> It seems to me that this problem appears only when you use 2.6.18. I have already had this problem twice after more than 10 hours of uptime:
>
> Sep 7 09:39:45 localhost kernel: bcm43xx: Controller restarted
> Sep 7 09:43:09 localhost syslogd 1.4.1#18: restart.
>
> With 2.6.17, I have never seen this problem.

That was not the question. 2.6.17 does not have preemptible work.

We need a way to trigger this more quickly. It's rather hard to debug something if it only triggers every 10 hours, especially if we don't know for sure what's going on.

The problem is _not_ the BADNESS_LIMIT. So raising this _won't_ help us. The problem is somewhere in the preemptible work card-stopping code. That is the code path triggered by a high badness. A higher BADNESS_LIMIT _won't_ fix the code. It will only introduce more bugs and make it even harder to debug.

So _lowering_ the BADNESS_LIMIT could actually be a way to trigger it in reasonable time. We could also try to fire periodic work more often. Say once per second.

--
Greetings Michael.

___
Bcm43xx-dev mailing list
Bcm43xx-dev@lists.berlios.de
https://lists.berlios.de/mailman/listinfo/bcm43xx-dev

Re: [RFC] A change in periodic work scheduling in bcm43xx
On Wed, Sep 06, 2006 at 04:51:08PM -0500, Larry Finger wrote:
> Martin Langer wrote:
> > Larry, IIRC your hardware is a 0x812 rev 4. This core will load a different microcode (bcm43xx_microcode4.fw) than all later core revisions. (I guess the pci revision number isn't useful here.)
> >
> > Those old microcodes 2 and 4 seem to have a different instruction set in their firmware than later ones. So I'm afraid that those old bcm43xx cores are based on a different microprocessor. And if they are really based on a different microprocessor then their drivers will probably have different problems. But this theory will only fit if your problems are limited to the 0x812 rev2/rev4 world. Just my $0.02.
>
> No, mine is a 4306 rev 2 card.

Sure. But there are more rev numbers in your logfiles. And there has to be the revision of your wlan core 0x812, too. Look for a line like

bcm43xx: Core 1: ID 0x812, rev 0x4, vendor 0x4243, disabled
                               ^^^

Which microcode will be loaded depends only on that revision number.

> The output of your new microcode version printout is
>
> Microcode rev 0x123, pl 0x21 (2005-01-22 19:48:06)

Yeah, the patch seems to work as expected. That's good.

Martin

Re: [RFC] A change in periodic work scheduling in bcm43xx
On Thu, Sep 07, 2006 at 02:28:53PM +0200, Bin Zhang wrote:
> On 9/7/06, Martin Langer [EMAIL PROTECTED] wrote:
> > On Wed, Sep 06, 2006 at 04:51:08PM -0500, Larry Finger wrote:
> > > Martin Langer wrote:
> > > > Those old microcodes 2 and 4 seem to have a different instruction set in their firmware than later ones. So I'm afraid that those old bcm43xx cores are based on a different microprocessor. And if they are really based on a different microprocessor then their drivers will probably have different problems. But this theory will only fit if your problems are limited to the 0x812 rev2/rev4 world. Just my $0.02.
>
> I have this problem. This is the line in my syslog:
>
> bcm43xx: Core 1: ID 0x812, rev 0x5, vendor 0x4243, disabled

Ok, this will load bcm43xx_microcode5.fw, which is one of the newer ones. So we can forget that idea. It cannot be related to those rare old ones only.

Martin

PS: Don't forget to send a copy of your mail to [EMAIL PROTECTED] next time.

Re: [RFC] A change in periodic work scheduling in bcm43xx
Martin Langer wrote:
> On Wed, Sep 06, 2006 at 04:51:08PM -0500, Larry Finger wrote:
>
> Sure. But there are more rev numbers in your logfiles. And there has to be the revision of your wlan core 0x812, too. Look for a line like
>
> bcm43xx: Core 1: ID 0x812, rev 0x4, vendor 0x4243, disabled
>                                ^^^
>
> Which microcode will be loaded depends only on that revision number.

You're right. My numbers are

bcm43xx: Core 0: ID 0x800, rev 0x2, vendor 0x4243, enabled
bcm43xx: Core 1: ID 0x812, rev 0x4, vendor 0x4243, disabled
bcm43xx: Core 2: ID 0x80d, rev 0x1, vendor 0x4243, enabled
bcm43xx: Core 3: ID 0x807, rev 0x1, vendor 0x4243, disabled
bcm43xx: Core 4: ID 0x804, rev 0x7, vendor 0x4243, enabled
bcm43xx: Core 5: ID 0x812, rev 0x4, vendor 0x4243, disabled

> > The output of your new microcode version printout is
> >
> > Microcode rev 0x123, pl 0x21 (2005-01-22 19:48:06)
>
> Yeah, the patch seems to work as expected. That's good.

Sure does.

Larry

Re: [RFC] A change in periodic work scheduling in bcm43xx
On 9/7/06, Martin Langer [EMAIL PROTECTED] wrote:
> On Thu, Sep 07, 2006 at 02:28:53PM +0200, Bin Zhang wrote:
> > On 9/7/06, Martin Langer [EMAIL PROTECTED] wrote:
> > > On Wed, Sep 06, 2006 at 04:51:08PM -0500, Larry Finger wrote:
> > > > Martin Langer wrote:
> > > > > Those old microcodes 2 and 4 seem to have a different instruction set in their firmware than later ones. So I'm afraid that those old bcm43xx cores are based on a different microprocessor. And if they are really based on a different microprocessor then their drivers will probably have different problems. But this theory will only fit if your problems are limited to the 0x812 rev2/rev4 world. Just my $0.02.
> >
> > I have this problem. This is the line in my syslog:
> >
> > bcm43xx: Core 1: ID 0x812, rev 0x5, vendor 0x4243, disabled
>
> Ok, this will load bcm43xx_microcode5.fw, which is one of the newer ones. So we can forget that idea. It cannot be related to those rare old ones only.

grep "bcm43xx: Core" /var/log/syslog gives me other numbers:

Sep 7 14:44:17 localhost kernel: bcm43xx: Core 0: ID 0x800, rev 0x4, vendor 0x4243, enabled
Sep 7 14:44:17 localhost kernel: bcm43xx: Core 1: ID 0x812, rev 0x5, vendor 0x4243, disabled
Sep 7 14:44:17 localhost kernel: bcm43xx: Core 2: ID 0x80d, rev 0x2, vendor 0x4243, enabled
Sep 7 14:44:17 localhost kernel: bcm43xx: Core 3: ID 0x807, rev 0x2, vendor 0x4243, disabled
Sep 7 14:44:17 localhost kernel: bcm43xx: Core 4: ID 0x804, rev 0x9, vendor 0x4243, enabled

Thanks,
Bin

> Martin
>
> PS: Don't forget to send a copy of your mail to [EMAIL PROTECTED] next time.

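For anyone comparing hardware in this thread, the lookup Martin describes (find the 0x812 wlan core line and read its revision) can be scripted. The following is only a rough sketch in Python, not part of the driver or its tools: the regular expression is written against the log lines quoted in this thread, and the names `CORE_LINE`, `WLAN_CORE_ID` and `wlan_core_revisions` are made up for illustration.

```python
import re

# Pattern written against the core lines quoted in this thread; it also
# matches when a syslog timestamp/host prefix precedes "bcm43xx:".
CORE_LINE = re.compile(
    r"bcm43xx: Core (?P<index>\d+): ID (?P<id>0x[0-9a-fA-F]+), "
    r"rev (?P<rev>0x[0-9a-fA-F]+), vendor (?P<vendor>0x[0-9a-fA-F]+), "
    r"(?P<state>enabled|disabled)"
)

WLAN_CORE_ID = 0x812  # the wlan core ID named in the thread


def wlan_core_revisions(log_text):
    """Return the revision of every 0x812 core line found in log_text."""
    revisions = []
    for match in CORE_LINE.finditer(log_text):
        if int(match.group("id"), 16) == WLAN_CORE_ID:
            revisions.append(int(match.group("rev"), 16))
    return revisions


sample = """\
Sep 7 14:44:17 localhost kernel: bcm43xx: Core 0: ID 0x800, rev 0x4, vendor 0x4243, enabled
Sep 7 14:44:17 localhost kernel: bcm43xx: Core 1: ID 0x812, rev 0x5, vendor 0x4243, disabled
"""
print(wlan_core_revisions(sample))  # [5]
```

A card that reports two 0x812 cores, as Larry's does, would yield two entries in the returned list.
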
Re: [RFC] A change in periodic work scheduling in bcm43xx
On Wednesday 06 September 2006 09:36, Johannes Berg wrote:
> Michael,
>
> > When a preemptible work happens, we completely shut down IRQ handling and we suspend the MAC. We do this because we must not take the IRQ spinlock if we want to be preemptible. By not taking the IRQ spinlock, we race against the DMA engine (and other parts). So we must shut down any data flow during the periodic work to ensure the IRQ handler does not trigger.
> >
> > The sad thing is: We don't know much about how the card and the firmware work (yet). So the big question is: How to suspend the card in an easy and _inexpensive_ way? We currently mask all IRQs and suspend the MAC. I guess MAC suspending is part of the problem. I _guess_ the card is confused by suspending the MAC in the middle of possible transmissions. It's all just a guess. That's why I want to have a good way to reproduce the bug to do experiments.
> >
> > We could suspend the DMA TX channel before we suspend the MAC, for example. We could try other things as well. For example don't suspend the MAC at all. Just mask IRQs.
>
> I notice that later drivers say something like:
> * the MAC suspend is independent of DMA suspend
> * MAC suspend means that the MAC is suspended and won't tx/rx any frames
> * due to the device having FIFO buffers, DMA may continue after a MAC suspend until the buffers are full
> * the correct way to completely idle the card is to suspend the MAC and then wait for DMA/PIO to suspend as well
>
> Will put that into the spec at some point too :)

Nice, thanks for the explanation.

--
Greetings Michael.

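The ordering Johannes lists (suspend the MAC first, then wait for DMA to stop once the FIFO has filled) can be pictured with a toy model. This is purely illustrative Python and says nothing about the real hardware or firmware: `ToyCard`, its FIFO depth, the one-frame-per-tick rates and `idle_card` are all invented for the sketch.

```python
class ToyCard:
    """Toy model only: a MAC draining a TX FIFO that a DMA engine refills."""

    def __init__(self, fifo_depth=4, pending_frames=10):
        self.fifo_depth = fifo_depth   # hypothetical FIFO capacity in frames
        self.fifo = 0                  # frames currently in the FIFO
        self.pending = pending_frames  # frames still waiting in host memory
        self.mac_suspended = False

    def tick(self):
        """Advance one step; return True if DMA moved a frame."""
        if not self.mac_suspended and self.fifo > 0:
            self.fifo -= 1             # MAC transmits one frame
        if self.pending > 0 and self.fifo < self.fifo_depth:
            self.pending -= 1          # DMA copies one frame into the FIFO
            self.fifo += 1
            return True
        return False


def idle_card(card, max_ticks=100):
    """Suspend the MAC, then wait until DMA stops; return ticks waited."""
    card.mac_suspended = True
    for waited in range(max_ticks):
        if not card.tick():
            return waited
    raise TimeoutError("DMA did not go idle")


card = ToyCard()
card.tick()              # normal operation: one frame enters the FIFO
print(idle_card(card))   # DMA keeps running for a few ticks after the suspend
```

In this model the DMA side only goes quiet after the FIFO is full, which is why the wait has to come after the MAC suspend rather than before it.
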
Re: [RFC] A change in periodic work scheduling in bcm43xx
On Tue, Sep 05, 2006 at 08:23:09PM +0200, Hendrik Sattler wrote:
> On Tuesday 05 September 2006 19:58, Larry Finger wrote:
> > Based on user reports and my own experiences, the current problems with NETDEV WATCHDOG tx timeouts, and the device just falling over, do not happen when periodic work is not preemptible. These problems seem to affect BCM4306 rev 2 and 3 chips.
>
> I have a BCM4306 rev 3, running with linux-2.6.17.x and have none of these problems.

Larry, IIRC your hardware is a 0x812 rev 4. This core will load a different microcode (bcm43xx_microcode4.fw) than all later core revisions. (I guess the pci revision number isn't useful here.)

Those old microcodes 2 and 4 seem to have a different instruction set in their firmware than later ones. So I'm afraid that those old bcm43xx cores are based on a different microprocessor. And if they are really based on a different microprocessor then their drivers will probably have different problems. But this theory will only fit if your problems are limited to the 0x812 rev2/rev4 world. Just my $0.02.

Martin

Re: [RFC] A change in periodic work scheduling in bcm43xx
Martin Langer wrote:
> Larry, IIRC your hardware is a 0x812 rev 4. This core will load a different microcode (bcm43xx_microcode4.fw) than all later core revisions. (I guess the pci revision number isn't useful here.)
>
> Those old microcodes 2 and 4 seem to have a different instruction set in their firmware than later ones. So I'm afraid that those old bcm43xx cores are based on a different microprocessor. And if they are really based on a different microprocessor then their drivers will probably have different problems. But this theory will only fit if your problems are limited to the 0x812 rev2/rev4 world. Just my $0.02.

No, mine is a 4306 rev 2 card.

The output of your new microcode version printout is

Microcode rev 0x123, pl 0x21 (2005-01-22 19:48:06)

Larry

Re: [RFC] A change in periodic work scheduling in bcm43xx
On Tuesday 05 September 2006 19:58, Larry Finger wrote:
> Based on user reports and my own experiences, the current problems with NETDEV WATCHDOG tx timeouts, and the device just falling over, do not happen when periodic work is not preemptible. These problems seem to affect BCM4306 rev 2 and 3 chips.
>
> Since I changed BADNESS_LIMIT to 20 to disable preemption during periodic work, my device has stayed up continuously for more than 18 hours. Previously, the longest time between failures was less than 6 hours, and sometimes as short as 10 minutes.

I have a BCM4306 rev 3, running with linux-2.6.17.x and have none of these problems. My WLAN connection uptime is several days without any problem except those dummy messages from softmac:

TKIP: replay detected: STA=00:04:0e:90:76:e8 previous TSC 0894 received TSC 0894

Same for CCMP; WEP/WPA or the access point does not matter.

HS

Re: [RFC] A change in periodic work scheduling in bcm43xx
On Tuesday 05 September 2006 19:58, Larry Finger wrote:
> Michael,
>
> Based on user reports and my own experiences, the current problems with NETDEV WATCHDOG tx timeouts, and the device just falling over, do not happen when periodic work is not preemptible. These problems seem to affect BCM4306 rev 2 and 3 chips.
>
> Since I changed BADNESS_LIMIT to 20 to disable preemption during periodic work, my device has stayed up continuously for more than 18 hours. Previously, the longest time between failures was less than 6 hours, and sometimes as short as 10 minutes.
>
> As you know, the present scheme for periodic work scheduling for bcm43xx in both wireless-2.6 and wireless-dev runs all 4 periodic tasks on certain ticks of the 15-second clock. Using your values of badness of 1, 1, 5, and 10 for the 15, 30, 60, and 120 second periodic tasks, respectively, the badness repeat cycle is
>
>   ..., 1, 2, 1, 7, 1, 2, 1, 17, ...
>
> I propose that we reduce the size of the spike in badness by shifting the 120 second task from a clock value of 8n to 8n+7, and the 60 second task from 4n to 4n+1. This way no more than 2 of the periodic tasks will be run in any clock period, and the badness repeat cycle becomes
>
>   ..., 6, 2, 1, 2, 6, 2, 11, 2, ...
>
> The tasks are run with the same periodicity as before, just a little more asynchronously. I recall that they were completely asynchronous in early versions of this driver.
>
> Until we can locate and fix the problem that occurs during preemption, should we consider setting BADNESS_LIMIT to 20 in the wireless-2.6 kernels? For those of us whose cards have the problem, it certainly makes the device a lot more usable.

Oh well... And if we do this, it will take two weeks for the latency people to show up and request a revert of this again.

Well, I _really_ don't want to have a patch like this, because it just papers over a real bug. There are only two choices: Either we want preemption or we don't. It's worthless to tune the badness limit to a point where it is least likely for the bug to trigger. Sooner or later it _will_ trigger.

What we really want is:
1st: A reliable way to reproduce the bug in a short time. Waiting 20 hours isn't really a good way of debugging.
2nd: If we can reproduce it in reasonable time, we can track down what is actually causing the bug.

My thoughts on the bug:

When a preemptible work happens, we completely shut down IRQ handling and we suspend the MAC. We do this because we must not take the IRQ spinlock if we want to be preemptible. By not taking the IRQ spinlock, we race against the DMA engine (and other parts). So we must shut down any data flow during the periodic work to ensure the IRQ handler does not trigger.

The sad thing is: We don't know much about how the card and the firmware work (yet). So the big question is: How to suspend the card in an easy and _inexpensive_ way? We currently mask all IRQs and suspend the MAC. I guess MAC suspending is part of the problem. I _guess_ the card is confused by suspending the MAC in the middle of possible transmissions. It's all just a guess. That's why I want to have a good way to reproduce the bug to do experiments.

We could suspend the DMA TX channel before we suspend the MAC, for example. We could try other things as well. For example don't suspend the MAC at all. Just mask IRQs.

We must be _careful_ here. The preemptible periodic work is a damn fragile part of the whole driver and it is easily possible to break it even more with a patch that looks correct.

In short: We don't need a patch to paper over the bug, but we need _ideas_ of what is actually going on.

--
Greetings Michael.

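The two badness cycles quoted from Larry's proposal can be checked with a few lines of arithmetic. The following Python is only a sketch of that arithmetic, not driver code: the schedule tables and function names are invented here, a task is modelled as running when the 15-second tick count modulo its period equals its phase, and the "badness greater than the limit means preemptible" comparison is an assumption about how BADNESS_LIMIT is applied.

```python
# Sketch of the badness arithmetic from the proposal; not driver code.
# Each entry is (period_in_ticks, phase, badness); one tick is 15 seconds
# and a task runs when tick % period == phase.
CURRENT = [(1, 0, 1), (2, 0, 1), (4, 0, 5), (8, 0, 10)]
PROPOSED = [(1, 0, 1), (2, 0, 1), (4, 1, 5), (8, 7, 10)]


def badness(schedule, tick):
    """Sum the badness of every task that runs on this tick."""
    return sum(b for period, phase, b in schedule if tick % period == phase)


def cycle(schedule, start=1, length=8):
    """Badness for `length` consecutive ticks beginning at `start`."""
    return [badness(schedule, t) for t in range(start, start + length)]


def preemptible_ticks(schedule, limit, length=8):
    """Ticks in one cycle whose badness exceeds `limit` (assumed rule)."""
    return [t for t in range(1, length + 1) if badness(schedule, t) > limit]


print(cycle(CURRENT))    # [1, 2, 1, 7, 1, 2, 1, 17]
print(cycle(PROPOSED))   # [6, 2, 1, 2, 6, 2, 11, 2]
print(preemptible_ticks(CURRENT, limit=20))  # []
```

With the limit at 20 no tick in either schedule exceeds it, which matches the observation that a BADNESS_LIMIT of 20 disables preemption entirely; a lower limit passed to `preemptible_ticks` shows which ticks would take the preemptible path under this assumed rule.
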