Re: [RFC] A change in periodic work scheduling in bcm43xx

2006-09-07 Thread Michael Buesch
On Thursday 07 September 2006 10:43, Bin Zhang wrote:
 On 9/5/06, Hendrik Sattler [EMAIL PROTECTED] wrote:
  Am Dienstag 05 September 2006 19:58 schrieb Larry Finger:
   Based on user reports and my own experiences, the current problems with
   NETDEV WATCHDOG tx timeouts, and the device just falling over do not happen
   when periodic work is not preemptible. These problems seem to affect
   BCM4306 rev 2 & 3 chips. Since I changed BADNESS_LIMIT to 20 to disable
   preemption during periodic work, my device has stayed up continuously for
   more than 18 hours. Previously, the longest time between failures was less
   than 6 hours, and sometimes as short as 10 minutes.
 
  I have a BCM4306 rev 3, running with linux-2.6.17.x and have none of these
  problems.
 
 It seems to me that this problem appears only when you use 2.6.18.
 I have already had this problem twice after more than 10 hours of uptime:
 Sep  7 09:39:45 localhost kernel: bcm43xx: Controller restarted
 Sep  7 09:43:09 localhost syslogd 1.4.1#18: restart.
 
 With 2.6.17, I have never seen this problem.

That was not the question. 2.6.17 does not have preemptible work.

We need a way to trigger this more quickly. It's rather hard to debug
something if it only triggers every 10 hours. Especially if we don't
know for sure what's going on.

The problem is _not_ the BADNESS_LIMIT. So raising this _won't_
help us. The problem is somewhere in the preemptible work
card-stopping code. That is the code path triggered by a high badness.
A higher BADNESS_LIMIT _won't_ fix the code. It will only introduce
more bugs and make it even harder to debug.
So _lowering_ the BADNESS_LIMIT could actually be a way to
trigger it in reasonable time.
We could also try to fire periodic work more often.
Say once per second.
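
To put rough numbers on both ideas, here is a minimal sketch. It assumes
the badness values quoted in Larry Finger's original mail in this thread
(1, 1, 5 and 10 for the 15, 30, 60 and 120 second tasks), assumes the
preemptible path is taken when the summed badness is strictly greater than
BADNESS_LIMIT, and assumes that "firing more often" means shrinking the
tick while keeping the same schedule. The function name and the limit
values tried are made up for illustration; none of this is driver code.

```python
# Badness per task, keyed by its period in ticks (values from the thread).
TASKS = {1: 1, 2: 1, 4: 5, 8: 10}


def preemptible_runs_per_hour(badness_limit, tick_seconds=15):
    """Count periodic-work runs per hour whose badness exceeds the limit.

    Assumption: a run takes the preemptible path when its summed
    badness is strictly greater than badness_limit.
    """
    # Summed badness for each tick of one 8-tick cycle.
    cycle = [
        sum(bad for period, bad in TASKS.items() if tick % period == 0)
        for tick in range(1, 9)
    ]
    hits_per_cycle = sum(1 for badness in cycle if badness > badness_limit)
    cycles_per_hour = 3600 / (8 * tick_seconds)
    return hits_per_cycle * cycles_per_hour


# Hypothetical limits, with the 15 second tick and with a 1 second tick.
for limit in (20, 10, 5, 0):
    print(limit,
          preemptible_runs_per_hour(limit),
          preemptible_runs_per_hour(limit, tick_seconds=1))
```

Under these assumptions a limit of 20 never takes the preemptible path,
while a limit of 0 with a 1 second tick takes it on every run.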

-- 
Greetings Michael.
___
Bcm43xx-dev mailing list
Bcm43xx-dev@lists.berlios.de
https://lists.berlios.de/mailman/listinfo/bcm43xx-dev


Re: [RFC] A change in periodic work scheduling in bcm43xx

2006-09-07 Thread Martin Langer
On Wed, Sep 06, 2006 at 04:51:08PM -0500, Larry Finger wrote:
 Martin Langer wrote:
  Larry, IIRC your hardware is a 0x812 rev 4. This core will load a 
  different microcode (bcm43xx_microcode4.fw) than all later core 
  revisions. (I guess the pci revision number isn't useful here.)
  
  Those old microcodes 2 and 4 seem to have a different instruction set in 
  their firmware than later ones. So I'm afraid that those old bcm43xx 
  cores are based on a different microprocessor. And if they are really 
  based on a different microprocessor then their drivers will probably 
  have different problems. But this theory will only fit if your problems 
  are limited to the 0x812 rev2/rev4 world. Just my $0.02.
 
 No, mine is a 4306 rev 2 card. 

Sure. But there are more rev numbers in your logfiles. And there has to 
be the revision of your wlan core 0x812, too. Look for a line like

bcm43xx: Core 1: ID 0x812, rev 0x4, vendor 0x4243, disabled
                               ^^^

Which microcode will be loaded depends only on that revision number. 
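
If you want to pull that number out of a log automatically, here is a
minimal sketch. The regular expression is written against the line format
shown above; the function name is made up for illustration. It accepts any
whitespace after the commas so that log lines wrapped by a mail client
still match.

```python
import re

# Matches lines like:
#   bcm43xx: Core 1: ID 0x812, rev 0x4, vendor 0x4243, disabled
CORE_LINE = re.compile(
    r"bcm43xx: Core (?P<index>\d+): ID (?P<id>0x[0-9a-fA-F]+),\s+"
    r"rev (?P<rev>0x[0-9a-fA-F]+),\s+vendor (?P<vendor>0x[0-9a-fA-F]+),\s+"
    r"(?P<state>enabled|disabled)"
)


def wlan_core_revisions(log_text, core_id=0x812):
    """Return the revision of every core with the given ID found in the log."""
    return [
        int(match.group("rev"), 16)
        for match in CORE_LINE.finditer(log_text)
        if int(match.group("id"), 16) == core_id
    ]


sample = "bcm43xx: Core 1: ID 0x812, rev 0x4, vendor 0x4243, disabled\n"
print(wlan_core_revisions(sample))
```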

 The output of your new microcode version printout is Microcode rev 
 0x123, pl 0x21 (2005-01-22  19:48:06)

Yeah, the patch seems to work as expected. That's good.

Martin


Re: [RFC] A change in periodic work scheduling in bcm43xx

2006-09-07 Thread Martin Langer
On Thu, Sep 07, 2006 at 02:28:53PM +0200, Bin Zhang wrote:
 On 9/7/06, Martin Langer [EMAIL PROTECTED] wrote:
 On Wed, Sep 06, 2006 at 04:51:08PM -0500, Larry Finger wrote:
  Martin Langer wrote:
   Those old microcodes 2 and 4 seem to have a different instruction set in
   their firmware than later ones. So I'm afraid that those old bcm43xx
   cores are based on a different microprocessor. And if they are really
   based on a different microprocessor then their drivers will probably
   have different problems. But this theory will only fit if your problems
   are limited to the 0x812 rev2/rev4 world. Just my $0.02.

 I have this problem. This is the line in my syslog :
 bcm43xx: Core 1: ID 0x812, rev 0x5, vendor 0x4243, disabled

Ok, this will load bcm43xx_microcode5.fw which is one of the newer ones.
So we can forget that idea. The problem cannot be limited to those rare
old ones.

Martin

PS: Don't forget to send a copy of your mail to [EMAIL PROTECTED] 
next time.


Re: [RFC] A change in periodic work scheduling in bcm43xx

2006-09-07 Thread Larry Finger
Martin Langer wrote:
 On Wed, Sep 06, 2006 at 04:51:08PM -0500, Larry Finger wrote:
 
 Sure. But there are more rev numbers in your logfiles. And there has to 
 be the revision of your wlan core 0x812, too. Look for a line like
 
 bcm43xx: Core 1: ID 0x812, rev 0x4, vendor 0x4243, disabled
                                ^^^
 
 Which microcode will be loaded depends only on that revision number. 

You're right. My numbers are

bcm43xx: Core 0: ID 0x800, rev 0x2, vendor 0x4243, enabled
bcm43xx: Core 1: ID 0x812, rev 0x4, vendor 0x4243, disabled
bcm43xx: Core 2: ID 0x80d, rev 0x1, vendor 0x4243, enabled
bcm43xx: Core 3: ID 0x807, rev 0x1, vendor 0x4243, disabled
bcm43xx: Core 4: ID 0x804, rev 0x7, vendor 0x4243, enabled
bcm43xx: Core 5: ID 0x812, rev 0x4, vendor 0x4243, disabled

 
 The output of your new microcode version printout is Microcode rev 
 0x123, pl 0x21 (2005-01-22  19:48:06)
 
 Yeah, the patch seems to work like expected. That's good.

Sure does.

Larry



Re: [RFC] A change in periodic work scheduling in bcm43xx

2006-09-07 Thread Bin Zhang
On 9/7/06, Martin Langer [EMAIL PROTECTED] wrote:
 On Thu, Sep 07, 2006 at 02:28:53PM +0200, Bin Zhang wrote:
  On 9/7/06, Martin Langer [EMAIL PROTECTED] wrote:
  On Wed, Sep 06, 2006 at 04:51:08PM -0500, Larry Finger wrote:
   Martin Langer wrote:
Those old microcodes 2 and 4 seem to have a different instruction set in
their firmware than later ones. So I'm afraid that those old bcm43xx
cores are based on a different microprocessor. And if they are really
based on a different microprocessor then their drivers will probably
have different problems. But this theory will only fit if your problems
are limited to the 0x812 rev2/rev4 world. Just my $0.02.

  I have this problem. This is the line in my syslog :
  bcm43xx: Core 1: ID 0x812, rev 0x5, vendor 0x4243, disabled

 Ok, this will load bcm43xx_microcode5.fw which is one of the newer ones.
 So we can forget that idea. The problem cannot be limited to those rare
 old ones.

grep "bcm43xx: Core" /var/log/syslog gives me other numbers:

Sep  7 14:44:17 localhost kernel: bcm43xx: Core 0: ID 0x800, rev 0x4, vendor 0x4243, enabled
Sep  7 14:44:17 localhost kernel: bcm43xx: Core 1: ID 0x812, rev 0x5, vendor 0x4243, disabled
Sep  7 14:44:17 localhost kernel: bcm43xx: Core 2: ID 0x80d, rev 0x2, vendor 0x4243, enabled
Sep  7 14:44:17 localhost kernel: bcm43xx: Core 3: ID 0x807, rev 0x2, vendor 0x4243, disabled
Sep  7 14:44:17 localhost kernel: bcm43xx: Core 4: ID 0x804, rev 0x9, vendor 0x4243, enabled


Thanks,
Bin

 Martin

 PS: Don't forget to send a copy of your mail to [EMAIL PROTECTED]
 next time.



Re: [RFC] A change in periodic work scheduling in bcm43xx

2006-09-06 Thread Michael Buesch
On Wednesday 06 September 2006 09:36, Johannes Berg wrote:
 Michael,
 
  When a preemptible work happens, we completely shut down IRQ
  handling and we suspend the MAC. We do this, because we must
  not take the IRQ spinlock if we want to be preemptible.
  By not taking the IRQ spinlock, we race against the DMA engine
  (and other parts). So we must shut down any data flow during
  the periodic work to ensure the IRQ handler does not trigger.
  The sad thing is: We don't know much about how the card and
  the firmware work (yet). So the big question is:
  How to suspend the card in an easy and _inexpensive_ way?
  We currently mask all IRQs and suspend the MAC. I guess MAC
  suspending is part of the problem. I _guess_ the card is
  confused by suspending the MAC in the middle of possible
  transmissions. It's all just a guess. That's why I want to
  have a good way to reproduce the bug to do experiments.
  We could suspend the DMA TX channel before we suspend the MAC,
  for example. We could try other things as well. For example
  don't suspend the MAC at all. Just mask IRQs.
 
 I notice that later drivers say something like:
  * the MAC suspend is independent of DMA suspend
  * MAC suspend means that the MAC is suspended and won't tx/rx any
frames
  * due to the device having FIFO buffers, DMA may continue after a MAC
suspend until the buffers are full
  * the correct way to completely idle the card is to suspend the MAC
and then wait for DMA/PIO to suspend as well
 
 Will put that into the spec at some point too :)
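
The ordering in the last point can be illustrated with a toy model. This
is only a sketch of the idea, not driver code: the class, its fields and
the FIFO size are all hypothetical, and real hardware would be polled
through registers rather than stepped in a loop. It models a MAC fed by a
DMA engine through a bounded FIFO, suspends the MAC first, and then waits
until DMA has nothing left it can move.

```python
class ToyCard:
    """Hypothetical model: host queue -> DMA -> bounded FIFO -> MAC."""

    def __init__(self, fifo_capacity=4, queued_frames=10):
        self.fifo_capacity = fifo_capacity
        self.fifo_level = 0
        self.queued_frames = queued_frames  # frames the host still wants sent
        self.mac_suspended = False

    def dma_busy(self):
        # DMA keeps moving data while frames are queued and the FIFO has room.
        return self.queued_frames > 0 and self.fifo_level < self.fifo_capacity

    def step(self):
        # One time step: DMA pushes a frame in, a running MAC sends one out.
        if self.dma_busy():
            self.queued_frames -= 1
            self.fifo_level += 1
        if not self.mac_suspended and self.fifo_level > 0:
            self.fifo_level -= 1


def idle_card(card, max_steps=1000):
    """Suspend the MAC first, then wait for DMA to stop on its own."""
    card.mac_suspended = True
    steps = 0
    while card.dma_busy():
        if steps >= max_steps:
            raise TimeoutError("DMA did not go idle")
        card.step()
        steps += 1
    return steps


card = ToyCard()
steps = idle_card(card)
print(steps, card.fifo_level, card.queued_frames)
```

In this model DMA keeps running for a few steps after the MAC is
suspended and stops only once the FIFO is full, which is the behaviour
the list above describes.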

Nice, thanks for the explanation.

-- 
Greetings Michael.


Re: [RFC] A change in periodic work scheduling in bcm43xx

2006-09-06 Thread Martin Langer
On Tue, Sep 05, 2006 at 08:23:09PM +0200, Hendrik Sattler wrote:
 Am Dienstag 05 September 2006 19:58 schrieb Larry Finger:
  Based on user reports and my own experiences, the current problems with
  NETDEV WATCHDOG tx timeouts, and the device just falling over do not happen
  when periodic work is not preemptible. These problems seem to affect
  BCM4306 rev 2 & 3 chips. 

 I have a BCM4306 rev 3, running with linux-2.6.17.x and have none of these 
 problems.

Larry, IIRC your hardware is a 0x812 rev 4. This core will load a 
different microcode (bcm43xx_microcode4.fw) than all later core 
revisions. (I guess the pci revision number isn't useful here.)

Those old microcodes 2 and 4 seem to have a different instruction set in 
their firmware than later ones. So I'm afraid that those old bcm43xx 
cores are based on a different microprocessor. And if they are really 
based on a different microprocessor then their drivers will probably 
have different problems. But this theory will only fit if your problems 
are limited to the 0x812 rev2/rev4 world. Just my $0.02.

Martin


Re: [RFC] A change in periodic work scheduling in bcm43xx

2006-09-06 Thread Larry Finger
Martin Langer wrote:
 Larry, IIRC your hardware is a 0x812 rev 4. This core will load a 
 different microcode (bcm43xx_microcode4.fw) than all later core 
 revisions. (I guess the pci revision number isn't useful here.)
 
 Those old microcodes 2 and 4 seem to have a different instruction set in 
 their firmware than later ones. So I'm afraid that those old bcm43xx 
 cores are based on a different microprocessor. And if they are really 
 based on a different microprocessor then their drivers will probably 
 have different problems. But this theory will only fit if your problems 
 are limited to the 0x812 rev2/rev4 world. Just my $0.02.

No, mine is a 4306 rev 2 card. The output of your new microcode version
printout is:

Microcode rev 0x123, pl 0x21 (2005-01-22  19:48:06)

Larry


Re: [RFC] A change in periodic work scheduling in bcm43xx

2006-09-05 Thread Hendrik Sattler
Am Dienstag 05 September 2006 19:58 schrieb Larry Finger:
 Based on user reports and my own experiences, the current problems with
 NETDEV WATCHDOG tx timeouts, and the device just falling over do not happen
 when periodic work is not preemptible. These problems seem to affect
 BCM4306 rev 2 & 3 chips. Since I changed BADNESS_LIMIT to 20 to disable
 preemption during periodic work, my device has stayed up continuously for
 more than 18 hours. Previously, the longest time between failures was less
 than 6 hours, and sometimes as short as 10 minutes.

I have a BCM4306 rev 3, running with linux-2.6.17.x and have none of these 
problems. My WLAN connection uptime is several days without any problem 
except for those dummy messages from softmac:
TKIP: replay detected: STA=00:04:0e:90:76:e8 previous TSC 0894 
received TSC 0894

The same happens with CCMP; WEP/WPA or the access point makes no difference.

HS


Re: [RFC] A change in periodic work scheduling in bcm43xx

2006-09-05 Thread Michael Buesch
On Tuesday 05 September 2006 19:58, Larry Finger wrote:
 Michael,
 
 Based on user reports and my own experiences, the current problems with
 NETDEV WATCHDOG tx timeouts, and the device just falling over do not happen
 when periodic work is not preemptible. These problems seem to affect
 BCM4306 rev 2 & 3 chips. Since I changed BADNESS_LIMIT to 20 to disable
 preemption during periodic work, my device has stayed up continuously for
 more than 18 hours. Previously, the longest time between failures was less
 than 6 hours, and sometimes as short as 10 minutes.
 
 As you know, the present scheme for periodic work scheduling for bcm43xx in
 both wireless-2.6 and wireless-dev runs all 4 periodic tasks on certain
 ticks of the 15-second clock. Using your values of badness of 1, 1, 5, and
 10 for the 15, 30, 60, and 120 second periodic tasks, respectively, the
 badness repeat cycle is ..., 1, 2, 1, 7, 1, 2, 1, 17, ...
 
 I propose that we reduce the size of the spike in badness by shifting the
 120 second task from a clock value of 8n to 8n+7, and the 60 second task
 from 4n to 4n+1. This way no more than 2 of the periodic tasks will be run
 in any clock period, and the badness repeat cycle becomes ..., 6, 2, 1, 2,
 6, 2, 11, 2, ... The tasks are run with the same periodicity as before,
 just a little more asynchronously. I recall that they were completely
 asynchronous in early versions of this driver.
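
The two badness cycles can be checked with a short calculation. This is a
minimal sketch that uses only the badness values and tick offsets given
above; the function and variable names are made up for illustration and
are not taken from the driver.

```python
# Badness of each periodic task, keyed by its period in 15-second ticks
# (1, 1, 5 and 10 for the 15, 30, 60 and 120 second tasks).
TASKS = {1: 1, 2: 1, 4: 5, 8: 10}


def badness_cycle(offsets, ticks=range(1, 9)):
    """Sum the badness of every task that is due on each tick.

    offsets maps a task period to the tick offset it fires on, so a task
    runs when tick % period == offsets[period] (default offset is 0).
    """
    return [
        sum(bad for period, bad in TASKS.items()
            if tick % period == offsets.get(period, 0))
        for tick in ticks
    ]


current = badness_cycle({})             # every task on a multiple of its period
proposed = badness_cycle({4: 1, 8: 7})  # 60 s task at 4n+1, 120 s task at 8n+7

print(current)   # [1, 2, 1, 7, 1, 2, 1, 17]
print(proposed)  # [6, 2, 1, 2, 6, 2, 11, 2]
```

The peak drops from 17 to 11 while the total badness per 8-tick cycle
stays at 32, since every task still runs exactly as often as before.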
 
 Until we can locate and fix the problem that occurs during preemption,
 should we consider setting BADNESS_LIMIT to 20 in the wireless-2.6 kernels?
 For those of us whose cards have the problem, it certainly makes the device
 a lot more usable.

Oh well...
And if we do this, it will take two weeks for the latency-people to
show up and request a revert of this again.

Well, I _really_ don't want to have a patch like this, because
it just papers over a real bug.
There are only two choices: Either we want preemption or we don't.
It's worthless to tune the badness limit to a point where it is least
likely for the bug to trigger. Sooner or later it _will_ trigger.

What we really want is:
1st: A reliable way to reproduce the bug in a short time.
 Waiting 20 hours isn't really a good way of debugging.
2nd: If we can reproduce it in reasonable time, we can track
 down what is actually causing the bug.

My thoughts on the bug:

When a preemptible work happens, we completely shut down IRQ
handling and we suspend the MAC. We do this, because we must
not take the IRQ spinlock if we want to be preemptible.
By not taking the IRQ spinlock, we race against the DMA engine
(and other parts). So we must shut down any data flow during
the periodic work to ensure the IRQ handler does not trigger.
The sad thing is: We don't know much about how the card and
the firmware work (yet). So the big question is:
How to suspend the card in an easy and _inexpensive_ way?
We currently mask all IRQs and suspend the MAC. I guess MAC
suspending is part of the problem. I _guess_ the card is
confused by suspending the MAC in the middle of possible
transmissions. It's all just a guess. That's why I want to
have a good way to reproduce the bug to do experiments.
We could suspend the DMA TX channel before we suspend the MAC,
for example. We could try other things as well. For example
don't suspend the MAC at all. Just mask IRQs.

We must be _careful_ here. The preemptible periodic work
is a damn fragile part of the whole driver and it is easily
possible to break it even more with a patch that looks
correct.

Short:
We don't need a patch to paper over the bug, but we need
_ideas_ of what is actually going on.

-- 
Greetings Michael.