Re: My problems with stability on -current

2011-07-04 Thread Doug Barton

On 05/11/2011 04:33, Alexander Motin wrote:

On 11.05.2011 08:17, Doug Barton wrote:

I had an interesting result doing nothing but switching from HPET to
LAPIC ... no crash. Still on the same version of -current (r221566) the
only thing I've done is to add kern.eventtimer.timer=LAPIC to
/boot/loader.conf, and so far I haven't been able to get it to crash no
matter how much I compile, or how much other stuff I do in the
background. I _can_ get the system heavily loaded enough so that the
mouse can drag across the screen, windows take visible time to repaint,
etc. That happens with a load average of 4+ on this core 2 duo. But
other than that (which is not altogether unreasonable) the system has
been very stable for a couple of days now.

Does that suggest a next step in terms of what to test?


The fact that LAPIC is working fine can mean that the problem is either
HPET-specific or specific to non-per-CPU timers. To check that, you could
try the i8254 timer in one-shot mode:
hint.attimer.0.timecounter=0
kern.eventtimer.timer=i8254

or use HPET in per-CPU mode:
hint.atrtc.0.clock=0
hint.attimer.0.clock=0
hint.hpet.X.legacy_route=1

But the most informative would be to see what's going on with HPET
interrupts during the freezes. With HPET hardware it is very easy to
lose an interrupt, and a lost interrupt causes problems for many things.
There are some workarounds in place for that, but I can't be sure they
are enough. For that case you could experiment with this patch:
--- acpi_hpet.c.prev 2010-12-25 11:28:45.0 +0200
+++ acpi_hpet.c 2011-05-11 14:30:59.0 +0300
@@ -190,7 +190,7 @@ restart:
         bus_write_4(sc->mem_res, HPET_TIMER_COMPARATOR(t->num),
             t->next);
     }
-    if (fdiv < 5000) {
+    if (1 || fdiv < 5000) {
         bus_read_4(sc->mem_res, HPET_TIMER_COMPARATOR(t->num));
         now = bus_read_4(sc->mem_res, HPET_MAIN_COUNTER);


FYI, I have been running this patch since you sent it, and haven't 
crashed under high load since.


--

Nothin' ever doesn't change, but nothin' changes much.
-- OK Go

Breadth of IT experience, and depth of knowledge in the DNS.
Yours for the right price.  :)  http://SupersetSolutions.com/

___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: My problems with stability on -current

2011-05-12 Thread Doug Barton

On 05/11/2011 04:33, Alexander Motin wrote:

On 11.05.2011 08:17, Doug Barton wrote:

I had an interesting result doing nothing but switching from HPET to
LAPIC ... no crash. Still on the same version of -current (r221566) the
only thing I've done is to add kern.eventtimer.timer=LAPIC to
/boot/loader.conf, and so far I haven't been able to get it to crash no
matter how much I compile, or how much other stuff I do in the
background. I _can_ get the system heavily loaded enough so that the
mouse can drag across the screen, windows take visible time to repaint,
etc. That happens with a load average of 4+ on this core 2 duo. But
other than that (which is not altogether unreasonable) the system has
been very stable for a couple of days now.

Does that suggest a next step in terms of what to test?


The fact that LAPIC is working fine can mean that the problem is either
HPET-specific or specific to non-per-CPU timers. To check that, you could
try the i8254 timer in one-shot mode:
hint.attimer.0.timecounter=0
kern.eventtimer.timer=i8254

or use HPET in per-CPU mode:
hint.atrtc.0.clock=0
hint.attimer.0.clock=0
hint.hpet.X.legacy_route=1

But the most informative would be to see what's going on with HPET
interrupts during the freezes. With HPET hardware it is very easy to
lose an interrupt, and a lost interrupt causes problems for many things.
There are some workarounds in place for that, but I can't be sure they
are enough. For that case you could experiment with this patch:
--- acpi_hpet.c.prev 2010-12-25 11:28:45.0 +0200
+++ acpi_hpet.c 2011-05-11 14:30:59.0 +0300
@@ -190,7 +190,7 @@ restart:
         bus_write_4(sc->mem_res, HPET_TIMER_COMPARATOR(t->num),
             t->next);
     }
-    if (fdiv < 5000) {
+    if (1 || fdiv < 5000) {
         bus_read_4(sc->mem_res, HPET_TIMER_COMPARATOR(t->num));
         now = bus_read_4(sc->mem_res, HPET_MAIN_COUNTER);


Ok, I'll try the patch sometime soon, lots going on right now. FYI, I 
had something odd happen tonight, the laptop had been up for about 36 
hours, and it was idle for a while when I was afk for about an hour. 
When I came back, the system was off. Nothing in the logs, no core dump, 
but it definitely crashed because when I turned it back on the file 
systems were all dirty. This is still r221566 running LAPIC.


Interestingly I had pidgin running while it was idle, and a friend sent 
me an e-mail saying that he tried to IM me and as soon as he sent the 
message my status went from away to off line. The time he sent the 
e-mail corresponds roughly to the last entry in the log before I 
rebooted it. I realize that this is not a lot to go on, but I thought 
I'd mention it.



Doug

--

Nothin' ever doesn't change, but nothin' changes much.
-- OK Go

Breadth of IT experience, and depth of knowledge in the DNS.
Yours for the right price.  :)  http://SupersetSolutions.com/



Re: My problems with stability on -current

2011-05-11 Thread Alexander Motin

On 11.05.2011 08:17, Doug Barton wrote:

I had an interesting result doing nothing but switching from HPET to
LAPIC ... no crash. Still on the same version of -current (r221566) the
only thing I've done is to add kern.eventtimer.timer=LAPIC to
/boot/loader.conf, and so far I haven't been able to get it to crash no
matter how much I compile, or how much other stuff I do in the
background. I _can_ get the system heavily loaded enough so that the
mouse can drag across the screen, windows take visible time to repaint,
etc. That happens with a load average of 4+ on this core 2 duo. But
other than that (which is not altogether unreasonable) the system has
been very stable for a couple of days now.

Does that suggest a next step in terms of what to test?


The fact that LAPIC is working fine can mean that the problem is either
HPET-specific or specific to non-per-CPU timers. To check that, you could
try the i8254 timer in one-shot mode:

hint.attimer.0.timecounter=0
kern.eventtimer.timer=i8254

or use HPET in per-CPU mode:
hint.atrtc.0.clock=0
hint.attimer.0.clock=0
hint.hpet.X.legacy_route=1

But the most informative would be to see what's going on with HPET
interrupts during the freezes. With HPET hardware it is very easy to
lose an interrupt, and a lost interrupt causes problems for many things.
There are some workarounds in place for that, but I can't be sure they
are enough. For that case you could experiment with this patch:

--- acpi_hpet.c.prev 2010-12-25 11:28:45.0 +0200
+++ acpi_hpet.c 2011-05-11 14:30:59.0 +0300
@@ -190,7 +190,7 @@ restart:
         bus_write_4(sc->mem_res, HPET_TIMER_COMPARATOR(t->num),
             t->next);
     }
-    if (fdiv < 5000) {
+    if (1 || fdiv < 5000) {
         bus_read_4(sc->mem_res, HPET_TIMER_COMPARATOR(t->num));
         now = bus_read_4(sc->mem_res, HPET_MAIN_COUNTER);

--
Alexander Motin


Re: My problems with stability on -current

2011-05-10 Thread Alexander Motin

Hi.

On 10.05.2011 05:05, Jason Hellenthal wrote:

On Tue, May 10, 2011 at 04:29:25AM +0300, Alexander Motin wrote:

On 10.05.2011 02:48, Doug Barton wrote:


Ok, so kern.eventtimer.timer=LAPIC in /boot/loader.conf should do
that, right?


Yes. You can do it in run-time also.


Not quite absolutely sure here but IIRC the last time I tried setting that
via loader.conf in 8-STABLE it was not being set so I eventually added it
to sysctl.conf. Just for reference I never looked into it further.


There are no kern.eventtimer sysctls on 8-STABLE yet, so I'm not sure what
you were setting.


--
Alexander Motin


Re: My problems with stability on -current

2011-05-10 Thread Andriy Gapon
on 10/05/2011 05:05 Jason Hellenthal said the following:
 
> Alexander,
>
> On Tue, May 10, 2011 at 04:29:25AM +0300, Alexander Motin wrote:
>> On 10.05.2011 02:48, Doug Barton wrote:
>>
>>> Ok, so kern.eventtimer.timer=LAPIC in /boot/loader.conf should do
>>> that, right?
>>
>> Yes. You can do it in run-time also.
>
> Not quite absolutely sure here but IIRC the last time I tried setting that
> via loader.conf in 8-STABLE it was not being set so I eventually added it
> to sysctl.conf. Just for reference I never looked into it further.

Perhaps you are confusing selection of eventtimer with choice of timecounter?
For the latter indeed there is no tunable, which is a small annoyance.

-- 
Andriy Gapon


Re: My problems with stability on -current

2011-05-10 Thread Jason Hellenthal

Alexander,

On Tue, May 10, 2011 at 11:05:04AM +0300, Alexander Motin wrote:
> Hi.
>
> On 10.05.2011 05:05, Jason Hellenthal wrote:
>> On Tue, May 10, 2011 at 04:29:25AM +0300, Alexander Motin wrote:
>>> On 10.05.2011 02:48, Doug Barton wrote:
>>>
>>>> Ok, so kern.eventtimer.timer=LAPIC in /boot/loader.conf should do
>>>> that, right?
>>>
>>> Yes. You can do it in run-time also.
>>
>> Not quite absolutely sure here but IIRC the last time I tried setting that
>> via loader.conf in 8-STABLE it was not being set so I eventually added it
>> to sysctl.conf. Just for reference I never looked into it further.
>
> There is no kern.eventtimer sysctls on 8-STABLE yet, so not sure what
> you were setting.
 

Ugh! Yeah, I had that mixed up with kern.timecounter. Somehow I
transposed the two.

-- 

 Regards, (jhell)
 Jason Hellenthal





Re: My problems with stability on -current

2011-05-10 Thread Doug Barton
I had an interesting result doing nothing but switching from HPET to 
LAPIC ... no crash. Still on the same version of -current (r221566) the 
only thing I've done is to add kern.eventtimer.timer=LAPIC to 
/boot/loader.conf, and so far I haven't been able to get it to crash no 
matter how much I compile, or how much other stuff I do in the 
background. I _can_ get the system heavily loaded enough so that the 
mouse can drag across the screen, windows take visible time to repaint, 
etc. That happens with a load average of 4+ on this core 2 duo. But 
other than that (which is not altogether unreasonable) the system has 
been very stable for a couple of days now.


Does that suggest a next step in terms of what to test?

--

Nothin' ever doesn't change, but nothin' changes much.
-- OK Go

Breadth of IT experience, and depth of knowledge in the DNS.
Yours for the right price.  :)  http://SupersetSolutions.com/



Re: My problems with stability on -current

2011-05-09 Thread Alexander Motin

On 10.05.2011 02:48, Doug Barton wrote:

I would start from most obvious problems. I need to know more about
crashes. As usual: how to trigger, stack backtraces, etc.


Triggering is easy, I can start a buildworld with -j2, and a build of
ports/www/firefox with FORCE_MAKE_JOBS, and within 30 minutes the system
will reboot. I posted a panic message relative to r220282, (-current
archives, 4/4) but kib said it didn't make any sense. Usually I don't
get a panic at all.


Could you hint me the thread?


Go to http://www.FreeBSD.org/
Click 'mailing lists'
Click 'listed in the FreeBSD Handbook.'
Click freebsd-current
Click freebsd-current Archives
Click April 2011
search for r220282
Voila! :)


OK, but a URL would have been fine too. :) I agree with kib@: the message
doesn't match the backtrace.



What's about time problems, I would try to collect more data:
- show `sysctl kern.eventtimer`, `sysctl kern.timecounter` and verbose
dmesg outputs;


http://people.freebsd.org/~dougb/dougb-current-r221566.txt


- what eventtimer is used now and does it helps to switch to another
one with kern.eventtimer.timer sysctl?


When I was trying to track down the problems last summer I vaguely
remember trying RTC, but eventually we realized that the real problem
was throttling, so I stopped specifying RTC and let it go back to the
default. What do you suggest I try?


As I see, now you are using HPET (chosen automatically). I would try
switch to the LAPIC. Just make sure to disable C-states if you are
enabled them to be sure that LAPIC timer won't stop.


Ok, so kern.eventtimer.timer=LAPIC in /boot/loader.conf should do
that, right?


Yes. You can also do it at run time.


I don't use C-states (in part as a result of previous investigation) but
I do use powerd as such:
powerd_flags=-a adaptive -b adaptive -n adaptive


- does the timer runs in periodic or one-shot mode and does it helps to
switch to another one?


How could I tell, and how would I switch?


`sysctl kern.eventtimer.periodic`.


kern.eventtimer.periodic: 0


And read eventtimers(4) please.


I did that, but I don't see anything in there as to which choice is
one-shot, and how to change to periodic. I assume 0 is the default,
which I also assume is one-shot. Does setting that to 1 change to
periodic? Also, can I safely do this while the system is running, or
should it be in /boot/loader.conf as well?


Yes, a nonzero value means periodic. And yes, changing it at run time is safe.

--
Alexander Motin


Re: My problems with stability on -current

2011-05-09 Thread Jason Hellenthal

Alexander,

On Tue, May 10, 2011 at 04:29:25AM +0300, Alexander Motin wrote:
> On 10.05.2011 02:48, Doug Barton wrote:
>
>> Ok, so kern.eventtimer.timer=LAPIC in /boot/loader.conf should do
>> that, right?
>
> Yes. You can do it in run-time also.

Not quite absolutely sure here but IIRC the last time I tried setting that 
via loader.conf in 8-STABLE it was not being set so I eventually added it 
to sysctl.conf. Just for reference I never looked into it further.

-- 

 Regards, (jhell)
 Jason Hellenthal





Re: My problems with stability on -current

2011-05-09 Thread Doug Barton
New symptom: today (still running r221566) I compiled a small port, and that
worked without any freezes or interactivity problems. Then I tried
compiling a larger port (java/openjdk6 if anyone cares) and still had no
interactivity problems, but I got the "system wedges and requires a power
cycle" problem that I was seeing previously and had tracked to the one-shot
timer update.

More below.

On 05/07/2011 02:43, Alexander Motin wrote:

Doug Barton wrote:

On 05/05/2011 13:55, Alexander Motin wrote:

I see several possibly unrelated problems there:
   - crashes are always crashes. They should be debugged.
   - calcru going backwards could have the same roots as lost wall clock
time.


I think you're right about that. What usually happens when the load
maxes out is that the system visibly freezes for a minute or 2, and when
it comes back to life the log is flooded with calcru messages. If it
stays up long enough after that the wall clock drift becomes noticeable.
This is in spite of running ntpd.


These system freezes are very suspicious. Most time counters need only
few seconds to overflow, some even less. So freeze for few minutes will
easily overflow most of them. So the freezes are probably the cause of
time problems, but the question now is what the cause of freezes. You
should try to investigate what is going on during freezes. Does the
system do anything, are there any interrupts working (`vmstat -i` just
before and after), are there any interrupt storms, etc?


Here is the output on a mostly-idle system, shortly after reboot:

vmstat -i
interrupt                      total       rate
irq1: atkbd0                    1784          0
irq9: acpi0                        1          0
irq14: ata0                   213355         89
irq15: ata1                       58          0
irq17: wpi0                    74331         31
irq20: hpet0 uhci0+           787767        331
irq22: uhci2                   21453          9
irq256: hdac0                     11          0
Total                        1098760        462

At a more opportune time I'll try crashing it again and get another result.


If there are some problems with timer interrupts, timecounters
could wrap unnoticed that will cause random time jumps.
   - interactivity problems. I can't prove it is unrelated, but have no
real ideas now.

I would start from most obvious problems. I need to know more about
crashes. As usual: how to trigger, stack backtraces, etc.


Triggering is easy, I can start a buildworld with -j2, and a build of
ports/www/firefox with FORCE_MAKE_JOBS, and within 30 minutes the system
will reboot. I posted a panic message relative to r220282, (-current
archives, 4/4) but kib said it didn't make any sense. Usually I don't
get a panic at all.


Could you hint me the thread?


Go to http://www.FreeBSD.org/
Click 'mailing lists'
Click 'listed in the FreeBSD Handbook.'
Click freebsd-current
Click freebsd-current Archives
Click April 2011
search for r220282
Voila! :)


What's about time problems, I would try to collect more data:
   - show `sysctl kern.eventtimer`, `sysctl kern.timecounter` and verbose
dmesg outputs;


http://people.freebsd.org/~dougb/dougb-current-r221566.txt


   - what eventtimer is used now and does it helps to switch to another
one with kern.eventtimer.timer sysctl?


When I was trying to track down the problems last summer I vaguely
remember trying RTC, but eventually we realized that the real problem
was throttling, so I stopped specifying RTC and let it go back to the
default. What do you suggest I try?


As I see, now you are using HPET (chosen automatically). I would try
switch to the LAPIC. Just make sure to disable C-states if you are
enabled them to be sure that LAPIC timer won't stop.


Ok, so kern.eventtimer.timer=LAPIC in /boot/loader.conf should do
that, right?

I don't use C-states (in part as a result of previous investigation) but 
I do use powerd as such:

powerd_flags=-a adaptive -b adaptive -n adaptive


   - does the timer runs in periodic or one-shot mode and does it helps to
switch to another one?


How could I tell, and how would I switch?


`sysctl kern.eventtimer.periodic`.


kern.eventtimer.periodic: 0


And read eventtimers(4) please.


I did that, but I don't see anything in there as to which choice is
one-shot, and how to change to periodic. I assume 0 is the default,
which I also assume is one-shot. Does setting that to 1 change to
periodic? Also, can I safely do this while the system is running, or
should it be in /boot/loader.conf as well?


   - if full CPU load makes time to stop, try to track what is going on
with timer interrupts using `vmstat -i` and `systat -vm 1`. Under full
CPU load in one-shot mode you should have stable timer interrupt rate
about hz+stathz.


Ok, I'll do that tomorrow, tired now.


   - if timer interrupts are not working well, you can build kernel with
options KTR
options ALQ
options KTR_ALQ

Re: My problems with stability on -current

2011-05-07 Thread Doug Barton

On 05/05/2011 13:55, Alexander Motin wrote:

Doug Barton wrote:

Alexander suggested some knobs to twist for the timers, and I'll be glad
to do that once he gets back to me with more concrete suggestions now
that he knows more about my specific problems.


OK, I am all here. While this post is indeed larger then previous, it is
not much more informative. Sorry. :(


I understand.


I see several possibly unrelated problems there:
  - crashes are always crashes. They should be debugged.
  - calcru going backwards could have the same roots as lost wall clock
time.


I think you're right about that. What usually happens when the load
maxes out is that the system visibly freezes for a minute or 2, and when 
it comes back to life the log is flooded with calcru messages. If it 
stays up long enough after that the wall clock drift becomes noticeable. 
This is in spite of running ntpd.



If there are some problems with timer interrupts, timecounters
could wrap unnoticed that will cause random time jumps.
  - interactivity problems. I can't prove it is unrelated, but have no
real ideas now.

I would start from most obvious problems. I need to know more about
crashes. As usual: how to trigger, stack backtraces, etc.


Triggering is easy, I can start a buildworld with -j2, and a build of
ports/www/firefox with FORCE_MAKE_JOBS, and within 30 minutes the system 
will reboot. I posted a panic message relative to r220282, (-current 
archives, 4/4) but kib said it didn't make any sense. Usually I don't 
get a panic at all.



What's about time problems, I would try to collect more data:
  - show `sysctl kern.eventtimer`, `sysctl kern.timecounter` and verbose
dmesg outputs;


http://people.freebsd.org/~dougb/dougb-current-r221566.txt


  - what eventtimer is used now and does it helps to switch to another
one with kern.eventtimer.timer sysctl?


When I was trying to track down the problems last summer I vaguely
remember trying RTC, but eventually we realized that the real problem
was throttling, so I stopped specifying RTC and let it go back to the
default. What do you suggest I try?


  - does the timer runs in periodic or one-shot mode and does it helps to
switch to another one?


How could I tell, and how would I switch?


  - if full CPU load makes time to stop, try to track what is going on
with timer interrupts using `vmstat -i` and `systat -vm 1`. Under full
CPU load in one-shot mode you should have stable timer interrupt rate
about hz+stathz.


Ok, I'll do that tomorrow, tired now.


  - if timer interrupts are not working well, you can build kernel with
options KTR
options ALQ
options KTR_ALQ
options KTR_COMPILE=(KTR_SPARE2)
options KTR_ENTRIES=131072
options KTR_MASK=(KTR_SPARE2)
to track event timers operation and use ktrdump to save the trace when
problem exist (preferably when it begins).

And let's experiment with fresh CURRENT.


Done and done. I'm up to r221566, and I added those options to my kernel 
config. I ran ktrdump -cH -o ktrdumpfile and posted the results here: 
http://people.freebsd.org/~dougb/ktrdumpfile.txt  This was shortly after 
boot, with no load. Not sure if it helps, but there you go.



Thanks again for your help,

Doug

--

Nothin' ever doesn't change, but nothin' changes much.
-- OK Go

Breadth of IT experience, and depth of knowledge in the DNS.
Yours for the right price.  :)  http://SupersetSolutions.com/



Re: My problems with stability on -current

2011-05-07 Thread Alexander Motin
Doug Barton wrote:
 On 05/05/2011 13:55, Alexander Motin wrote:
 I see several possibly unrelated problems there:
   - crashes are always crashes. They should be debugged.
   - calcru going backwards could have the same roots as lost wall clock
 time.
 
 I think you're right about that. What usually happens when the load
 maxes out is that the system visibly freezes for a minute or 2, and when
 it comes back to life the log is flooded with calcru messages. If it
 stays up long enough after that the wall clock drift becomes noticeable.
 This is in spite of running ntpd.

These system freezes are very suspicious. Most timecounters need only a
few seconds to overflow, some even less, so a freeze of a few minutes will
easily overflow most of them. The freezes are therefore probably the cause
of the time problems, but the question now is what causes the freezes. You
should try to investigate what is going on during them: does the system do
anything, are any interrupts still working (`vmstat -i` just before and
after), are there any interrupt storms, etc.?
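
To put rough numbers on the overflow argument above, here is a minimal
sketch in Python. The counter widths and frequencies are nominal values
assumed for illustration (16-bit i8254 at 1.193182 MHz, 24-bit ACPI PM timer
at 3.579545 MHz, 32-bit HPET counter at 14.31818 MHz); they are not taken
from this thread, and real hardware may differ.

```python
# Rough wrap-around time of a free-running counter: 2**bits / frequency.
# All widths and frequencies below are assumed nominal values, not
# measurements from the machine discussed in this thread.

def wrap_seconds(bits, freq_hz):
    """Seconds until a counter of the given width overflows."""
    return (2 ** bits) / freq_hz

ASSUMED_COUNTERS = {
    "i8254": (16, 1_193_182),
    "ACPI PM timer": (24, 3_579_545),
    "HPET (32-bit)": (32, 14_318_180),
}

def wraps_during(freeze_seconds, bits, freq_hz):
    """True if a freeze of this length spans at least one full wrap."""
    return freeze_seconds >= wrap_seconds(bits, freq_hz)

if __name__ == "__main__":
    for name, (bits, freq) in ASSUMED_COUNTERS.items():
        print(f"{name}: wraps after {wrap_seconds(bits, freq):.3f} s")
```

Under these assumptions a 24-bit counter at about 3.58 MHz wraps in under
five seconds, so a freeze of a minute or two would span many wraps.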

 If there are some problems with timer interrupts, timecounters
 could wrap unnoticed that will cause random time jumps.
   - interactivity problems. I can't prove it is unrelated, but have no
 real ideas now.

 I would start from most obvious problems. I need to know more about
 crashes. As usual: how to trigger, stack backtraces, etc.
 
 Triggering is easy, I can start a buildworld with -j2, and a build of
 ports/www/firefox with FORCE_MAKE_JOBS, and within 30 minutes the system
 will reboot. I posted a panic message relative to r220282, (-current
 archives, 4/4) but kib said it didn't make any sense. Usually I don't
 get a panic at all.

Could you point me to the thread?

 What's about time problems, I would try to collect more data:
   - show `sysctl kern.eventtimer`, `sysctl kern.timecounter` and verbose
 dmesg outputs;
 
 http://people.freebsd.org/~dougb/dougb-current-r221566.txt
 
   - what eventtimer is used now and does it helps to switch to another
 one with kern.eventtimer.timer sysctl?
 
 When I was trying to track down the problems last summer I vaguely
 remember trying RTC, but eventually we realized that the real problem
 was throttling, so I stopped specifying RTC and let it go back to the
 default. What do you suggest I try?

As I see, you are now using HPET (chosen automatically). I would try
switching to the LAPIC. Just make sure to disable C-states if you have
enabled them, to be sure that the LAPIC timer won't stop.

   - does the timer runs in periodic or one-shot mode and does it helps to
 switch to another one?
 
 How could I tell, and how would I switch?

`sysctl kern.eventtimer.periodic`. And read eventtimers(4) please.

   - if full CPU load makes time to stop, try to track what is going on
 with timer interrupts using `vmstat -i` and `systat -vm 1`. Under full
 CPU load in one-shot mode you should have stable timer interrupt rate
 about hz+stathz.
 
 Ok, I'll do that tomorrow, tired now.
 
   - if timer interrupts are not working well, you can build kernel with
 options KTR
 options ALQ
 options KTR_ALQ
 options KTR_COMPILE=(KTR_SPARE2)
 options KTR_ENTRIES=131072
 options KTR_MASK=(KTR_SPARE2)
 to track event timers operation and use ktrdump to save the trace when
 problem exist (preferably when it begins).

 And let's experiment with fresh CURRENT.
 
 Done and done. I'm up to r221566, and I added those options to my kernel
 config. I ran ktrdump -cH -o ktrdumpfile and posted the results here:
 http://people.freebsd.org/~dougb/ktrdumpfile.txt  This was shortly after
 boot, with no load. Not sure if it helps, but there you go.

The dump looks fine, but I need a dump taken specifically at the time of
the problem. Since time probably can't be trusted here, it would be
nice to make the dump as localized as possible: clear the buffer with `sysctl
debug.ktr.clear=1`, trigger a freeze for a few seconds, stop collecting with
`sysctl debug.ktr.mask=0` and do the dump.
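
The sequence above could be scripted roughly as follows. This is only a
sketch: the helper names and the output file name are made up for
illustration, the `ktrdump` flags are the ones Doug used earlier in the
thread, and by default nothing is executed; the commands are only printed.

```python
import subprocess

def ktr_capture_commands(outfile="ktrdumpfile"):
    """Return (before, after) command lists around the freeze.

    'before' is run just before triggering the problem, 'after' once the
    freeze has lasted a few seconds. The helper name and the default file
    name are hypothetical.
    """
    before = [["sysctl", "debug.ktr.clear=1"]]   # clear the trace buffer
    after = [
        ["sysctl", "debug.ktr.mask=0"],          # stop collecting
        ["ktrdump", "-cH", "-o", outfile],       # save the trace
    ]
    return before, after

def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)

if __name__ == "__main__":
    before, after = ktr_capture_commands()
    run_all(before)
    # ... trigger the freeze here and let it last a few seconds ...
    run_all(after)
```

Keeping the steps in this order means the saved trace covers only the
interval around the freeze.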

-- 
Alexander Motin


My problems with stability on -current

2011-05-05 Thread Doug Barton
This is long, sorry. I wish I could condense things down to just the 
answer, or even just the question, but here goes. I've used HEAD on my 
main workstation(s) for many years. It's common for there to be ups and 
downs, and that's fine. Lately however the problems have been debilitating.


First a timeline. Since sometime before January 2008 I've been using a 
Dell Latitude D620 laptop as my primary system. It has a core 2 duo 
running at 2.33 G, and 2 G RAM. I 4xboot it with windows xp, freebsd 
current (amd64), another freebsd (usually 8.N-RELEASE i386) and Ubuntu. 
On the first and last I don't do a lot of compiling obviously, but even 
under heavy load on 8.2-RELEASE I'm not seeing problems, so the problems 
I _am_ seeing are not hardware related.


I keep my system very close to stock. My kernel config is GENERIC minus 
devices I don't have, and plus the following:


options EXT2FS
options IEEE80211_DEBUG # enable debug msgs
options VESA
device  atapicam
device  sound
device  snd_hda
device  snp

I was building with clang for a while, but when the problems started I 
went back to gcc. I still have INVARIANTS on but I disabled WITNESS 
because with all the known+unfixed LORs it's kind of pointless. Nothing 
interesting in make/src.conf either, the latter is just a list of stuff 
not to build, KERNCONF, and MODULES_OVERRIDE.


Starting around December 2009 I started having problems under load with 
-current. Often I reported them, sometimes problems were found, 
sometimes not. In the course of trying to debug those problems I 
disabled throttling, which helped. Switching to SCHED_4BSD also helped 
quite a bit with interactivity under load, although it was still worse 
than on 8.x.


In October of 2010 I was lucky enough to receive a donation of a Dell 
Optiplex desktop that I started using as my primary workstation. Around 
that same time there was some work being done in the scheduler(s) and 
various related systems, and my desktop (which had a slightly faster 
core 2 duo and 4 G RAM) was running great. I assumed that the problems 
were solved.


Then 2 months ago I packed up the desktop system and pulled out the 
laptop again. I updated to the latest -current on the laptop, and all 
heck broke loose. I couldn't do anything on my laptop that created even 
a mediocre load without it crashing. Trying to do something like a 
buildworld (even without -j) would cause the system to absolutely crawl. 
I'd get tons of the dreaded calcru messages about time going 
backwards, and the system clock would lose literally minutes of wall 
clock time. At one point when I could keep it up long enough to build 
the world without crashing it had lost 40 minutes of wall clock time 
when it finished. I think that specific problem happened sometime 
between March 15 and r220282.


In trying to find that problem, I uncovered another, deeper problem with
the one-shot timers from r212541. In order to make my binary search
easier for the problem described above I was using a -current snapshot
CD from August 2010 that I had lying around. With it I could easily build
world with -j2, run X, do normal desktop stuff (firefox, thunderbird, pidgin,
etc.) all at the same time. When I got closer to the more recent
-current, it would crash as soon as I put a load on it. I eventually
bisected down to that exact commit. I've been running on r212540 for
over a week now without any problems, including lots of port builds with
FORCE_MAKE_JOBS, etc.


Alexander suggested some knobs to twist for the timers, and I'll be glad 
to do that once he gets back to me with more concrete suggestions now 
that he knows more about my specific problems.



Doug

--

Nothin' ever doesn't change, but nothin' changes much.
-- OK Go

Breadth of IT experience, and depth of knowledge in the DNS.
Yours for the right price.  :)  http://SupersetSolutions.com/



Re: My problems with stability on -current

2011-05-05 Thread Alexander Motin
Doug Barton wrote:
 Alexander suggested some knobs to twist for the timers, and I'll be glad
 to do that once he gets back to me with more concrete suggestions now
 that he knows more about my specific problems.

OK, I am all here. While this post is indeed larger than the previous one,
it is not much more informative. Sorry. :(

I see several possibly unrelated problems there:
 - crashes are always crashes. They should be debugged.
 - calcru going backwards could have the same roots as lost wall clock
time. If there are problems with timer interrupts, timecounters
could wrap unnoticed, which would cause random time jumps.
 - interactivity problems. I can't prove it is unrelated, but have no
real ideas now.

I would start from most obvious problems. I need to know more about
crashes. As usual: how to trigger, stack backtraces, etc.

As for the time problems, I would try to collect more data:
 - show `sysctl kern.eventtimer`, `sysctl kern.timecounter` and verbose
dmesg outputs;
 - which eventtimer is used now, and does it help to switch to another
one with the kern.eventtimer.timer sysctl?
 - does the timer run in periodic or one-shot mode, and does it help to
switch to the other one?
 - if full CPU load makes time stop, try to track what is going on
with timer interrupts using `vmstat -i` and `systat -vm 1`. Under full
CPU load in one-shot mode you should have a stable timer interrupt rate
of about hz+stathz.
 - if timer interrupts are not working well, you can build a kernel with
options KTR
options ALQ
options KTR_ALQ
options KTR_COMPILE=(KTR_SPARE2)
options KTR_ENTRIES=131072
options KTR_MASK=(KTR_SPARE2)
to track event timer operation, and use ktrdump to save the trace when
the problem exists (preferably when it begins).
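
The hz+stathz check above can be done by hand from two `vmstat -i` samples;
a small sketch of the arithmetic follows. The function names, the 20%
tolerance, the sample totals, and the example values hz=1000 and stathz=127
are assumptions for illustration; the real values on a given system are
reported by `sysctl kern.clockrate`.

```python
def interrupt_rate(total_before, total_after, elapsed_seconds):
    """Average interrupts per second between two `vmstat -i` totals."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return (total_after - total_before) / elapsed_seconds

def near_expected(rate, hz, stathz, tolerance=0.2):
    """True if rate is within `tolerance` (a fraction) of hz + stathz."""
    expected = hz + stathz
    return abs(rate - expected) <= tolerance * expected

if __name__ == "__main__":
    # Made-up sample: timer interrupt totals taken 10 seconds apart.
    rate = interrupt_rate(787_767, 799_037, 10.0)
    ok = near_expected(rate, 1000, 127)
    print(f"{rate:.0f} interrupts/s, close to hz+stathz: {ok}")
```

Sampling over a fixed interval under load, rather than reading the
since-boot rate column, avoids averaging in idle periods.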

And let's experiment with fresh CURRENT.

-- 
Alexander Motin