cpu timer issues
Hello List,

We have been having issues with some firewall machines of ours using pfSense.

FreeBSD smash01.ish.com.au 7.2-RELEASE-p5 FreeBSD 7.2-RELEASE-p5 #0: Sun Dec 6 23:20:31 EST 2009 sullr...@freebsd_7.2_pfsense_1.2.3_snaps.pfsense.org:/usr/obj.pfSense/usr/pfSensesrc/src/sys/pfSense_SMP.7 i386

Motherboard: http://www.supermicro.com/products/motherboard/Xeon3000/3200/X7SBi-LN4.cfm

Originally the systems showed a lot of packet loss, the system time would fall behind, and the interrupt rate reported by "vmstat -i | grep timer" was dropping below 2000. I was led to believe by the guys at pfSense that this is where the value should sit. I would also see errors in the logs like:

kernel: calcru: runtime went backwards from 244314 usec to 236341

We tried a variety of things: disabling USB, turning off Intel SpeedStep in the BIOS, disabling ACPI, etc., all with little to no effect. The only thing that would right it was restarting the box, but over time it would degrade again. I talked to SuperMicro and they said that this is a FreeBSD issue and pretty much washed their hands of it.

After a couple of months of dealing with this and just rebooting the systems regularly, the symptoms slowly but surely disappeared: the kernel messages went away, the system time was no longer falling behind, and I was seeing no packet loss, but the "vmstat -i | grep timer" value continued to decrease over time. Eventually, I think, when it finally got to 0 the machine restarted (I am only guessing here). After this restart it worked again for a couple of hours and then it restarted again. After the second time the system has not missed a beat; it has been fine, and the "vmstat -i | grep timer" value remained near the 2000 mark. We set up some Zabbix monitoring to watch it.

As mentioned, it was fine for about a month. Until today. Today the value has dropped to 0, but the system has not restarted, and over the last couple of hours the value has increased to 47.
This machine is mission critical; we have two in a failover scenario (using pfSense's CARP features), and it seems unfortunate that we have an issue with two brand new SuperMicro boxes that affects both machines. While at the moment everything seems fine, I want to ensure that I have no further issues. Does anyone have any suggestions?

Lastly, I have double-checked both of the below:

http://www.freebsd.org/doc/en_US.ISO8859-1/books/faq/troubleshoot.html#CALCRU-NEGATIVE-RUNTIME

We disabled EIST.

http://www.freebsd.org/doc/en_US.ISO8859-1/books/faq/troubleshoot.html#COMPUTER-CLOCK-SKEW

# dmesg | grep Timecounter
Timecounter i8254 frequency 1193182 Hz quality 0
Timecounters tick every 1.000 msec
# sysctl kern.timecounter.hardware
kern.timecounter.hardware: i8254

Only one timer to choose from.

Thanks

Jurgen

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org
Re: cpu timer issues
On Tue, Sep 28, 2010 at 05:54:15PM +1000, Jurgen Weber wrote:
> [original problem report snipped; quoted in full above]
> # sysctl kern.timecounter.hardware
> kern.timecounter.hardware: i8254
> Only have one timer to choose from.

I have a subrevision of this motherboard in use in production, which ran RELENG_7 and now runs RELENG_8, without any of the problems you describe. I don't have any experience with the -LN4 submodel, although I do have experience with the X7SBA-LN4. Our hardware in question:

http://www.supermicro.com/products/system/1U/5015/SYS-5015B-MT.cfm

The machine in question consists of 4 disks (1 OS, 3 ZFS raidz1), uses both NICs (two separate networks) at gigE rates, handles nightly backups for all other servers, acts as an NFS server, a time source (ntpd) for other servers on the network, and a serial console head. Oh, it also has EIST enabled, and runs powerd with some minor (well-known) tunings in loader.conf for it.

Secondly, here's the kern.timecounter sysctl tree on our system, in addition to our SMBIOS details (proving the system is what I say it is). Note that we have multiple timecounter choices, and ACPI-fast is chosen.
I would expect problems if i8254 was chosen, but the question is why it is being chosen on your systems and why alternate timecounter choices aren't available. You said you tried booting with ACPI disabled, which might explain why ACPI-fast or ACPI-safe are missing.

$ sysctl kern.timecounter
kern.timecounter.tick: 1
kern.timecounter.choice: TSC(-100) ACPI-fast(1000) i8254(0) dummy(-100)
kern.timecounter.hardware: ACPI-fast
kern.timecounter.stepwarnings: 0
kern.timecounter.tc.i8254.mask: 65535
kern.timecounter.tc.i8254.counter: 47135
kern.timecounter.tc.i8254.frequency: 1193182
kern.timecounter.tc.i8254.quality: 0
kern.timecounter.tc.ACPI-fast.mask: 16777215
kern.timecounter.tc.ACPI-fast.counter: 188736
kern.timecounter.tc.ACPI-fast.frequency: 3579545
kern.timecounter.tc.ACPI-fast.quality: 1000
kern.timecounter.tc.TSC.mask: 4294967295
kern.timecounter.tc.TSC.counter: 2830682562
kern.timecounter.tc.TSC.frequency: 2333508681
kern.timecounter.tc.TSC.quality: -100
kern.timecounter.smp_tsc: 0
kern.timecounter.invariant_tsc: 1

$ kenv | grep smbios
smbios.bios.reldate=07/24/2009
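If a better timecounter does show up in kern.timecounter.choice, it can be pinned explicitly. A minimal sketch; the ACPI-fast value here is an assumption, and selecting it only works if that timecounter actually appears in the choice list on the box in question:

```shell
# Check what the kernel discovered, then select a timecounter at runtime:
#   sysctl kern.timecounter.choice
#   sysctl kern.timecounter.hardware=ACPI-fast    # hypothetical choice

# To make the selection persistent across reboots, in /etc/sysctl.conf:
kern.timecounter.hardware=ACPI-fast
```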
Re: cpu timer issues
on 28/09/2010 10:54 Jurgen Weber said the following:
> # dmesg | grep Timecounter
> Timecounter i8254 frequency 1193182 Hz quality 0
> Timecounters tick every 1.000 msec
> # sysctl kern.timecounter.hardware
> kern.timecounter.hardware: i8254
> Only have one timer to choose from.

Can you provide a little more hard data than the above? Specifically, the following sysctls:

kern.timecounter
dev.cpu

Output of vmstat -i. _Verbose_ boot dmesg. Please do not disable ACPI when taking this data. Preferably, upload it somewhere and post a link to it.

-- 
Andriy Gapon
Re: cpu timer issues
On 28.09.2010, at 10:54, Jurgen Weber jur...@ish.com.au wrote:
> [original problem report snipped; quoted in full above]
> Today the value has dropped to 0, but the system has not restarted and
> over the last couple of hours the value has increased to 47.

Hello,

vmstat -i calculates the interrupt rate as interrupt count / uptime, and the interrupt count is a 32-bit integer. With high values of kern.hz it will overflow in a few days (with kern.hz=4000 it will happen every 12 days or so). If that is the case, use "systat -vmstat 1" to get an accurate interrupt rate.

That is just FYI, because I was confused once and it scared me a bit, and I started changing counters until I noticed this.

P.S. Please forgive my poor English.
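The wrap interval described above is just integer arithmetic on the 32-bit counter; a quick sanity check of the "every 12 days or so" figure:

```shell
# A 32-bit interrupt counter wraps at 2^32 counts.  With kern.hz=4000
# the clock interrupt fires 4000 times per second, so the wrap period is:
secs=$(( 4294967296 / 4000 ))
days=$(( secs / 86400 ))
echo "$days"    # roughly 12 days, matching the estimate above
```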
Still getting kmem exhausted panic
Hi,

This is with stable as of yesterday, but with an un-tuned ZFS box I was still able to generate a "kmem exhausted" panic. Hard panic, just 3 lines.

The box contains 12GB memory and runs on a 6-core (with HT) Xeon; 6x 2TB WD Caviar Black in raidz2 with a 2x512MB mirrored log. The box died while rsyncing 5.8TB from its partnering system (that was the only activity on the box).

So the obvious conclusion would be that auto-tuning for ZFS on 8.1-STABLE is not yet quite there. So I guess that we still need tuning advice even for 8.1, and thus prevent a hard panic.

At the moment I am trying to 'zfs send | rsh zfs receive' the stuff, which seems to run at about 40MB/sec and is a lot faster than the rsync approach.

--WjW
Re: Still getting kmem exhausted panic
On Tue, Sep 28, 2010 at 01:24:28PM +0200, Willem Jan Withagen wrote:
> This is with stable as of yesterday, but with an un-tuned ZFS box I was
> still able to generate a kmem exhausted panic. Hard panic, just 3 lines.
> The box contains 12GB memory and runs on a 6-core (with HT) Xeon; 6x 2TB
> WD Caviar Black in raidz2 with a 2x512MB mirrored log. The box died while
> rsyncing 5.8TB from its partnering system (that was the only activity on
> the box).

It would help if you could provide output from the following commands (even after the box has rebooted):

$ sysctl -a | egrep ^vm.kmem
$ sysctl -a | egrep ^vfs.zfs.arc
$ sysctl kstat.zfs.misc.arcstats

> So the obvious conclusion would be that auto-tuning for ZFS on 8.1-STABLE
> is not yet quite there. So I guess that we still need tuning advice even
> for 8.1, and thus prevent a hard panic.

Andriy Gapon provides this general recommendation:

http://lists.freebsd.org/pipermail/freebsd-stable/2010-September/059114.html

The advice I've given for RELENG_8 (as of the time of this writing), 8.1-STABLE, and 8.1-RELEASE, is that for amd64 you'll need to tune:

vm.kmem_size
vfs.zfs.arc_max

An example machine, amd64 with 4GB physical RAM installed (3916MB available for use, verified via dmesg), uses these values:

vm.kmem_size=4096M
vfs.zfs.arc_max=3584M

Another example machine, amd64 with 8GB physical RAM installed (7875MB available for use), uses these values:

vm.kmem_size=8192M
vfs.zfs.arc_max=6144M

I believe the trick -- Andriy, please correct me if I'm wrong -- is the tuning of vfs.zfs.arc_max, which is now a hard limit rather than a high watermark. However, I believe there have been occasional reports of exhaustion panics despite both of these being set [1]. Those reports are being investigated on an individual basis.

I set some other ZFS-related parameters as well (disabling prefetch, adjusting txg.timeout, etc.), but those shouldn't be necessary to gain stability at this point in time.

I can't provide tuning advice for i386.
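Put as loader.conf lines, the pattern above would scale to a 12GB box roughly as follows. This is a sketch only: the values are extrapolated from the 4GB and 8GB examples, not figures given anywhere in this thread:

```shell
# /boot/loader.conf -- illustrative values for a 12GB amd64 box,
# extrapolated from the 4GB/8GB examples above (assumed, not tested)
vm.kmem_size="12288M"      # roughly physical RAM
vfs.zfs.arc_max="10240M"   # leave ~2GB of headroom for everything else
```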
[1]: http://lists.freebsd.org/pipermail/freebsd-stable/2010-September/059109.html

-- 
Jeremy Chadwick                                j...@parodius.com
Parodius Networking                  http://www.parodius.com/
UNIX Systems Administrator              Mountain View, CA, USA
Making life hard for others since 1977.         PGP: 4BD6C0CB
[releng_8_1 tinderbox] failure on powerpc/powerpc
TB --- 2010-09-28 12:06:50 - tinderbox 2.6 running on freebsd-current.sentex.ca
TB --- 2010-09-28 12:06:50 - starting RELENG_8_1 tinderbox run for powerpc/powerpc
TB --- 2010-09-28 12:06:50 - cleaning the object tree
TB --- 2010-09-28 12:09:09 - cvsupping the source tree
TB --- 2010-09-28 12:09:10 - /usr/bin/csup -z -r 3 -g -L 1 -h cvsup5.freebsd.org /tinderbox/RELENG_8_1/powerpc/powerpc/supfile
TB --- 2010-09-28 12:16:20 - WARNING: /usr/bin/csup returned exit code 1
TB --- 2010-09-28 12:16:20 - ERROR: unable to cvsup the source tree
TB --- 2010-09-28 12:16:20 - 2.59 user 202.72 system 570.10 real

http://tinderbox.freebsd.org/tinderbox-releng_8-RELENG_8_1-powerpc-powerpc.full
Re: Still getting kmem exhausted panic
on 28/09/2010 14:50 Jeremy Chadwick said the following:
> I believe the trick -- Andriy, please correct me if I'm wrong -- is the

Wouldn't hurt to CC me, so that I could do it :-)

> tuning of vfs.zfs.arc_max, which is now a hard limit rather than a high
> watermark.

Not sure what you mean here. What is a hard limit, what is a high watermark, what is the difference, and when is "now"? :-)

I believe that the trick is to set vm.kmem_size high enough, either using this tunable or vm.kmem_size_scale.

> However, I believe there have been occasional reports of exhaustion
> panics despite both of these being set [1]. Those reports are being
> investigated on an individual basis.

I don't believe that the report that you quote actually demonstrates what you say it does. Two quotes from it:

> > During these panics no tuning or /boot/loader.conf values where present.
> > Only after hitting this behaviour yesterday i created boot/loader.conf

> [1]: http://lists.freebsd.org/pipermail/freebsd-stable/2010-September/059109.html

-- 
Andriy Gapon
Re: Still getting kmem exhausted panic
On Tue, Sep 28, 2010 at 03:22:01PM +0300, Andriy Gapon wrote:
> on 28/09/2010 14:50 Jeremy Chadwick said the following:
> > I believe the trick -- Andriy, please correct me if I'm wrong -- is the
> > tuning of vfs.zfs.arc_max, which is now a hard limit rather than a high
> > watermark.
>
> Not sure what you mean here. What is a hard limit, what is a high
> watermark, what is the difference, and when is "now"? :-)

There was some speculation on the part of users a while back which led to this understanding. Folks were seeing actual ARC usage higher than what vfs.zfs.arc_max was set to (automatically or administratively). I believe it started here:

http://www.mailinglistarchive.com/freebsd-curr...@freebsd.org/msg28884.html

With the high-water mark statements being here:

http://www.mailinglistarchive.com/freebsd-curr...@freebsd.org/msg28887.html
http://unix.derkeiler.com/Mailing-Lists/FreeBSD/stable/2010-04/msg00129.html

The term implies that there is no explicit hard limit on ARC utilisation/growth. As stated in the unix.derkeiler.com URL above, this behaviour was in fact changed. Why/when/how? I had to go digging up the commits -- this took me some time. Here they are, labelled r197816, for RELENG_8 and RELENG_7 respectively. These were both committed on 2010/01/08 UTC:

http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c#rev1.22.2.2
http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c#rev1.15.2.6

In HEAD/CURRENT (yet to be MFC'd), it looks like the above code got removed on 2010/09/17 UTC, citing that "they should be enforced by actual calculations of delta":

http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c#rev1.46
http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c#rev1.45

So what's this "delta" code piece that's mentioned? That appears to have been committed to RELENG_8 on 2010/05/24 UTC (thus, between the above two dates):

http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c#rev1.22.2.4

(Side note: the delta stuff was never committed to RELENG_7 -- and that's fine. I'm pointing this out not out of retaliation or insult, but because people will almost certainly Google, find this post, and wonder if their 7.x machines might be affected.)

This situation with the ARC, and all its changes over time, is one of the reasons why I rant aggressively about the need for more communication transparency (re: what the changes actually affect). Most SAs and users don't follow commits.

> I believe that the trick is to set vm.kmem_size high enough, either
> using this tunable or vm.kmem_size_scale.

Thanks for the clarification. I just wish I knew how vm.kmem_size_scale fits into the picture (meaning what it does, etc.). The sysctl description isn't very helpful. Again, my lack of VM knowledge...

> > However, I believe there have been occasional reports of exhaustion
> > panics despite both of these being set [1]. Those reports are being
> > investigated on an individual basis.
>
> I don't believe that the report that you quote actually demonstrates
> what you say it does. Two quotes from it:
>
> > > During these panics no tuning or /boot/loader.conf values where present.
> > > Only after hitting this behaviour yesterday i created boot/loader.conf
>
> > [1]: http://lists.freebsd.org/pipermail/freebsd-stable/2010-September/059109.html

You're right -- the report I'm quoting is not the one I thought it was. I'll see if I can dig up the correct mail/report. It could be that I'm thinking of something quite old (pre-ARC-changes; see above paragraphs). I can barely keep track of all the changes going on.

-- 
Jeremy Chadwick                                j...@parodius.com
Parodius Networking                  http://www.parodius.com/
UNIX Systems Administrator              Mountain View, CA, USA
Making life hard for others since 1977.         PGP: 4BD6C0CB
Re: Still getting kmem exhausted panic
On 28-9-2010 13:50, Jeremy Chadwick wrote:
> [quoted problem report snipped; quoted in full above]
>
> It would help if you could provide output from the following commands
> (even after the box has rebooted):

It is currently in the process of a zfs receive of that same 5.8TB.

$ sysctl -a | egrep ^vm.kmem
vm.kmem_size_scale: 3
vm.kmem_size_max: 329853485875
vm.kmem_size_min: 0
vm.kmem_size: 4156850176

$ sysctl -a | egrep ^vfs.zfs.arc
vfs.zfs.arc_meta_limit: 770777088
vfs.zfs.arc_meta_used: 33449648
vfs.zfs.arc_min: 385388544
vfs.zfs.arc_max: 3083108352

$ sysctl kstat.zfs.misc.arcstats
kstat.zfs.misc.arcstats.hits: 3119873
kstat.zfs.misc.arcstats.misses: 98710
kstat.zfs.misc.arcstats.demand_data_hits: 3043947
kstat.zfs.misc.arcstats.demand_data_misses: 3699
kstat.zfs.misc.arcstats.demand_metadata_hits: 67981
kstat.zfs.misc.arcstats.demand_metadata_misses: 90005
kstat.zfs.misc.arcstats.prefetch_data_hits: 121
kstat.zfs.misc.arcstats.prefetch_data_misses: 48
kstat.zfs.misc.arcstats.prefetch_metadata_hits: 7824
kstat.zfs.misc.arcstats.prefetch_metadata_misses: 4958
kstat.zfs.misc.arcstats.mru_hits: 34828
kstat.zfs.misc.arcstats.mru_ghost_hits: 21736
kstat.zfs.misc.arcstats.mfu_hits: 3077133
kstat.zfs.misc.arcstats.mfu_ghost_hits: 47605
kstat.zfs.misc.arcstats.allocated: 5507025
kstat.zfs.misc.arcstats.deleted: 5349715
kstat.zfs.misc.arcstats.stolen: 4468221
kstat.zfs.misc.arcstats.recycle_miss: 83995
kstat.zfs.misc.arcstats.mutex_miss: 231
kstat.zfs.misc.arcstats.evict_skip: 130461
kstat.zfs.misc.arcstats.evict_l2_cached: 0
kstat.zfs.misc.arcstats.evict_l2_eligible: 592200836608
kstat.zfs.misc.arcstats.evict_l2_ineligible: 1192160
kstat.zfs.misc.arcstats.hash_elements: 20585
kstat.zfs.misc.arcstats.hash_elements_max: 150543
kstat.zfs.misc.arcstats.hash_collisions: 761847
kstat.zfs.misc.arcstats.hash_chains: 780
kstat.zfs.misc.arcstats.hash_chain_max: 6
kstat.zfs.misc.arcstats.p: 2266075295
kstat.zfs.misc.arcstats.c: 2410082200
kstat.zfs.misc.arcstats.c_min: 385388544
kstat.zfs.misc.arcstats.c_max: 3083108352
kstat.zfs.misc.arcstats.size: 2410286720
kstat.zfs.misc.arcstats.hdr_size: 7565040
kstat.zfs.misc.arcstats.data_size: 2394099200
kstat.zfs.misc.arcstats.other_size: 8622480
kstat.zfs.misc.arcstats.l2_hits: 0
kstat.zfs.misc.arcstats.l2_misses: 0
kstat.zfs.misc.arcstats.l2_feeds: 0
kstat.zfs.misc.arcstats.l2_rw_clash: 0
kstat.zfs.misc.arcstats.l2_read_bytes: 0
kstat.zfs.misc.arcstats.l2_write_bytes: 0
kstat.zfs.misc.arcstats.l2_writes_sent: 0
kstat.zfs.misc.arcstats.l2_writes_done: 0
kstat.zfs.misc.arcstats.l2_writes_error: 0
kstat.zfs.misc.arcstats.l2_writes_hdr_miss: 0
kstat.zfs.misc.arcstats.l2_evict_lock_retry: 0
kstat.zfs.misc.arcstats.l2_evict_reading: 0
kstat.zfs.misc.arcstats.l2_free_on_write: 0
kstat.zfs.misc.arcstats.l2_abort_lowmem: 0
kstat.zfs.misc.arcstats.l2_cksum_bad: 0
kstat.zfs.misc.arcstats.l2_io_error: 0
kstat.zfs.misc.arcstats.l2_size: 0
kstat.zfs.misc.arcstats.l2_hdr_size: 0
kstat.zfs.misc.arcstats.memory_throttle_count: 0
kstat.zfs.misc.arcstats.l2_write_trylock_fail: 0
kstat.zfs.misc.arcstats.l2_write_passed_headroom: 0
kstat.zfs.misc.arcstats.l2_write_spa_mismatch: 0
kstat.zfs.misc.arcstats.l2_write_in_l2: 0
kstat.zfs.misc.arcstats.l2_write_io_in_progress: 0
kstat.zfs.misc.arcstats.l2_write_not_cacheable: 85908
kstat.zfs.misc.arcstats.l2_write_full: 0
kstat.zfs.misc.arcstats.l2_write_buffer_iter: 0
kstat.zfs.misc.arcstats.l2_write_pios: 0
kstat.zfs.misc.arcstats.l2_write_buffer_bytes_scanned: 0
kstat.zfs.misc.arcstats.l2_write_buffer_list_iter: 0
kstat.zfs.misc.arcstats.l2_write_buffer_list_null_iter: 0

> > So the obvious conclusion would be that auto-tuning for ZFS on
> > 8.1-STABLE is not yet quite there. So I guess that we still need
> > tuning advice even for 8.1, and thus prevent a hard panic.
>
> Andriy Gapon provides this general recommendation:
>
> http://lists.freebsd.org/pipermail/freebsd-stable/2010-September/059114.html
>
> The advice I've given for RELENG_8 (as of the time of this writing),
> 8.1-STABLE, and 8.1-RELEASE, is that for amd64 you'll need to tune:

Well, advice seems to vary, and the latest I understood was that 8.1-STABLE did not need any tuning. (The other system with a much older kernel is tuned as to what most here are suggesting.) And I was surely led to believe that ever since 8.0 panics were no longer among us...

> vm.kmem_size
> vfs.zfs.arc_max

real memory = 12889096192 (12292 MB)
avail memory = 12408684544 (11833 MB)

So that prompts vm.kmem_size=18G.
Re: Still getting kmem exhausted panic
on 28/09/2010 16:23 Jeremy Chadwick said the following:
> [history of the arc_max commits snipped; quoted in full above]
>
> This situation with the ARC, and all its changes over time, is one of
> the reasons why I rant aggressively about the need for more
> communication transparency (re: what the changes actually affect).
> Most SAs and users don't follow commits.

Well, no time for me to dig through all that history. arc_max should be a hard limit, and it is now. If it ever wasn't, then it was a bug.

Besides, "high watermark" is still an ambiguous term: for you it implies that it is not a hard limit, but for me it implies exactly a hard limit. Additionally, going from a non-hard limit to a hard limit on ARC size should improve things memory-wise, not vice versa, right? :)

P.S. All that I said above is a hint that this is a pointless branch of the thread :)

-- 
Andriy Gapon
Re: Still getting kmem exhausted panic
on 28/09/2010 16:23 Jeremy Chadwick said the following:
> On Tue, Sep 28, 2010 at 03:22:01PM +0300, Andriy Gapon wrote:
> > I believe that the trick is to set vm.kmem_size high enough, either
> > using this tunable or vm.kmem_size_scale.
>
> Thanks for the clarification. I just wish I knew how vm.kmem_size_scale
> fits into the picture (meaning what it does, etc.). The sysctl
> description isn't very helpful. Again, my lack of VM knowledge...

Roughly, vm.kmem_size would get set to available memory divided by vm.kmem_size_scale.

-- 
Andriy Gapon
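That division can be sanity-checked against the numbers posted earlier in the thread; a rough sketch (the result is approximate, since the kernel rounds and clamps the computed value):

```shell
# vm.kmem_size ~= avail memory / vm.kmem_size_scale
# Using the figures reported earlier in this thread:
avail=12408684544   # avail memory in bytes
scale=3             # vm.kmem_size_scale
echo $(( avail / scale ))
# ~4.1GB, in the same ballpark as the reported vm.kmem_size of 4156850176
```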
Re: Still getting kmem exhausted panic
on 28/09/2010 16:25 Willem Jan Withagen said the following:
> Well, advice seems to vary, and the latest I understood was that
> 8.1-STABLE did not need any tuning. (The other system with a much older
> kernel is tuned as to what most here are suggesting.) And I was surely
> led to believe that ever since 8.0 panics were no longer among us...

Well, now you have demonstrated yourself that it is not always so.

> > vm.kmem_size
> > vfs.zfs.arc_max
>
> real memory = 12889096192 (12292 MB)
> avail memory = 12408684544 (11833 MB)
>
> So that prompts vm.kmem_size=18G.
>
> From the other post:
> > As to arc_max/arc_min, set them based on your needs according to
> > general ZFS recommendations.
>
> I'm seriously at a loss what the general recommendations would be.

Have you asked Mr. Google? :)

- http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide
  (search for "Memory and Dynamic Reconfiguration Recommendation")
- http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Limiting_the_ARC_Cache

Short version: decide how much memory you need for everything else but the ZFS ARC. If the autotuned value suits you, then you don't need to change anything.

> The other box has 8G; loader.conf:
> vm.kmem_size=14G # 2* phys RAM size for ZFS perf.
> vm.kmem_size_scale=1

No need to set both of the above. vm.kmem_size overrides vm.kmem_size_scale.

> vfs.zfs.arc_min=1G
> vfs.zfs.arc_max=6G

So I'd select something like 11G for arc_max on a box with 12G mem.

-- 
Andriy Gapon
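The "short version" rule of thumb above is just a subtraction; a sketch for the 12GB box in this thread (the 1GB reserved for everything else is an assumed figure for illustration, not one taken from the thread):

```shell
# arc_max = physical RAM minus what the rest of the system needs.
ram_mb=12292        # real memory reported by the box in this thread
reserved_mb=1024    # assumption: ~1GB for kernel and userland
arc_max_mb=$(( ram_mb - reserved_mb ))
echo "vfs.zfs.arc_max=${arc_max_mb}M"   # ~11G, matching the suggestion above
```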
Re: Still getting kmem exhausted panic
On 28-9-2010 15:46, Andriy Gapon wrote: on 28/09/2010 16:25 Willem Jan Withagen said the following: Well, advice seems to vary, and the latest I understood was that 8.1-stable did not need any tuning. (The other system with a much older kernel is tuned as to what most here are suggesting.) And I was surely led to believe that ever since 8.0 panics were no longer among us... Well, now you have demonstrated yourself that it is not always so. I thought I should share the knowledge. ;) Which is not a bad thing for those (starting to) use ZFS. I do not read commits, but do read a lot of FreeBSD groups. And for me there is still a shroud of black art over ZFS. Just glad that my main fileserver doesn't crash. (Knock on wood.) vm.kmem_size vfs.zfs.arc_max real memory = 12889096192 (12292 MB) avail memory = 12408684544 (11833 MB) So that prompts vm.kmem_size=18G. From the other post: As to arc_max/arc_min, set them based on your needs according to general ZFS recommendations. I'm seriously at a loss as to what general recommendations would be. Have you asked Mr. Google? :) - http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide Search for Memory and Dynamic Reconfiguration Recommendation - http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Limiting_the_ARC_Cache Short version - decide how much memory you need for everything else but ZFS ARC. If the autotuned value suits you, then you don't need to change anything. I do have (read) this document, but still that doesn't really give you guidelines for tuning on FreeBSD. It is a fileserver without any serious other apps. I was using auto-tuned, and that crashed my box. That is what started this whole thread.
--WjW
Re: Still getting kmem exhausted panic
on 28/09/2010 17:02 Willem Jan Withagen said the following: I do have (read) this document, but still that doesn't really give you guidelines for tuning on FreeBSD. It is a fileserver without any serious other apps. I was using auto-tuned, and that crashed my box. That is what started this whole thread. Well, as I've said, in my opinion FreeBSD-specific tuning ends at setting kmem size. -- Andriy Gapon
Re: Still getting kmem exhausted panic
On 28-9-2010 16:07, Andriy Gapon wrote: on 28/09/2010 17:02 Willem Jan Withagen said the following: I do have (read) this document, but still that doesn't really give you guidelines for tuning on FreeBSD. It is a fileserver without any serious other apps. I was using auto-tuned, and that crashed my box. That is what started this whole thread. Well, as I've said, in my opinion FreeBSD-specific tuning ends at setting kmem size. I consider that a useful statement. --WjW
Re: Still getting kmem exhausted panic
on 28/09/2010 17:09 Willem Jan Withagen said the following: On 28-9-2010 16:07, Andriy Gapon wrote: on 28/09/2010 17:02 Willem Jan Withagen said the following: I do have (read) this document, but still that doesn't really give you guidelines for tuning on FreeBSD. It is a fileserver without any serious other apps. I was using auto-tuned, and that crashed my box. That is what started this whole thread. Well, as I've said, in my opinion FreeBSD-specific tuning ends at setting kmem size. I consider that a useful statement. Hm, looks like I've just given bad advice. It seems that the auto-tuned arc_max is based on kmem size. So if you use a kmem size that is larger than available physical memory, then you'd better limit arc_max to the available memory minus 1GB or so, if the autotuned value is larger than that. I think this needs to be fixed in the code. -- Andriy Gapon
Re: Still getting kmem exhausted panic
On 28-9-2010 16:25, Andriy Gapon wrote: on 28/09/2010 17:09 Willem Jan Withagen said the following: On 28-9-2010 16:07, Andriy Gapon wrote: on 28/09/2010 17:02 Willem Jan Withagen said the following: I do have (read) this document, but still that doesn't really give you guidelines for tuning on FreeBSD. It is a fileserver without any serious other apps. I was using auto-tuned, and that crashed my box. That is what started this whole thread. Well, as I've said, in my opinion FreeBSD-specific tuning ends at setting kmem size. I consider that a useful statement. Hm, looks like I've just given bad advice. It seems that the auto-tuned arc_max is based on kmem size. So if you use a kmem size that is larger than available physical memory, then you'd better limit arc_max to the available memory minus 1GB or so, if the autotuned value is larger than that. I think this needs to be fixed in the code. So in my case (no other serious apps) with 12G phys mem: vm.kmem_size=17G vfs.zfs.arc_max=11G --WjW
Re: Still getting kmem exhausted panic
on 28/09/2010 17:30 Willem Jan Withagen said the following: So in my case (no other serious apps) with 12G phys mem: vm.kmem_size=17G vfs.zfs.arc_max=11G Should be good. -- Andriy Gapon
Re: CPU time accounting broken on 8-STABLE machine after a few hours of uptime
Quoth Don Lewis on Monday, 27 September 2010: CPU time accounting is broken on one of my machines running 8-STABLE. I ran a test with a simple program that just loops and consumes CPU time: % time ./a.out 94.544u 0.000s 19:14.10 8.1% 62+2054k 0+0io 0pf+0w The display in top shows the process with WCPU at 100%, but TIME increments very slowly. Several hours after booting, I got a bunch of calcru: runtime went backwards messages, but they stopped right away and never appeared again. Aug 23 13:40:07 scratch ntpd[1159]: ntpd 4.2.4p5-a (1) Aug 23 13:43:18 scratch ntpd[1160]: kernel time sync status change 2001 Aug 23 18:05:57 scratch dbus-daemon: [system] Reloaded configuration Aug 23 18:06:16 scratch dbus-daemon: [system] Reloaded configuration Aug 23 18:12:40 scratch ntpd[1160]: time reset +18.059948 s [snip] Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 6836685136 usec to 5425839798 usec for pid 1526 (csh) Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 4747 usec to 2403 usec for pid 1519 (csh) Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 5265 usec to 2594 usec for pid 1494 (hald-addon-mouse-sy) Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 7818 usec to 3734 usec for pid 1488 (console-kit-daemon) Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 977 usec to 459 usec for pid 1480 (getty) Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 958 usec to 450 usec for pid 1479 (getty) Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 957 usec to 449 usec for pid 1478 (getty) Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 952 usec to 447 usec for pid 1477 (getty) Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 959 usec to 450 usec for pid 1476 (getty) Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 975 usec to 458 usec for pid 1475 (getty) Aug 23 23:49:06 scratch kernel: 
calcru: runtime went backwards from 1026 usec to 482 usec for pid 1474 (getty) Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 1333 usec to 626 usec for pid 1473 (getty) Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 2469 usec to 1160 usec for pid 1440 (inetd) Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 719 usec to 690 usec for pid 1402 (sshd) Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 120486 usec to 56770 usec for pid 1360 (cupsd) Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 6204 usec to 2914 usec for pid 1289 (dbus-daemon) Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 179 usec to 84 usec for pid 1265 (moused) Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 22156 usec to 10407 usec for pid 1041 (nfsd) Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 1292 usec to 607 usec for pid 1032 (mountd) Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 8801 usec to 4134 usec for pid 664 (devd) Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 19 usec to 9 usec for pid 9 (sctp_iterator) If I reboot and run the test again, the CPU time accounting seems to be working correctly. % time ./a.out 1144.226u 0.000s 19:06.62 99.7% 5+168k 0+0io 0pf+0w snip I notice that before the calcru messages, ntpd reset the clock by 18 seconds -- that probably accounts for that. I don't know if that has any connection to time(1) running slower -- but perhaps ntpd is aggressively adjusting your clock? -- Sterling (Chip) Camden| sterl...@camdensoftware.com | 2048D/3A978E4F http://camdensoftware.com | http://chipstips.com| http://chipsquips.com
Re: Still getting kmem exhausted panic
On Sep 28, 2010, at 9:36 AM, Andriy Gapon wrote: Well, no time for me to dig through all that history. arc_max should be a hard limit and it is now. If it ever wasn't then it was a bug. I believe the size of the arc could exceed the limit if your working set was larger than arc_max. The arc can't (couldn't then, anyway) evict data that is still referenced. A contributing factor at the time was that the page daemon did not take into account back pressure from the arc when deciding which pages to move from active to inactive, etc. So data was more likely to be referenced and therefore forced to remain in the arc. I'm not sure if this is still the current state. I seem to remember some changesets mentioning arc back pressure at some point, but I don't know the details. - Ben
Re: Still getting kmem exhausted panic
on 28/09/2010 18:50 Ben Kelly said the following: On Sep 28, 2010, at 9:36 AM, Andriy Gapon wrote: Well, no time for me to dig through all that history. arc_max should be a hard limit and it is now. If it ever wasn't then it was a bug. I believe the size of the arc could exceed the limit if your working set was larger than arc_max. The arc can't (couldn't then, anyway) evict data that is still referenced. I think that you are correct and I was wrong. ARC would still allocate a new buffer even if it's at or above arc_max and cannot re-use any existing buffer. But I think that this is more likely to happen with a tiny ARC size. I have a hard time imagining a workload at which gigabytes of data would be simultaneously and continuously used (see below for the definition of used). A contributing factor at the time was that the page daemon did not take into account back pressure from the arc when deciding which pages to move from active to inactive, etc. So data was more likely to be referenced and therefore forced to remain in the arc. I don't think that this is what happened and I don't think that pagedaemon has anything to do with the discussed issue. I think that ARC buffers exist independently of pagedaemon and page cache. I think that they are held only during the time when I/O is happening to or from them. I'm not sure if this is still the current state. I seem to remember some changesets mentioning arc back pressure at some point, but I don't know the details. I think that backpressure has nothing to do with it. If ZFS truly does I/O with all existing buffers and it needs a new buffer, then the choices are limited: either block and wait, or go over the limit. Apparently ZFS designers went with the latter option. But as I've said, for non-tiny ARC sizes it's hard to imagine such an amount of parallel I/O that would tie up all ARC buffers. Given the adaptive nature of ARC I still see it happening, but only when ARC size is near its minimum, not when it is at maximum.
It seems that kstat.zfs.misc.arcstats.recycle_miss is a counter of allocations when ARC refused to grow and no existing buffer could be recycled, but this is not the same as going above the ARC maximum size. BTW, such allocation over the limit could be considered a form of memory pressure from ARC on the rest of the system. P.S. The code is in arc_get_data_buf(). -- Andriy Gapon
Re: Still getting kmem exhausted panic
On Sep 28, 2010, at 12:30 PM, Andriy Gapon wrote: on 28/09/2010 18:50 Ben Kelly said the following: On Sep 28, 2010, at 9:36 AM, Andriy Gapon wrote: Well, no time for me to dig through all that history. arc_max should be a hard limit and it is now. If it ever wasn't then it was a bug. I believe the size of the arc could exceed the limit if your working set was larger than arc_max. The arc can't (couldn't then, anyway) evict data that is still referenced. I think that you are correct and I was wrong. ARC would still allocate a new buffer even if it's at or above arc_max and cannot re-use any existing buffer. But I think that this is more likely to happen with a tiny ARC size. I have a hard time imagining a workload at which gigabytes of data would be simultaneously and continuously used (see below for the definition of used). A contributing factor at the time was that the page daemon did not take into account back pressure from the arc when deciding which pages to move from active to inactive, etc. So data was more likely to be referenced and therefore forced to remain in the arc. I don't think that this is what happened and I don't think that pagedaemon has anything to do with the discussed issue. I think that ARC buffers exist independently of pagedaemon and page cache. I think that they are held only during the time when I/O is happening to or from them. Hmm. My server is currently idle with no I/O happening: kstat.zfs.misc.arcstats.c: 25165824 kstat.zfs.misc.arcstats.c_max: 46137344 kstat.zfs.misc.arcstats.size: 91863156 If what you say is true, this shouldn't happen, should it? This system is an i386 machine with kmem max at 800M and arc set to 40M. This is running head from April 6, 2010, so it is a bit old, though. At one point I had patches running on my system that triggered the pagedaemon based on arc load and it did allow me to keep my arc below the max. Or at least I thought it did.
In any case, I've never really been able to wrap my head around the VFS layer and how it interacts with zfs. So I'm more than willing to believe I'm confused. Any insights are greatly appreciated. Thanks! - Ben
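The over-limit condition Ben reports (arcstats.size above c_max) can be checked mechanically from sysctl output. A minimal sketch, assuming the "name: value" format quoted above; `arc_over_limit` is a made-up helper name, not an existing tool:

```shell
# Sketch: flag the condition reported above, kstat.zfs.misc.arcstats.size
# exceeding c_max.  arc_over_limit is a made-up name; feed it the output
# of: sysctl kstat.zfs.misc.arcstats.size kstat.zfs.misc.arcstats.c_max
arc_over_limit() {
    awk -F': ' '
        /arcstats\.size/  { size = $2 }
        /arcstats\.c_max/ { cmax = $2 }
        END { if (size + 0 > cmax + 0)
                  printf "ARC over limit by %d bytes\n", size - cmax }
    '
}
```

Against the numbers Ben posted (size 91863156, c_max 46137344), this reports the ARC roughly 45 MB over its configured maximum.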
Re: CPU time accounting broken on 8-STABLE machine after a few hours of uptime
On 28 Sep, Chip Camden wrote: Quoth Don Lewis on Monday, 27 September 2010: CPU time accounting is broken on one of my machines running 8-STABLE. I ran a test with a simple program that just loops and consumes CPU time: % time ./a.out 94.544u 0.000s 19:14.10 8.1% 62+2054k 0+0io 0pf+0w The display in top shows the process with WCPU at 100%, but TIME increments very slowly. Several hours after booting, I got a bunch of calcru: runtime went backwards messages, but they stopped right away and never appeared again. Aug 23 13:40:07 scratch ntpd[1159]: ntpd 4.2.4p5-a (1) Aug 23 13:43:18 scratch ntpd[1160]: kernel time sync status change 2001 Aug 23 18:05:57 scratch dbus-daemon: [system] Reloaded configuration Aug 23 18:06:16 scratch dbus-daemon: [system] Reloaded configuration Aug 23 18:12:40 scratch ntpd[1160]: time reset +18.059948 s [snip] Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 6836685136 usec to 5425839798 usec for pid 1526 (csh) Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 4747 usec to 2403 usec for pid 1519 (csh) Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 5265 usec to 2594 usec for pid 1494 (hald-addon-mouse-sy) Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 7818 usec to 3734 usec for pid 1488 (console-kit-daemon) Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 977 usec to 459 usec for pid 1480 (getty) Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 958 usec to 450 usec for pid 1479 (getty) Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 957 usec to 449 usec for pid 1478 (getty) Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 952 usec to 447 usec for pid 1477 (getty) Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 959 usec to 450 usec for pid 1476 (getty) Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 975 usec to 458 usec for pid 1475 (getty) Aug 23 
23:49:06 scratch kernel: calcru: runtime went backwards from 1026 usec to 482 usec for pid 1474 (getty) Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 1333 usec to 626 usec for pid 1473 (getty) Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 2469 usec to 1160 usec for pid 1440 (inetd) Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 719 usec to 690 usec for pid 1402 (sshd) Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 120486 usec to 56770 usec for pid 1360 (cupsd) Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 6204 usec to 2914 usec for pid 1289 (dbus-daemon) Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 179 usec to 84 usec for pid 1265 (moused) Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 22156 usec to 10407 usec for pid 1041 (nfsd) Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 1292 usec to 607 usec for pid 1032 (mountd) Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 8801 usec to 4134 usec for pid 664 (devd) Aug 23 23:49:06 scratch kernel: calcru: runtime went backwards from 19 usec to 9 usec for pid 9 (sctp_iterator) If I reboot and run the test again, the CPU time accounting seems to be working correctly. % time ./a.out 1144.226u 0.000s 19:06.62 99.7% 5+168k 0+0io 0pf+0w snip I notice that before the calcru messages, ntpd reset the clock by 18 seconds -- that probably accounts for that. Interesting observation. Since this happened so early in the log, I thought that this time change was the initial time change after boot, but taking a closer look, the time change occurred about 4 1/2 hours after boot. The calcru messages occurred another 5 1/2 hours after that. I also just noticed that this log info was from the August 23rd kernel, before I noticed the CPU time accounting problem, and not the latest occurrence.
Here's the latest log info: Sep 23 16:33:50 scratch ntpd[1144]: ntpd 4.2.4p5-a (1) Sep 23 16:37:03 scratch ntpd[1145]: kernel time sync status change 2001 Sep 23 17:43:47 scratch ntpd[1145]: time reset +276.133928 s Sep 23 17:43:47 scratch ntpd[1145]: kernel time sync status change 6001 Sep 23 17:47:15 scratch ntpd[1145]: kernel time sync status change 2001 Sep 23 19:02:48 scratch ntpd[1145]: time reset +291.507262 s Sep 23 19:02:48 scratch ntpd[1145]: kernel time sync status change 6001 Sep 23 19:06:37 scratch ntpd[1145]: kernel time sync status change 2001 Sep 24 00:03:36 scratch kernel: calcru: runtime went backwards from 1120690857 u sec to 367348485 usec for pid 1518 (csh) Sep 24 00:03:36 scratch kernel: calcru: runtime went backwards from 5403 usec to 466 usec for pid 1477 (hald-addon-mouse-sy) Sep 24 00:03:36 scratch kernel: calcru: runtime went backwards from 7511 usec to 1502 usec for pid 1472 (hald-runner) Sep 24 00:03:36
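The calcru lines above follow a fixed shape, so the size of each backward jump can be pulled out mechanically when scanning a log. A sketch, with field offsets keyed to the syslog layout quoted above; `parse_calcru` is a made-up helper name:

```shell
# Sketch: condense "calcru: runtime went backwards" kernel log lines
# into per-process backward jumps.  The awk field offsets assume the
# syslog layout quoted in this thread; parse_calcru is a made-up name.
parse_calcru() {
    awk '/calcru: runtime went backwards/ {
        from = $(NF - 8); to = $(NF - 5)
        printf "pid %s %s: jumped back %d usec\n", $(NF - 1), $NF, from - to
    }'
}
# Typical use: parse_calcru < /var/log/messages
```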
Re: Still getting kmem exhausted panic
on 28/09/2010 19:46 Ben Kelly said the following: Hmm. My server is currently idle with no I/O happening: kstat.zfs.misc.arcstats.c: 25165824 kstat.zfs.misc.arcstats.c_max: 46137344 kstat.zfs.misc.arcstats.size: 91863156 If what you say is true, this shouldn't happen, should it? This system is an i386 machine with kmem max at 800M and arc set to 40M. This is running head from April 6, 2010, so it is a bit old, though. Well, your system is a bit old indeed. And the branch is unknown, so I can't really see what sources you have. And I am not sure if I'll be able to say anything about those sources. As to the numbers - yes, with current code I'd expect arcstats.size to go down to arcstats.c when there is no I/O. arc_reclaim_thread should do that. At one point I had patches running on my system that triggered the pagedaemon based on arc load and it did allow me to keep my arc below the max. Or at least I thought it did. In any case, I've never really been able to wrap my head around the VFS layer and how it interacts with zfs. So I'm more than willing to believe I'm confused. Any insights are greatly appreciated. ARC is a ZFS private cache. ZFS doesn't use the unified buffer/page cache. So ARC is not directly affected by pagedaemon. But this is not exactly a VFS layer thing. -- Andriy Gapon
Re: Still getting kmem exhausted panic
on 28/09/2010 20:17 Andriy Gapon said the following: on 28/09/2010 19:46 Ben Kelly said the following: If what you say is true, this shouldn't happen, should it? This system is an i386 machine with kmem max at 800M and arc set to 40M. This is running head from April 6, 2010, so it is a bit old, though. Well, your system is a bit old indeed. And the branch is unknown, so I can't really see what sources you have. Apologies, I missed 'head' in your description of the system. -- Andriy Gapon
Re: CPU time accounting broken on 8-STABLE machine after a few hours of uptime
On Tue, Sep 28, 2010 at 10:15:34AM -0700, Don Lewis wrote: My time source is another FreeBSD box with a GPS receiver on my LAN. My other client machine isn't seeing these time jumps. The only messages from ntp in its log from this period are these: Sep 23 04:12:23 mousie ntpd[]: kernel time sync status change 6001 Sep 23 04:29:29 mousie ntpd[]: kernel time sync status change 2001 Sep 24 03:55:24 mousie ntpd[]: kernel time sync status change 6001 Sep 24 04:12:28 mousie ntpd[]: kernel time sync status change 2001 I'm speaking purely about ntpd below this point -- almost certainly a separate problem/issue, but I'll explain it anyway. I'm not under the impression that the calcru messages indicate RTC clock drift, but I'd need someone like John Baldwin to validate my statement. Back to ntpd: you can address the above messages by adding maxpoll 9 to your server lines in ntp.conf. The comment we use in our ntp.conf that documents the well-known problem: # maxpoll 9 is used to work around PLL/FLL flipping, which happens at # exactly 1024 seconds (the default maxpoll value). Another FreeBSD # user recommended using 9 instead: # http://lists.freebsd.org/pipermail/freebsd-stable/2006-December/031512.html I don't know if that has any connection to time(1) running slower -- but perhaps ntpd is aggressively adjusting your clock? It seems to be pretty stable when the machine is idle: % ntpq -c pe remote refid st t when poll reach delay offset jitter == *gw.catspoiler.o .GPS. 1 u 8 64 377 0.168 -0.081 0.007 Not too much degradation under CPU load: % ntpq -c pe remote refid st t when poll reach delay offset jitter == *gw.catspoiler.o .GPS. 1 u 40 64 377 0.166 -0.156 0.026 I/O (dd if=/dev/ad6 of=/dev/null bs=512) doesn't appear to bother it much, either. % ntpq -c pe remote refid st t when poll reach delay offset jitter == *gw.catspoiler.o .GPS. 1 u 35 64 377 0.169 -0.106 0.009 Still speaking purely about ntpd: The above doesn't indicate a single problem.
The deltas shown in delay, offset, and jitter are all 100% legitimate. A dd (to induce more interrupt use) isn't going to exacerbate the problem (depending on your system configuration, IRQ setup, local APIC, etc.). How about writing a small shell script that runs every minute in a cronjob that does vmstat -i >> /some/file.log? Then when you see calcru messages, look around the time frame where vmstat -i was run. Look for high interrupt rates, aside from those associated with cpuX devices. Next, you need to let ntpd run for quite a bit longer than what you did above. Your poll maximum is only 64, indicating ntpd had recently been restarted, or that your offset deviates greatly (my guess is ntpd being restarted). poll will increase over time (64, 128, 256, 512, and usually max out at 1024), depending on how stable the clock is. when is a counter that increments, and does clock syncing (if needed) once it reaches poll. You'd see unstable system clock indications in your syslog as well (indicated by actual +/- clock drift lines occurring regularly. These aren't the same as 2001/6001 PLL/FLL mode flipping). Sorry if this is a bit much to take in. You might also try stopping ntpd, removing /var/db/ntpd.drift, and restarting ntpd -- then check back in about 48 hours (no, I'm not kidding). This is especially necessary if you've replaced the motherboard or taken the disks from System A and stuck them in System B. All that said: I'm not convinced ntpd has anything to do with your problem. EIST or EIST-like capabilities (such as Cool'n'Quiet) are often the source of the problem. device cpufreq might solve your issue entirely, hard to say. -- | Jeremy Chadwick j...@parodius.com | | Parodius Networking http://www.parodius.com/ | | UNIX Systems Administrator Mountain View, CA, USA | | Making life hard for others since 1977.
PGP: 4BD6C0CB |
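The once-a-minute logger suggested in this message could look like the following sketch; the log path and crontab schedule are illustrative, not anything the poster specified:

```shell
#!/bin/sh
# Sketch of the per-minute interrupt logger suggested above: append a
# timestamped "vmstat -i" snapshot so later calcru messages can be
# matched against interrupt-rate spikes.  The log path is illustrative.
# Crontab entry (illustrative): * * * * * /usr/local/sbin/log-vmstat.sh
LOG="${LOG:-/tmp/vmstat-i.log}"
{
    date "+%Y-%m-%d %H:%M:%S"
    vmstat -i 2>&1
    echo "---"
} >> "$LOG"
```

When a calcru burst shows up in /var/log/messages, the snapshots around that timestamp show whether any interrupt rate (other than the cpuX timer lines) spiked at the same moment.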
fetch: Non-recoverable resolver failure
Hi, we are using the fetch command from cron to run PHP scripts periodically, and sometimes cron sends error e-mails like this: fetch: https://hiden.example.com/cron/fiveminutes: Non-recoverable resolver failure The exact lines from crontab are: */5 * * * * fetch -qo /dev/null https://hiden.example.com/cron/fiveminutes; */5 * * * * fetch -qo /dev/null http://another.example.com/wd.php?hash=cslhakjs87LJ3rysalj79; The network is working without problems, and the resolvers are working fine too. I also tried to use a local instance of named at 127.0.0.1, but it did not fix the issue, so it seems there is some problem with fetch in the address-resolution phase. Note: the target domains are hosted on the server itself, and so is named. The system is FreeBSD 7.3-RELEASE-p2 i386 GENERIC. Can somebody help me diagnose this random fetch+resolver issue? Miroslav Lachman
Re: Still getting kmem exhausted panic
On Sep 28, 2010, at 1:17 PM, Andriy Gapon wrote: on 28/09/2010 19:46 Ben Kelly said the following: Hmm. My server is currently idle with no I/O happening: kstat.zfs.misc.arcstats.c: 25165824 kstat.zfs.misc.arcstats.c_max: 46137344 kstat.zfs.misc.arcstats.size: 91863156 If what you say is true, this shouldn't happen, should it? This system is an i386 machine with kmem max at 800M and arc set to 40M. This is running head from April 6, 2010, so it is a bit old, though. Well, your system is a bit old indeed. And the branch is unknown, so I can't really see what sources you have. And I am not sure if I'll be able to say anything about those sources. Quite old. I've been intending to update, but haven't found the time lately. I'll try to do the upgrade this weekend and see if it changes anything. As to the numbers - yes, with current code I'd expect arcstats.size to go down to arcstats.c when there is no I/O. arc_reclaim_thread should do that. That's what I thought as well, but when I debugged it a year or two ago I found that the buffers were still referenced and thus could not be reclaimed. As far as I can remember they needed a vfs/vnops like zfs_vnops_inactive or zfs_vnops_reclaim to be executed in order to free the reference. What is responsible for making those calls? At one point I had patches running on my system that triggered the pagedaemon based on arc load and it did allow me to keep my arc below the max. Or at least I thought it did. In any case, I've never really been able to wrap my head around the VFS layer and how it interacts with zfs. So I'm more than willing to believe I'm confused. Any insights are greatly appreciated. ARC is a ZFS private cache. ZFS doesn't use the unified buffer/page cache. So ARC is not directly affected by pagedaemon. But this is not exactly a VFS layer thing. Can you explain the difference in how the vfs/vnode operations are called or used for those two situations?
I thought that the buffer cache was used by filesystems to implement these operations, so that the buffer cache was below the vfs/vnops layer. So while zfs implemented its operations in terms of the arc, things like UFS implemented vfs/vnops in terms of the buffer cache. I thought the layers further up the chain, like the page daemon, did not distinguish that much between these two implementations due to the VFS interface layer. (Although there seems to be a layering violation in that the buffer cache signals directly to the upper page daemon layer to trigger page reclamation.) The old (ancient) patch I tried previously to help reduce the arc working set and allow it to shrink is here: http://www.wanderview.com/svn/public/misc/zfs/zfs_kmem_limit.diff Unfortunately, there are a couple of ideas on fighting fragmentation mixed into that patch. See the part about arc_reclaim_pages(). This patch did seem to allow my arc to stay under the target maximum even under load that previously caused the system to exceed the maximum. When I update this weekend I'll try a stripped-down version of the patch to see if it helps or not with the latest zfs. Thanks for your help in understanding this stuff! - Ben
Re: fetch: Non-recoverable resolver failure
On Tue, Sep 28, 2010 at 08:12:00PM +0200, Miroslav Lachman wrote: Hi, we are using fetch command from cron to run PHP scripts periodically and sometimes cron sends error e-mails like this: fetch: https://hiden.example.com/cron/fiveminutes: Non-recoverable resolver failure The exact lines from crontab are: */5 * * * * fetch -qo /dev/null https://hiden.example.com/cron/fiveminutes; */5 * * * * fetch -qo /dev/null http://another.example.com/wd.php?hash=cslhakjs87LJ3rysalj79; Network is working without problems, resolvers are working fine too. I also tried to use local instance of named at 127.0.0.1 but it did not fix the issue so it seems there is some problem with fetch in phase of resolving address. Note: target domains are hosted on the server it-self and named too. The system is FreeBSD 7.3-RELEASE-p2 i386 GENERIC Can somebody help me to diagnose this random fetch+resolver issue? The error in question comes from the resolver library returning EAI_FAIL. This return code can be returned to all sorts of applications (not just fetch), although how each app handles it may differ. So, chances are you really do have something going on upstream from you (one of the nameservers you use might not be available at all times), and it probably clears very quickly (before you have a chance to manually/interactively investigate it). You're probably going to have to set up a combination of scripts that do tcpdump logging, and ktrace -t+ -i (and probably -a) logging (ex. ktrace -t+ -i -a -f /var/log/ktrace.fetch.out fetch -qo ...) to find out what's going on behind the scenes. The irregularity of the problem (re: sometimes) warrants such. I'd recommend using something other than 127.0.0.1 as your resolver if you need to do tcpdump. Providing contents of your /etc/resolv.conf, as well as details about your network configuration on the machine (specifically if any firewall stacks (pf or ipfw) are in place) would help too. Some folks might want netstat -m output as well. 
-- | Jeremy Chadwick j...@parodius.com | | Parodius Networking http://www.parodius.com/ | | UNIX Systems Administrator Mountain View, CA, USA | | Making life hard for others since 1977. PGP: 4BD6C0CB |
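Jeremy's ktrace suggestion can be automated so a trace is only kept when fetch actually fails, which keeps /var/log from filling up between the every-five-minutes successes. A sketch; the `keep_trace` helper and the paths are illustrative, not from the thread:

```shell
#!/bin/sh
# Decide whether a ktrace dump is worth keeping, based on the wrapped
# command's exit status: keep on failure, discard on success.
keep_trace() {
    if [ "$1" -ne 0 ]; then echo keep; else echo discard; fi
}

# Hypothetical cron wrapper using the ktrace flags suggested above:
#   TRACE=/var/log/ktrace.fetch.$$
#   ktrace -t+ -i -a -f "$TRACE" fetch -qo /dev/null "$URL"; rc=$?
#   [ "$(keep_trace "$rc")" = keep ] || rm -f "$TRACE"
#   exit "$rc"
```

That way the intermittent failures leave a trace behind for later inspection without manual intervention at the moment of failure.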
[releng_8 tinderbox] failure on i386/pc98
TB --- 2010-09-28 18:55:35 - tinderbox 2.6 running on freebsd-current.sentex.ca
TB --- 2010-09-28 18:55:35 - starting RELENG_8 tinderbox run for i386/pc98
TB --- 2010-09-28 18:55:35 - cleaning the object tree
TB --- 2010-09-28 18:58:07 - cvsupping the source tree
TB --- 2010-09-28 18:58:07 - /usr/bin/csup -z -r 3 -g -L 1 -h cvsup5.freebsd.org /tinderbox/RELENG_8/i386/pc98/supfile
TB --- 2010-09-28 19:04:25 - WARNING: /usr/bin/csup returned exit code 1
TB --- 2010-09-28 19:04:25 - ERROR: unable to cvsup the source tree
TB --- 2010-09-28 19:04:25 - 2.27 user 163.56 system 530.08 real
http://tinderbox.freebsd.org/tinderbox-releng_8-RELENG_8-i386-pc98.full
Re: resume slow on Thinkpad T42 FreeBSD 8-STABLE
Jung-uk Kim wrote: - the mouse doesn't work until I restart moused manually I always use hint.psm.0.flags=0x6000 in /boot/loader.conf, i.e., turn on both HOOKRESUME and INITAFTERSUSPEND, to work around a similar problem on a different laptop. Yes, that helps (after the stall period). Can you please report other problems in the appropriate ML? em - freebsd-net@ usb - freebsd-usb@ acpi_ec - freebsd-acpi@ I will try to do so. I'm not sure about the acpi_ec issue though; it's only a warning, and it doesn't cause me any troubles. I also have this kernel message once in a few hours (seemingly random) if I used sleep/resume before: MCA: Bank 1, Status 0xe20001f5 MCA: Global Cap 0x0005, Status 0x MCA: Vendor GenuineIntel, ID 0x695, APIC ID 0 MCA: CPU 0 UNCOR PCC OVER DCACHE L1 ??? error But once again, it doesn't really cause any problems.
Re: resume slow on Thinkpad T42 FreeBSD 8-STABLE
On Sep 28, 2010, at 12:57 PM, Vitaly Magerya wrote: I also have this kernel message once in a few hours (seemingly random) if I used sleep/resume before: MCA: Bank 1, Status 0xe20001f5 MCA: Global Cap 0x0005, Status 0x MCA: Vendor GenuineIntel, ID 0x695, APIC ID 0 MCA: CPU 0 UNCOR PCC OVER DCACHE L1 ??? error But once again, it doesn't really cause any problems. That is very likely to be a matter of luck. If I translate this MCA right, it looks to be an uncorrected error in the L1 data cache on the CPU. Try to run something like prime95's torture test mode and see whether it fails overnight. Regards, -- -Chuck
Re: resume slow on Thinkpad T42 FreeBSD 8-STABLE
Chuck Swiger wrote: MCA: Bank 1, Status 0xe20001f5 MCA: Global Cap 0x0005, Status 0x MCA: Vendor GenuineIntel, ID 0x695, APIC ID 0 MCA: CPU 0 UNCOR PCC OVER DCACHE L1 ??? error That is very likely to be a matter of luck. If I translate this MCA right, it looks to be an uncorrected error in L1 data cache on the CPU. Try to run something like prime95's torture test mode and see whether it fails overnight OK, started the test (it's math/mprime, for those who wonder).
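While the overnight torture test runs, it's worth counting whether new MCA records appear in the kernel message buffer. A sketch; `count_mca` is an illustrative helper that just counts MCA lines in whatever text it is fed:

```shell
#!/bin/sh
# Count MCA records in kernel-message text, to pair with an overnight
# mprime torture run: a rising count during load points at the CPU.
count_mca() {
    grep -c '^MCA:'
}

# Hypothetical usage on the laptop in question:
#   dmesg | count_mca
```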
Re: fetch: Non-recoverable resolver failure
Jeremy Chadwick wrote: On Tue, Sep 28, 2010 at 08:12:00PM +0200, Miroslav Lachman wrote: Hi, we are using fetch command from cron to run PHP scripts periodically and sometimes cron sends error e-mails like this: fetch: https://hiden.example.com/cron/fiveminutes: Non-recoverable resolver failure [...] Note: target domains are hosted on the server it-self and named too. The system is FreeBSD 7.3-RELEASE-p2 i386 GENERIC Can somebody help me to diagnose this random fetch+resolver issue? The error in question comes from the resolver library returning EAI_FAIL. This return code can be returned to all sorts of applications (not just fetch), although how each app handles it may differ. So, chances are you really do have something going on upstream from you (one of the nameservers you use might not be available at all times), and it probably clears very quickly (before you have a chance to manually/interactively investigate it). The strange thing is that I have only one nameserver listed in resolv.conf and it is the local one! (127.0.0.1) (there were two remote nameservers, but I tried to switch to the local one to rule out remote nameservers / network problems) You're probably going to have to set up a combination of scripts that do tcpdump logging, and ktrace -t+ -i (and probably -a) logging (ex. ktrace -t+ -i -a -f /var/log/ktrace.fetch.out fetch -qo ...) to find out what's going on behind the scenes. The irregularity of the problem (re: sometimes) warrants such. I'd recommend using something other than 127.0.0.1 as your resolver if you need to do tcpdump. I will try it... there will be a lot of output as there are many cronjobs and relatively high traffic on the webserver. But the fetch resolver failure occurred only a few times a day. Providing contents of your /etc/resolv.conf, as well as details about your network configuration on the machine (specifically if any firewall stacks (pf or ipfw) are in place) would help too. Some folks might want netstat -m output as well.
There is nothing special in the network: the machine is a Sun Fire X2100 M2 with a bge1 NIC connected to a Cisco Linksys switch (100Mbps port), with an uplink (1Gbps port) connected to a Cisco router with dual 10Gbps connectivity. No firewalls in the path. There are more than 10 other servers in the rack and we have no problems / error messages in logs from other services / daemons related to DNS.

# cat /etc/resolv.conf
nameserver 127.0.0.1

# netstat -m
279/861/1140 mbufs in use (current/cache/total)
257/553/810/25600 mbuf clusters in use (current/cache/total/max)
257/313 mbuf+clusters out of packet secondary zone in use (current/cache)
5/306/311/12800 4k (page size) jumbo clusters in use (current/cache/total/max)
0/0/0/6400 9k jumbo clusters in use (current/cache/total/max)
0/0/0/3200 16k jumbo clusters in use (current/cache/total/max)
603K/2545K/3149K bytes allocated to network (current/cache/total)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
13/470/6656 sfbufs in use (current/peak/max)
0 requests for sfbufs denied
0 requests for sfbufs delayed
3351782 requests for I/O initiated by sendfile
0 calls to protocol drain routines

(real IPs were replaced)
# ifconfig bge1
bge1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
	options=9b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM>
	ether 00:1e:68:2f:71:ab
	inet 1.2.3.40 netmask 0xffffff80 broadcast 1.2.3.127
	inet 1.2.3.41 netmask 0xffffffff broadcast 1.2.3.41
	inet 1.2.3.42 netmask 0xffffffff broadcast 1.2.3.42
	media: Ethernet autoselect (100baseTX <full-duplex>)
	status: active

NIC is:
b...@pci0:6:4:1: class=0x02 card=0x534c108e chip=0x167814e4 rev=0xa3 hdr=0x00
	vendor = 'Broadcom Corporation'
	device = 'BCM5715C 10/100/100 PCIe Ethernet Controller'
	class = network
	subclass = ethernet

There is PF with some basic rules, mostly blocking incoming packets, allowing all outgoing and scrubbing:

scrub in on bge1 all fragment reassemble
scrub out on bge1 all no-df
    random-id min-ttl 24 max-mss 1492 fragment reassemble
pass out on bge1 inet proto udp all keep state
pass out on bge1 inet proto tcp from 1.2.3.40 to any flags S/SA modulate state
pass out on bge1 inet proto tcp from 1.2.3.41 to any flags S/SA modulate state
pass out on bge1 inet proto tcp from 1.2.3.42 to any flags S/SA modulate state

modified PF options:

set timeout { frag 15, interval 5 }
set limit { frags 2500, states 5000 }
set optimization aggressive
set block-policy drop
set loginterface bge1
# Let loopback and internal interface traffic flow without restrictions
set skip on lo0

Thank you for your suggestions

Miroslav Lachman
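One thing worth checking with the `set limit { frags 2500, states 5000 }` configuration above is whether the state table ever approaches its ceiling, since new outbound DNS lookups need fresh state entries. A sketch; `current_states` is an illustrative helper that parses the figure out of `pfctl -si` output:

```shell
#!/bin/sh
# Extract the state-table "current entries" figure from `pfctl -si`
# output, so it can be compared against the configured state limit.
current_states() {
    awk '/current entries/ { print $3 }'
}

# Hypothetical cron usage on the affected box:
#   pfctl -si | current_states >> /var/log/pf-states.log
```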
Re: CPU time accounting broken on 8-STABLE machine after a few hours of uptime
On 28 Sep, Jeremy Chadwick wrote: On Tue, Sep 28, 2010 at 10:15:34AM -0700, Don Lewis wrote: My time source is another FreeBSD box with a GPS receiver on my LAN. My other client machine isn't seeing these time jumps. The only messages from ntp in its log from this period are these:

Sep 23 04:12:23 mousie ntpd[]: kernel time sync status change 6001
Sep 23 04:29:29 mousie ntpd[]: kernel time sync status change 2001
Sep 24 03:55:24 mousie ntpd[]: kernel time sync status change 6001
Sep 24 04:12:28 mousie ntpd[]: kernel time sync status change 2001

I'm speaking purely about ntpd below this point -- almost certainly a separate problem/issue, but I'll explain it anyway. I'm not under the impression that the calcru messages indicate RTC clock drift, but I'd need someone like John Baldwin to validate my statement. I don't think the problems are directly related. I think the calcru messages get triggered by clock frequency changes that get detected and change the tick to usec conversion ratio. Back to ntpd: you can address the above messages by adding maxpoll 9 to your server lines in ntp.conf. The comment we use in our ntp.conf that documents the well-known problem: Thanks, I'll try that.

# maxpoll 9 is used to work around PLL/FLL flipping, which happens at
# exactly 1024 seconds (the default maxpoll value). Another FreeBSD
# user recommended using 9 instead:
# http://lists.freebsd.org/pipermail/freebsd-stable/2006-December/031512.html

I don't know if that has any connection to time(1) running slower -- but perhaps ntpd is aggressively adjusting your clock?
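Concretely, the workaround amounts to one extra word per server line in /etc/ntp.conf (the hostname below is a placeholder):

```
# /etc/ntp.conf -- cap the poll interval at 2^9 = 512s, below the 1024s
# default at which the PLL/FLL flipping occurs
server gw.example.net maxpoll 9
```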
It seems to be pretty stable when the machine is idle:

% ntpq -c pe
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*gw.catspoiler.o .GPS.           1 u    8   64  377    0.168   -0.081   0.007

Not too much degradation under CPU load:

% ntpq -c pe
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*gw.catspoiler.o .GPS.           1 u   40   64  377    0.166   -0.156   0.026

I/O (dd if=/dev/ad6 of=/dev/null bs=512) doesn't appear to bother it much, either.

% ntpq -c pe
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*gw.catspoiler.o .GPS.           1 u   35   64  377    0.169   -0.106   0.009

Still speaking purely about ntpd: The above doesn't indicate a single problem. The deltas shown in both delay, offset, and jitter are all 100% legitimate. A dd (to induce more interrupt use) isn't going to exacerbate the problem (depending on your system configuration, IRQ setup, local APIC, etc.). I was hoping to do something to provoke clock interrupt loss. The last two times that the calcru messages have occurred were when I booted this machine to build a bunch of ports. I don't see any problems when this machine is idle. Offset and jitter always look really good whenever I've looked. How about writing a small shell script that runs every minute in a cronjob that does vmstat -i >> /some/file.log? Then when you see calcru messages, look around the time frame where vmstat -i was run. Look for high interrupt rates, aside from those associated with cpuX devices. Ok, I'll give this a try. Just for reference, this is what is currently reported:

% vmstat -i
interrupt                          total       rate
irq0: clk                       60683442       1000
irq1: atkbd0                           6          0
irq8: rtc                        7765537        127
irq9: acpi0                           13          0
irq10: ohci0 ehci1+             10275064        169
irq11: fwohci0 ahc+               132133          2
irq12: psm0                           21          0
irq14: ata0                        90982          1
irq15: nfe0 ata1                   18363          0

I'm not sure why I'm getting USB interrupts. There aren't any USB devices plugged into this machine.
# usbconfig dump_info
ugen0.1: <OHCI root HUB nVidia> at usbus0, cfg=0 md=HOST spd=FULL (12Mbps) pwr=ON
ugen1.1: <EHCI root HUB nVidia> at usbus1, cfg=0 md=HOST spd=HIGH (480Mbps) pwr=ON
ugen2.1: <OHCI root HUB nVidia> at usbus2, cfg=0 md=HOST spd=FULL (12Mbps) pwr=ON
ugen3.1: <EHCI root HUB nVidia> at usbus3, cfg=0 md=HOST spd=HIGH (480Mbps) pwr=ON

Next, you need to let ntpd run for quite a bit longer than what you did above. Your poll maximum is only 64, indicating ntpd had recently been restarted, or that your offset deviates greatly (my guess is ntpd being restarted). poll will increase over time (64, 128, 256, 512, and usually max out at 1024), depending on how stable the clock is. when is a counter
Re: CPU time accounting broken on 8-STABLE machine after a few hours of uptime
On 28 Sep, Don Lewis wrote:

% vmstat -i
interrupt                          total       rate
irq0: clk                       60683442       1000
irq1: atkbd0                           6          0
irq8: rtc                        7765537        127
irq9: acpi0                           13          0
irq10: ohci0 ehci1+             10275064        169
irq11: fwohci0 ahc+               132133          2
irq12: psm0                           21          0
irq14: ata0                        90982          1
irq15: nfe0 ata1                   18363          0

I'm not sure why I'm getting USB interrupts. There aren't any USB devices plugged into this machine.

Answer: irq 10 is also shared by vgapci0 and atapci1.
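The per-minute vmstat -i logging suggested earlier in this thread can be done with a tiny helper plus a crontab entry. A sketch; the script path and log path are arbitrary, and `log_snapshot` is an illustrative helper, not from the thread:

```shell
#!/bin/sh
# Append a timestamped snapshot of a command's output to a log file,
# so interrupt counts can later be correlated with calcru messages.
# Note: $cmd is deliberately left unquoted so its arguments word-split.
log_snapshot() {
    cmd=$1; log=$2
    { date; $cmd; echo; } >> "$log"
}

# Intended use on the affected box, per the suggestion above:
#   log_snapshot "vmstat -i" /var/log/vmstat-i.log
# driven by an /etc/crontab line such as:
#   * * * * * root /usr/local/sbin/log-vmstat.sh
```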
Re: Still getting kmem exhausted panic
on 28/09/2010 21:40 Ben Kelly said the following: On Sep 28, 2010, at 1:17 PM, Andriy Gapon wrote: on 28/09/2010 19:46 Ben Kelly said the following: Hmm. My server is currently idle with no I/O happening: kstat.zfs.misc.arcstats.c: 25165824 kstat.zfs.misc.arcstats.c_max: 46137344 kstat.zfs.misc.arcstats.size: 91863156 If what you say is true, this shouldn't happen, should it? This system is an i386 machine with kmem max at 800M and arc set to 40M. This is running head from April 6, 2010, so it is a bit old, though. Well, your system is a bit old indeed. And the branch is unknown, so I can't really see what sources you have. And I am not sure if I'll be able to say anything about those sources. Quite old. I've been intending to update, but haven't found the time lately. I'll try to do the upgrade this weekend and see if it changes anything. As to the numbers - yes, with current code I'd expect arcstats.size to go down to arcstats.c when there is no I/O. arc_reclaim_thread should do that. That's what I thought as well, but when I debugged it a year or two ago I found that the buffers were still referenced and thus could not be reclaimed. As far as I can remember they needed a vfs/vnops like zfs_vnops_inactive or zfs_vnops_reclaim to be executed in order to free the reference. What is responsible for making those calls? It's time that we should start showing each other places in code :) Because I don't think that that's how the code works. E.g. I look at how zfs_read() calls dmu_read_uio() which calls dmu_buf_hold_array() and dmu_buf_rele_array() around the uiomove() call. From what I see, dmu_buf_hold_array() calls dmu_buf_hold_array_by_dnode() calls dbuf_hold() calls arc_buf_add_ref() or arc_buf_alloc(). And conversely, dmu_buf_rele_array() calls dbuf_rele() calls arc_buf_remove_ref(). So, I am quite sure that ARC buffers are held/referenced only during ongoing I/O to or from them.
Perhaps, on the other hand, you had in mind life-cycle of other things (not ARC buffers) that are accounted against ARC size (with type ARC_SPACE_OTHER)? Such as e.g. dmu_buf_impl_t-s allocated in dbuf_create(). I have to admit that I haven't investigated behavior of that part of ARC-assigned memory. It's only a small proportion (~10%) of the whole ARC size on my systems. At one point I had patches running on my system that triggered the pagedaemon based on arc load and it did allow me to keep my arc below the max. Or at least I thought it did. In any case, I've never really been able to wrap my head around the VFS layer and how it interacts with zfs. So I'm more than willing to believe I'm confused. Any insights are greatly appreciated. ARC is a ZFS private cache. ZFS doesn't use unified buffer/page cache. So ARC is not directly affected by pagedaemon. But this is not exactly VFS layer thing. Can you explain the difference in how the vfs/vnode operations are called or used for those two situations? They are called exactly the same. VFS layer and code above it are not aware of FS implementation details. I thought that the buffer cache was used by filesystems to implement these operations. So that the buffer cache was below the vfs/vnops layer. So Buffer cache works as part of unified VM and its buffers use the same pages as page cache does. while zfs implemented its operations in terms of the arc, things like UFS implemented vfs/vnops in terms of the buffer cache. I thought the layers Yes. Filesystems like UFS are sandwiched between buffer cache and page cache, which work in concert. Also, they don't (have to) implement their own buffer/page caching policies, because it's all managed by unified VM system. On the contrary, ZFS has its own private cache. So, first of all, its data may be cached in two places at once - page cache and ARC. 
And, because of that, some assumptions of the higher level code get violated, so ZFS has to jump through the hoops to meet those assumptions (e.g. see UIO_NOCOPY). further up the chain like the page daemon did not distinguish that much between these two implementation due to the VFS interface layer. (Although Right, but see above. there seems to be a layering violation in that the buffer cache signals directly to the upper page daemon layer to trigger page reclamation.) Umm, not sure if that is a fact. The old (ancient) patch I tried previously to help reduce the arc working set and allow it to shrink is here: http://www.wanderview.com/svn/public/misc/zfs/zfs_kmem_limit.diff Unfortunately, there are a couple ideas on fighting fragmentation mixed into that patch. See the part about arc_reclaim_pages(). This patch did seem to allow my arc to stay under the target maximum even when under load that previously caused the system to exceed the maximum. When I update this weekend I'll try a stripped down version of the patch to see if it helps or not with the latest zfs. Thanks for your help in
Re: Still getting kmem exhausted panic
On Sep 28, 2010, at 5:30 PM, Andriy Gapon wrote: snipped lots of good info here... probably won't have time to look at it in detail until the weekend there seems to be a layering violation in that the buffer cache signals directly to the upper page daemon layer to trigger page reclamation.) Umm, not sure if that is a fact. I was referring to the code in vfs_bio.c that used to twiddle vm_pageout_deficit directly. That seems to have been replaced with a call to vm_page_grab(). The old (ancient) patch I tried previously to help reduce the arc working set and allow it to shrink is here: http://www.wanderview.com/svn/public/misc/zfs/zfs_kmem_limit.diff Unfortunately, there are a couple ideas on fighting fragmentation mixed into that patch. See the part about arc_reclaim_pages(). This patch did seem to allow my arc to stay under the target maximum even when under load that previously caused the system to exceed the maximum. When I update this weekend I'll try a stripped down version of the patch to see if it helps or not with the latest zfs. Thanks for your help in understanding this stuff! The patch seems good, especially the part about taking into account the kmem fragmentation. But it also seems to be heavily tuned towards tiny ARC systems like yours, so I am not sure yet how suitable it is for mainstream systems. Thanks. Yea, there is a lot of aggressive tuning there. In particular, the slow growth algorithm is somewhat dubious. What I found, though, was that the fragmentation jumped whenever the arc was reduced in size, so it was an attempt to make the size slowly approach peak load without overshooting. A better long term solution would probably be to enhance UMA to support custom slab sizes on a zone-by-zone basis. That way all zfs/arc allocations can use slabs of 128k (at a memory efficiency penalty of course). I prototyped this with a dumbed down block pool allocator at one point and was able to avoid most, if not all, of the fragmentation. 
Adding the support to UMA seemed non-trivial, though. Thanks again for the information. I hope to get a chance to look at the code this weekend. - Ben
Re: Still getting kmem exhausted panic
on 29/09/2010 01:01 Ben Kelly said the following: Thanks. Yea, there is a lot of aggressive tuning there. In particular, the slow growth algorithm is somewhat dubious. What I found, though, was that the fragmentation jumped whenever the arc was reduced in size, so it was an attempt to make the size slowly approach peak load without overshooting. A better long term solution would probably be to enhance UMA to support custom slab sizes on a zone-by-zone basis. That way all zfs/arc allocations can use slabs of 128k (at a memory efficiency penalty of course). I prototyped this with a dumbed down block pool allocator at one point and was able to avoid most, if not all, of the fragmentation. Adding the support to UMA seemed non-trivial, though. BTW, have you seen my posts about UMA and ZFS on hackers@ ? I found it advantageous to use UMA for ZFS I/O buffers, but only after reducing size of per-CPU caches for the zones with large-sized items. I further modified the code in my local tree to completely disable per-CPU caches for items of 32KB and larger. -- Andriy Gapon
Still getting kmem exhausted panic
Thanks for the clarification. I just wish I knew how vm.kmem_size_scale fit into the picture (meaning what it does, etc.). The sysctl description isn't very helpful. Again, my lack of VM knowledge... Roughly, vm.kmem_size would get set to available memory divided by vm.kmem_size_scale. http://lists.freebsd.org/pipermail/freebsd-stable/2010-September/059114.html Thanks again for the explanation, I was amiss after the post above. So increasing kmem_size_scale will reduce the resulting kmem_size. /* correct me if I'm wrong - "divided by" triggered this post */
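Given the "divided by" semantics described above, the auto-sized value can be sanity-checked by hand. A sketch only: integer division, and the kernel additionally clamps the result against compile-time min/max bounds, which this ignores.

```shell
#!/bin/sh
# Rough expected vm.kmem_size: available memory divided by
# vm.kmem_size_scale (no clamping, unlike the real kernel logic).
expected_kmem_size() {
    physmem=$1; scale=$2
    echo $(( physmem / scale ))
}

# On a live system (hypothetical usage):
#   expected_kmem_size "$(sysctl -n hw.physmem)" "$(sysctl -n vm.kmem_size_scale)"
```

For example, a 4 GB box with a scale of 3 would get roughly 1.33 GB of kmem, and raising the scale to 4 shrinks that to 1 GB, matching the "increasing kmem_size_scale will reduce the resulting kmem_size" reading above.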
[releng_8 tinderbox] failure on mips/mips
TB --- 2010-09-28 19:44:25 - tinderbox 2.6 running on freebsd-current.sentex.ca
TB --- 2010-09-28 19:44:25 - starting RELENG_8 tinderbox run for mips/mips
TB --- 2010-09-28 19:44:25 - cleaning the object tree
TB --- 2010-09-28 19:45:51 - cvsupping the source tree
TB --- 2010-09-28 19:45:51 - /usr/bin/csup -z -r 3 -g -L 1 -h cvsup5.freebsd.org /tinderbox/RELENG_8/mips/mips/supfile
TB --- 2010-09-28 19:50:26 - building world
TB --- 2010-09-28 19:50:26 - MAKEOBJDIRPREFIX=/obj
TB --- 2010-09-28 19:50:26 - PATH=/usr/bin:/usr/sbin:/bin:/sbin
TB --- 2010-09-28 19:50:26 - TARGET=mips
TB --- 2010-09-28 19:50:26 - TARGET_ARCH=mips
TB --- 2010-09-28 19:50:26 - TZ=UTC
TB --- 2010-09-28 19:50:26 - __MAKE_CONF=/dev/null
TB --- 2010-09-28 19:50:26 - cd /src
TB --- 2010-09-28 19:50:26 - /usr/bin/make -B buildworld
World build started on Tue Sep 28 19:50:28 UTC 2010
Rebuilding the temporary build tree
stage 1.1: legacy release compatibility shims
stage 1.2: bootstrap tools
stage 2.1: cleaning up the object tree
stage 2.2: rebuilding the object tree
stage 2.3: build tools
stage 3: cross tools
stage 4.1: building includes
stage 4.2: building libraries
stage 4.3: make dependencies
stage 4.4: building everything
[...]
/obj/mips/src/tmp/usr/bin/ld: BFD 2.15 [FreeBSD] 2004-05-23 assertion fail /src/gnu/usr.bin/binutils/libbfd/../../../../contrib/binutils/bfd/elfxx-mips.c:1899
/obj/mips/src/tmp/usr/bin/ld: BFD 2.15 [FreeBSD] 2004-05-23 assertion fail /src/gnu/usr.bin/binutils/libbfd/../../../../contrib/binutils/bfd/elfxx-mips.c:1902
/obj/mips/src/tmp/usr/bin/ld: BFD 2.15 [FreeBSD] 2004-05-23 assertion fail /src/gnu/usr.bin/binutils/libbfd/../../../../contrib/binutils/bfd/elfxx-mips.c:1899
/obj/mips/src/tmp/usr/bin/ld: BFD 2.15 [FreeBSD] 2004-05-23 assertion fail /src/gnu/usr.bin/binutils/libbfd/../../../../contrib/binutils/bfd/elfxx-mips.c:1902
/obj/mips/src/tmp/usr/bin/ld: BFD 2.15 [FreeBSD] 2004-05-23 assertion fail /src/gnu/usr.bin/binutils/libbfd/../../../../contrib/binutils/bfd/elfxx-mips.c:1899
/obj/mips/src/tmp/usr/bin/ld: BFD 2.15 [FreeBSD] 2004-05-23 assertion fail /src/gnu/usr.bin/binutils/libbfd/../../../../contrib/binutils/bfd/elfxx-mips.c:1902
/obj/mips/src/tmp/usr/bin/ld: BFD 2.15 [FreeBSD] 2004-05-23 assertion fail /src/gnu/usr.bin/binutils/libbfd/../../../../contrib/binutils/bfd/elfxx-mips.c:1899
/obj/mips/src/tmp/usr/bin/ld: BFD 2.15 [FreeBSD] 2004-05-23 assertion fail /src/gnu/usr.bin/binutils/libbfd/../../../../contrib/binutils/bfd/elfxx-mips.c:1902
*** Error code 1
Stop in /src/usr.bin/tftp.
*** Error code 1
Stop in /src/usr.bin.
*** Error code 1
Stop in /src.
*** Error code 1
Stop in /src.
*** Error code 1
Stop in /src.
TB --- 2010-09-28 23:36:45 - WARNING: /usr/bin/make returned exit code 1
TB --- 2010-09-28 23:36:45 - ERROR: failed to build world
TB --- 2010-09-28 23:36:45 - 2064.56 user 7803.58 system 13940.05 real
http://tinderbox.freebsd.org/tinderbox-releng_8-RELENG_8-mips-mips.full
Re: CPU time accounting broken on 8-STABLE machine after a few hours of uptime
On 28 Sep, Jeremy Chadwick wrote: Still speaking purely about ntpd: The above doesn't indicate a single problem. The deltas shown in both delay, offset, and jitter are all 100% legitimate. A dd (to induce more interrupt use) isn't going to exacerbate the problem (depending on your system configuration, IRQ setup, local APIC, etc.). How about writing a small shell script that runs every minute in a cronjob that does vmstat -i /some/file.log? Then when you see calcru messages, look around the time frame where vmstat -i was run. Look for high interrupt rates, aside from those associated with cpuX devices. Looking at the timestamps of things and comparing to my logs, I discovered that the last instance of ntp instability happened when I was running make index in /usr/ports. I tried it again with entertaining results. After a while, the machine became unresponsive. I was logged in over ssh and it stopped echoing keystrokes. In parallel I was running a script that echoed the date, the results of vmstat -i, and the results of ntpq -c pe. The latter showed jitter and offset going insane. Eventually make index finished and the machine was responsive again, but the time was way off and ntpd croaked because the necessary time correction was too large. Nothing else anomalous showed up in the logs. Hmn, about half an hour after ntpd died I started my CPU time accounting test and two minutes into that test I got a spew of calcru messages ... 
Tue Sep 28 14:52:27 PDT 2010
interrupt                          total       rate
irq0: clk                       64077827        999
irq1: atkbd0                          26          0
irq8: rtc                        8199966        127
irq9: acpi0                           19          0
irq10: ohci0 ehci1+             10356112        161
irq11: fwohci0 ahc+               132133          2
irq12: psm0                           27          0
irq14: ata0                        96064          1
irq15: nfe0 ata1                   23350          0
Total                           82885524       1293
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*gw.catspoiler.o .GPS.           1 u  137  128  377    0.195    0.111   0.030

Tue Sep 28 14:53:27 PDT 2010
interrupt                          total       rate
irq0: clk                       64137854        999
irq1: atkbd0                          26          0
irq8: rtc                        8207648        127
irq9: acpi0                           19          0
irq10: ohci0 ehci1+             10360184        161
irq11: fwohci0 ahc+               132133          2
irq12: psm0                           27          0
irq14: ata0                        96154          1
irq15: nfe0 ata1                   23379          0
Total                           82957424       1293
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*gw.catspoiler.o .GPS.           1 u   56  128  377    0.195    0.111  853895.

Tue Sep 28 14:54:27 PDT 2010
interrupt                          total       rate
irq0: clk                       64197881        999
irq1: atkbd0                          26          0
irq8: rtc                        8215329        127
irq9: acpi0                           21          0
irq10: ohci0 ehci1+             10360777        161
irq11: fwohci0 ahc+               132133          2
irq12: psm0                           27          0
irq14: ata0                        96244          1
irq15: nfe0 ata1                   23405          0
Total                           83025843       1293
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*gw.catspoiler.o .GPS.           1 u  116  128  377    0.195    0.111  853895.

Tue Sep 28 14:55:27 PDT 2010
interrupt                          total       rate
irq0: clk                       64257907        999
irq1: atkbd0                          26          0
irq8: rtc                        8223011        127
irq9: acpi0                           21          0
irq10: ohci0 ehci1+             10360836        161
irq11: fwohci0 ahc+               132133          2
irq12: psm0                           27          0
irq14: ata0                        96334          1
irq15: nfe0 ata1                   23424          0
Total                           83093719       1292
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
 gw.catspoiler.o .GPS.           1 u   48  128  377    0.197  2259195  2091608

Tue Sep 28 14:56:27 PDT 2010
interrupt                          total       rate
irq0: clk                       64317933        999
irq1: atkbd0
Re: Still getting kmem exhausted panic
On Tue, Sep 28, 2010 at 3:22 PM, Andriy Gapon a...@icyb.net.ua wrote: BTW, have you seen my posts about UMA and ZFS on hackers@ ? I found it advantageous to use UMA for ZFS I/O buffers, but only after reducing size of per-CPU caches for the zones with large-sized items. I further modified the code in my local tree to completely disable per-CPU caches for items of 32KB and larger. Do you have an updated patch disabling per-cpu caches for large items? I've just rebuilt FreeBSD-8 with your uma-2.diff (it needed r209050 from -head to compile) and so far things look good. I'll re-enable UMA for ZFS and see how it flies in a couple of days. --Artem
Re: cpu timer issues
Jeremy,

Thanks for having a look. Nothing in loader.conf.

# cat /etc/sysctl.conf
# Do not send RSTs for packets to closed ports
net.inet.tcp.blackhole=2
# Do not send ICMP port unreach messages for closed ports
net.inet.udp.blackhole=1
# Generate random IP_IDs
net.inet.ip.random_id=1
# Breaks RFC1379, but nobody uses it anyway
net.inet.tcp.drop_synfin=1
net.inet.ip.redirect=1
net.inet.tcp.syncookies=1
net.inet.tcp.recvspace=65228
net.inet.tcp.sendspace=65228
# fastforwarding - see http://lists.freebsd.org/pipermail/freebsd-net/2004-January/002534.html
net.inet.ip.fastforwarding=1
net.inet.tcp.delayed_ack=0
net.inet.udp.maxdgram=57344
kern.rndtest.verbose=0
net.link.bridge.pfil_onlyip=0
net.link.tap.user_open=1
# The system will attempt to calculate the bandwidth delay product for each
# connection and limit the amount of data queued to the network to just the
# amount required to maintain optimum throughput.
net.inet.tcp.inflight.enable=1
net.inet.ip.portrange.first=1024
net.inet.ip.intr_queue_maxlen=1000
net.link.bridge.pfil_bridge=0
# Disable TCP extended debugging
net.inet.tcp.log_debug=0
# Set a reasonable ICMP limit
net.inet.icmp.icmplim=500
# TSO causes problems with em(4) and reply-to, and isn't of much benefit in a
# firewall, disable.
net.inet.tcp.tso=0

# kenv | grep smbios
smbios.bios.reldate=12/19/2008
smbios.bios.vendor=Phoenix Technologies LTD
smbios.bios.version=1.2a
smbios.chassis.maker=Supermicro
smbios.chassis.serial=0123456789
smbios.chassis.tag=
smbios.chassis.version=0123456789
smbios.planar.maker=Supermicro
smbios.planar.product=X7SBi-LN4
smbios.planar.serial=0123456789
smbios.planar.version=PCB Version
smbios.socket.enabled=1
smbios.socket.populated=1
smbios.system.maker=Supermicro
smbios.system.product=X7SBi-LN4
smbios.system.serial=0123456789
smbios.system.uuid=53d1a494-d663-a0e7-890b-8a0f00f08a0f
smbios.system.version=0123456789

# sysctl kern.timecounter
kern.timecounter.tick: 1
kern.timecounter.choice: TSC(-100) i8254(0) dummy(-100)
kern.timecounter.hardware: i8254
kern.timecounter.stepwarnings: 0
kern.timecounter.tc.i8254.mask: 65535
kern.timecounter.tc.i8254.counter: 27546
kern.timecounter.tc.i8254.frequency: 1193182
kern.timecounter.tc.i8254.quality: 0
kern.timecounter.tc.TSC.mask: 4294967295
kern.timecounter.tc.TSC.counter: 1322201372
kern.timecounter.tc.TSC.frequency: 2926018304
kern.timecounter.tc.TSC.quality: -100
kern.timecounter.smp_tsc: 0
kern.timecounter.invariant_tsc: 0

Thanks,
Jurgen

On 28/09/10 7:30 PM, Jeremy Chadwick wrote:
> Can you provide any tuning you do in loader.conf or sysctl.conf, as well as
> your kernel configuration?

--
ish  http://www.ish.com.au
Level 1, 30 Wilson Street Newtown 2042 Australia
phone +61 2 9550 5001  fax +61 2 9550 4001
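For reference, the active timecounter can be pinned from /etc/sysctl.conf. On this box that is moot, since TSC is quality -100 (non-invariant and not SMP-safe per the sysctls above) and i8254 is already the only sane choice, but the mechanism looks like this (illustrative fragment, not a suggested fix):

```
# /etc/sysctl.conf fragment: pin the active timecounter explicitly
# (i8254 is already the default here; TSC is quality -100)
kern.timecounter.hardware=i8254
```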
Re: cpu timer issues
Andriy,

You can find everything you are after here: http://pastebin.com/WH4V2W0F

Thanks,
Jurgen

On 28/09/10 8:07 PM, Andriy Gapon wrote:
> on 28/09/2010 10:54 Jurgen Weber said the following:
>> # dmesg | grep Timecounter
>> Timecounter "i8254" frequency 1193182 Hz quality 0
>> Timecounters tick every 1.000 msec
>> # sysctl kern.timecounter.hardware
>> kern.timecounter.hardware: i8254
>>
>> Only have one timer to choose from.
>
> Can you provide a little bit more hard data than the above? Specifically,
> the following sysctls: kern.timecounter, dev.cpu. Output of vmstat -i.
> _Verbose_ boot dmesg. Please do not disable ACPI when taking this data.
> Preferably, upload it somewhere and post a link to it.
Re: fetch: Non-recoverable resolver failure
On Tue, Sep 28, 2010 at 08:12:00PM +0200, Miroslav Lachman wrote:
> The exact lines from crontab are:
>
> */5 * * * * fetch -qo /dev/null "https://hiden.example.com/cron/fiveminutes";
> */5 * * * * fetch -qo /dev/null "http://another.example.com/wd.php?hash=cslhakjs87LJ3rysalj79";

In addition to anything else, I suspect the question mark in double quotes might cause some shell-related interpretation; perhaps single quotes will be safer...

--
Brian Reichert  reich...@numachi.com
55 Crystal Ave. #286, Derry NH 03038-1725 USA
BSD admin/developer at large
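If quoting is indeed the culprit, the single-quoted form Brian suggests would look like this (illustrative crontab fragment reusing the second URL; note that crontab(5) also treats a literal % specially, though none appears in these URLs):

```
*/5 * * * * fetch -qo /dev/null 'http://another.example.com/wd.php?hash=cslhakjs87LJ3rysalj79'
```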
Re: cpu timer issues
Interesting, using systat everything looks fine. The interrupts hang around 2000.

Thanks,
Jurgen

On 28/09/10 8:33 PM, borislav nikolov wrote:
> Hello,
>
> vmstat -i calculates the interrupt rate as interrupt count / uptime, and
> the interrupt count is a 32-bit integer. With high values of kern.hz it
> will overflow in a few days (with kern.hz=4000 it will happen every 12 days
> or so). If that is the case, use "systat -vmstat 1" to get an accurate
> interrupt rate.
>
> Just FYI, because I was confused once and it scared me a bit, and I started
> changing counters until I noticed this.
>
> p.s. please forgive my poor English
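The overflow arithmetic is easy to verify: a 32-bit counter incremented hz times per second wraps after 2^32 / hz seconds. A quick sketch (plain Python, nothing FreeBSD-specific):

```python
# Days until a 32-bit interrupt counter wraps at a given tick rate (hz).
def days_to_overflow(hz: int) -> float:
    return 2**32 / hz / 86400  # 86400 seconds per day

for hz in (100, 1000, 2000, 4000):
    print(f"kern.hz={hz}: wraps after ~{days_to_overflow(hz):.1f} days")
# kern.hz=4000 gives ~12.4 days, matching the "every 12 days or so" above
```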
Re: CPU time accounting broken on 8-STABLE machine after a few hours of uptime
On 28 Sep, Don Lewis wrote:
> Looking at the timestamps of things and comparing to my logs, I discovered
> that the last instance of ntp instability happened when I was running
> "make index" in /usr/ports.

I tried it again, with entertaining results. After a while the machine became unresponsive; I was logged in over ssh and it stopped echoing keystrokes. In parallel I was running a script that echoed the date, the results of vmstat -i, and the results of ntpq -c pe. The latter showed jitter and offset going insane. Eventually "make index" finished and the machine was responsive again, but the time was way off and ntpd croaked because the necessary time correction was too large. Nothing else anomalous showed up in the logs.

Hmm, about half an hour after ntpd died I started my CPU time accounting test, and two minutes into that test I got a spew of calcru messages ...

I tried this experiment again using a kernel with WITNESS and DEBUG_VFS_LOCKS compiled in, pinging this machine from another. Things look normal for a while, then the ping times get huge for a while and then recover.
64 bytes from 192.168.101.3: icmp_seq=1169 ttl=64 time=0.135 ms
64 bytes from 192.168.101.3: icmp_seq=1170 ttl=64 time=0.141 ms
64 bytes from 192.168.101.3: icmp_seq=1171 ttl=64 time=0.130 ms
64 bytes from 192.168.101.3: icmp_seq=1172 ttl=64 time=0.131 ms
64 bytes from 192.168.101.3: icmp_seq=1173 ttl=64 time=0.128 ms
64 bytes from 192.168.101.3: icmp_seq=1174 ttl=64 time=38232.140 ms
64 bytes from 192.168.101.3: icmp_seq=1175 ttl=64 time=37231.309 ms
64 bytes from 192.168.101.3: icmp_seq=1176 ttl=64 time=36230.470 ms
64 bytes from 192.168.101.3: icmp_seq=1177 ttl=64 time=35229.632 ms
64 bytes from 192.168.101.3: icmp_seq=1178 ttl=64 time=34228.791 ms
64 bytes from 192.168.101.3: icmp_seq=1179 ttl=64 time=33227.953 ms
64 bytes from 192.168.101.3: icmp_seq=1180 ttl=64 time=32227.091 ms
64 bytes from 192.168.101.3: icmp_seq=1181 ttl=64 time=31226.262 ms
64 bytes from 192.168.101.3: icmp_seq=1182 ttl=64 time=30225.425 ms
64 bytes from 192.168.101.3: icmp_seq=1183 ttl=64 time=29224.597 ms
64 bytes from 192.168.101.3: icmp_seq=1184 ttl=64 time=28223.757 ms
64 bytes from 192.168.101.3: icmp_seq=1185 ttl=64 time=27222.918 ms
64 bytes from 192.168.101.3: icmp_seq=1186 ttl=64 time=26222.086 ms
64 bytes from 192.168.101.3: icmp_seq=1187 ttl=64 time=25221.164 ms
64 bytes from 192.168.101.3: icmp_seq=1188 ttl=64 time=24220.407 ms
64 bytes from 192.168.101.3: icmp_seq=1189 ttl=64 time=23219.575 ms
64 bytes from 192.168.101.3: icmp_seq=1190 ttl=64 time=22218.737 ms
64 bytes from 192.168.101.3: icmp_seq=1191 ttl=64 time=21217.905 ms
64 bytes from 192.168.101.3: icmp_seq=1192 ttl=64 time=20217.066 ms
64 bytes from 192.168.101.3: icmp_seq=1193 ttl=64 time=19216.228 ms
64 bytes from 192.168.101.3: icmp_seq=1194 ttl=64 time=18215.333 ms
64 bytes from 192.168.101.3: icmp_seq=1195 ttl=64 time=17214.503 ms
64 bytes from 192.168.101.3: icmp_seq=1196 ttl=64 time=16213.720 ms
64 bytes from 192.168.101.3: icmp_seq=1197 ttl=64 time=15210.912 ms
64 bytes from 192.168.101.3: icmp_seq=1198 ttl=64 time=14210.044 ms
64 bytes from 192.168.101.3: icmp_seq=1199 ttl=64 time=13209.194 ms
64 bytes from 192.168.101.3: icmp_seq=1200 ttl=64 time=12208.376 ms
64 bytes from 192.168.101.3: icmp_seq=1201 ttl=64 time=11207.536 ms
64 bytes from 192.168.101.3: icmp_seq=1202 ttl=64 time=10206.694 ms
64 bytes from 192.168.101.3: icmp_seq=1203 ttl=64 time=9205.816 ms
64 bytes from 192.168.101.3: icmp_seq=1204 ttl=64 time=8205.014 ms
64 bytes from 192.168.101.3: icmp_seq=1205 ttl=64 time=7204.186 ms
64 bytes from 192.168.101.3: icmp_seq=1206 ttl=64 time=6203.294 ms
64 bytes from 192.168.101.3: icmp_seq=1207 ttl=64 time=5202.510 ms
64 bytes from 192.168.101.3: icmp_seq=1208 ttl=64 time=4201.677 ms
64 bytes from 192.168.101.3: icmp_seq=1209 ttl=64 time=3200.851 ms
64 bytes from 192.168.101.3: icmp_seq=1210 ttl=64 time=2200.013 ms
64 bytes from 192.168.101.3: icmp_seq=1211 ttl=64 time=1199.100 ms
64 bytes from 192.168.101.3: icmp_seq=1212 ttl=64 time=198.331 ms
64 bytes from 192.168.101.3: icmp_seq=1213 ttl=64 time=0.129 ms
64 bytes from 192.168.101.3: icmp_seq=1214 ttl=64 time=58223.470 ms
64 bytes from 192.168.101.3: icmp_seq=1215 ttl=64 time=57222.637 ms
64 bytes from 192.168.101.3: icmp_seq=1216 ttl=64 time=56221.800 ms
64 bytes from 192.168.101.3: icmp_seq=1217 ttl=64 time=55220.960 ms
64 bytes from 192.168.101.3: icmp_seq=1218 ttl=64 time=54220.116 ms
64 bytes from 192.168.101.3: icmp_seq=1219 ttl=64 time=53219.282 ms
64 bytes from 192.168.101.3: icmp_seq=1220 ttl=64 time=52218.444 ms
64 bytes from 192.168.101.3: icmp_seq=1221 ttl=64 time=51217.618 ms
64 bytes from 192.168.101.3: icmp_seq=1222 ttl=64 time=50216.778 ms
64 bytes from 192.168.101.3: icmp_seq=1223 ttl=64 time=49215.932 ms
64 bytes from 192.168.101.3: icmp_seq=1224 ttl=64 time=48215.095 ms
64 bytes from 192.168.101.3: icmp_seq=1225 ttl=64 time=47214.262 ms
64 bytes from 192.168.101.3: icmp_seq=1226 [output truncated]
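One detail worth noting in the trace: each successive RTT is almost exactly 1000 ms smaller than the previous one, which is consistent with the one-per-second echoes being queued during a stall and all answered together once the machine recovered. The spacing is easy to check (plain Python; values copied from the first stall above):

```python
# Differences between consecutive RTTs from the first stall (milliseconds).
times = [38232.140, 37231.309, 36230.470, 35229.632, 34228.791]
deltas = [round(a - b, 3) for a, b in zip(times, times[1:])]
print(deltas)  # each delta is ~1000.8 ms: the 1 s ping interval plus drift
```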