Over in the "add timing of buffer I/O requests" thread I mentioned having a system where EXPLAIN ANALYZE of a modest COUNT(*) takes 10X as long as just executing the statement. Attached are a bit of SQL and a script that runs it multiple times, which demonstrate the problem on systems that have it. It assumes you've installed the Dell Store 2 database. I'd expect it to work with a simpler example too (just create something with 20,000 rows), but I haven't actually confirmed that yet.
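
If you don't have Dell Store 2 handy, a synthetic stand-in ought to behave the same way. This is an untested sketch; the table and column names just mimic the dellstore2 ones:

-- hypothetical stand-in for the dellstore2 customers table
create table customers as
    select i as customerid from generate_series(1, 20000) as i;
vacuum analyze customers;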

From what I've been able to measure so far, here's the state of things:

-If you have a system with a working TSC clock source (timing data is pulled right from the CPU), timing overhead is reasonable enough that you might turn it on even for things that happen frequently, such as what the buffer I/O timing patch enables.

-Some systems have tsc, but it doesn't work right in multi-core setups. Newer kernel versions know this and disable it, older ones let you pick tsc anyway and bad things occur.

-Using any of the other timing mechanisms--hpet, acpi_pm, and jiffies--has extremely high timing overhead.
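
As a quick cross-check that doesn't involve EXPLAIN at all, you can estimate the per-call cost of the clock from psql by comparing a query that forces one clock read per row against the same query without one. A minimal sketch, relying on clock_timestamp() reading the clock each time it's evaluated:

\timing
select count(*) from generate_series(1, 1000000);
select count(clock_timestamp()) from generate_series(1, 1000000);

The difference between the two, divided by a million, approximates the cost of one clock read with the current time source.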

This doesn't conflict with the comment we got from Martijn van Oosterhout in the other thread: "Learned the hard way while tracking clock-skew on a multicore system. The hpet may not be the fastest (that would be the cpu timer), but it's the fastest (IME) that gives guaranteed monotonic time." There's more background about TSC, HPET, and an interesting caveat about VMs I came across:

http://en.wikipedia.org/wiki/Time_Stamp_Counter
http://en.wikipedia.org/wiki/High_Precision_Event_Timer
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1007020

I'm starting to envision what a new documentation section discussing timer overhead would look like now, for Linux at least. And I'm guessing other people can simulate this problem even on hardware that doesn't have it in the default config, simply by manually switching to one of the slower timing methods at boot.

The test case is just doing this:

\timing
select count(*) from customers;
explain analyze select count(*) from customers;

Lots of times, with some patterns to try to avoid caching effects.
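
The attached script only runs the SQL file once per invocation; the repetition comes from calling it several times. A minimal driver loop along those lines, assuming the attachments at the end of this message are saved as explain-overhead.sql and run against the dellstore2 database:

for i in 1 2 3; do
    psql -d dellstore2 -e -f explain-overhead.sql | grep Time
done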

The system with the worst clock timing issues I've found, the one that was so obvious it kicked off this investigation, is my Thinkpad T60 laptop. The overhead is worst on the old 32-bit Ubuntu 9.04 installation I have there. Here's some info about the system, starting with the 4 clock sources it supports:

$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
hpet acpi_pm jiffies tsc
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
hpet
$ uname -a
Linux meddle 2.6.28-19-generic #61-Ubuntu SMP Wed May 26 23:35:15 UTC 2010 i686 GNU/Linux
$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=9.04
DISTRIB_CODENAME=jaunty
DISTRIB_DESCRIPTION="Ubuntu 9.04"
$ psql -c "select version()"
PostgreSQL 8.3.11 on i486-pc-linux-gnu, compiled by GCC gcc-4.3.real (Ubuntu 4.3.3-5ubuntu4) 4.3.3

You change the time source by adding an option like this to the kernel boot parameters:

clocksource=acpi_pm

Martijn said you can just cat a new clock source into the setting instead; I haven't evaluated that yet. I wanted to get my baselines done in a situation where the kernel boots using the one I wanted, which turned out to be useful for a number of reasons--boot time is one of the things that varied a lot when the timer was switched.
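
For reference, the runtime switch Martijn described would presumably look like this--again, untested on my end, and it needs root:

$ echo acpi_pm | sudo tee /sys/devices/system/clocksource/clocksource0/current_clocksource
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource
acpi_pm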

The hpet time source is the default on my T60 system, but there are 3 others available. Of those, it appears that only acpi_pm is a viable alternative. I gave up waiting for the system to boot after 15 minutes when using jiffies. I could boot with tsc, but everything was erratic; I could barely use the mouse, for example. tsc did have a much faster clock as measured by EXPLAIN overhead, though. Here are the results: two runs of the test program that I've de-interleaved, so there's a block of fast (count-only) runs, then a block of slow ones (with EXPLAIN ANALYZE):

Time Source:hpet

Time: 14.783 ms
Time: 9.021 ms
Time: 9.170 ms
Time: 8.634 ms
Time: 5.632 ms
Time: 5.520 ms
Time: 15.276 ms
Time: 9.477 ms
Time: 9.485 ms

Time: 70.720 ms
Time: 67.577 ms
Time: 57.959 ms
Time: 74.661 ms
Time: 73.557 ms
Time: 74.963 ms
Time: 87.898 ms
Time: 82.125 ms
Time: 81.443 ms

Time Source:acpi_pm

Time: 8.124 ms
Time: 5.020 ms
Time: 4.887 ms
Time: 8.125 ms
Time: 5.029 ms
Time: 4.881 ms

Time: 82.986 ms
Time: 82.366 ms
Time: 82.609 ms
Time: 83.089 ms
Time: 82.438 ms
Time: 82.539 ms

Time Source:tsc
Time: 8.371 ms
Time: 4.673 ms
Time: 4.901 ms
Time: 8.409 ms
Time: 4.943 ms
Time: 4.722 ms

Time: 16.436 ms
Time: 16.349 ms
Time: 16.139 ms
Time: 19.871 ms
Time: 17.175 ms
Time: 16.540 ms

There's the possible 10:1 slowdown I was talking about when using the default hpet clock, while the faster (but fundamentally unstable) tsc clock does better: only around a 4:1 slowdown there.
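
A back-of-the-envelope check on where that time goes, assuming I'm counting the instrumentation right: EXPLAIN ANALYZE brackets each fetch from a plan node with a pair of clock reads, so pulling 20,000 rows through the Seq Scan here means on the order of 40,000 gettimeofday() calls. That puts the per-read cost at roughly:

hpet: (75 ms - 9 ms) / 40,000 calls ~= 1.7 us
tsc:  (16.5 ms - 5 ms) / 40,000 calls ~= 0.3 us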

The first question I considered: is it the operating system or kernel causing this, rather than the hardware? I also have Debian Squeeze with a 64-bit 2.6.32 kernel installed on this laptop. It only has two time sources available, with hpet being the default:

$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
hpet acpi_pm
$ uname -a
Linux meddle 2.6.32-5-amd64 #1 SMP Thu Nov 3 03:41:26 UTC 2011 x86_64 GNU/Linux
$ psql -c "select version()"
                                               version
------------------------------------------------------------------------------------------------------
PostgreSQL 8.4.9 on x86_64-pc-linux-gnu, compiled by GCC gcc-4.4.real (Debian 4.4.5-8) 4.4.5, 64-bit

It's no coincidence that these are the only two clock sources that actually worked in the older kernel. jiffies is old stuff that's been on the deprecation path for a while, and the kernel is eliminating TSC with this message during boot:

[    0.168139] checking TSC synchronization [CPU#0 -> CPU#1]:
[    0.172001] Measured 617820 cycles TSC warp between CPUs, turning off TSC clock.
[    0.172001] Marking TSC unstable due to check_tsc_sync_source failed

So the fact that TSC can be a problem on some hardware is well known enough that the newer kernel tests for it, then rules TSC out as unusable when it fails that test.
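
If you want to see whether your own kernel made the same decision, grepping the boot log should turn up the relevant messages:

$ dmesg | grep -i tsc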

Here's the same timing results on this kernel:

Time Source:hpet
Time: 5.701 ms
Time: 3.581 ms
Time: 3.560 ms
Time: 5.696 ms
Time: 3.587 ms
Time: 3.342 ms

Time: 43.492 ms
Time: 43.227 ms
Time: 43.319 ms
Time: 43.503 ms
Time: 43.332 ms
Time: 43.244 ms

Time Source:acpi_pm
Time: 6.350 ms
Time: 3.792 ms
Time: 3.490 ms
Time: 5.806 ms
Time: 3.672 ms
Time: 3.538 ms

Time: 80.093 ms
Time: 79.875 ms
Time: 79.789 ms
Time: 80.033 ms
Time: 80.330 ms
Time: 80.062 ms

That's much better in every way on hpet. But the time ratio hasn't really improved; it's still at least 10:1. Unless there's some blacker magic in an even newer kernel, it seems nothing I can do to this laptop will make it run gettimeofday() quickly.

Let's move on to something that's actually usable as a server. My main desktop has an Intel i7-860. It has the same 64-bit Debian Squeeze setup as the second set of T60 results I just showed. The only thing different about the two systems is the hardware; the software is as close to identical as I could make it. (That's for my own general sanity, not something special I did for this test--all my systems have that same Debian package set.)

But there are different clock options available:

$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
tsc hpet acpi_pm
$ uname -a
Linux grace 2.6.32-5-amd64 #1 SMP Thu Nov 3 03:41:26 UTC 2011 x86_64 GNU/Linux
$ psql -c "select version()"
PostgreSQL 8.4.9 on x86_64-pc-linux-gnu, compiled by GCC gcc-4.4.real (Debian 4.4.5-8) 4.4.5, 64-bit

tsc is the default clock source here. A passing TSC synchronization test looks like this in the dmesg output:

[    0.000000] Fast TSC calibration using PIT
...
[    0.244640] checking TSC synchronization [CPU#0 -> CPU#1]: passed.
...
[    0.891992] checking TSC synchronization [CPU#0 -> CPU#7]: passed.
[    0.911993] Brought up 8 CPUs

And here's how it performs:

Time Source:tsc
Time: 2.915 ms
Time: 1.531 ms
Time: 1.500 ms
Time: 2.926 ms
Time: 1.518 ms
Time: 1.478 ms

Time: 2.945 ms
Time: 2.915 ms
Time: 2.882 ms
Time: 2.920 ms
Time: 2.862 ms
Time: 2.856 ms

That's more like it.

Next up I plan to survey a few AMD systems and a newer laptop, but I'm out of time for this tonight.

--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us

#!/bin/bash
# Report the active clock source, then run the test queries and keep
# just the timing lines from the output
echo -n "Time Source:"
cat /sys/devices/system/clocksource/clocksource0/current_clocksource
psql -d dellstore2 -e -f explain-overhead.sql | grep Time

-- explain-overhead.sql
\timing

-- Count Time
select count(*) from customers;
select count(*) from customers;
select count(*) from customers;

-- EXPLAIN ANALYZE Count Time
explain analyze select count(*) from customers;
explain analyze select count(*) from customers;
explain analyze select count(*) from customers;