Hi,

On 2023-11-15 11:33:44 +0300, Anton A. Melnikov wrote:
> The configure options and test scripts on my pc and server were the same:
> export CFLAGS="-O2"
> ./configure --enable-debug --with-perl --with-icu --enable-depend --enable-tap-tests
> #reinstall
> #reinitdb
> #create database bench
> for ((i=0; i<100; i++)); do pgbench -U postgres -i -s8 bench > /dev/null 2>&1;
> psql -U postgres -d bench -c "checkpoint"; RES=$(pgbench -U postgres -c6 -T20 -j6 bench;

Even with scale 8 you're likely significantly impacted by contention. And
obviously WAL write latency. See below for why that matters.



> I can't understand why i get the opposite results on my pc and on the
> server. It is clear that the absolute TPS values will be different for
> various configurations. This is normal. But differences? Is it unlikely
> that some kind of reference configuration is needed to accurately measure
> the difference in performance. Probably something wrong with my pc, but now
> i can not figure out what's wrong.

One very common reason for symptoms like this is power-saving measures by the
CPU. In workloads where the CPU is not meaningfully utilized, the CPU will go
into a powersaving mode - which can badly affect workloads that are latency
sensitive. Both because initially the CPU will just work at a lower frequency
and because it takes time to shift back to a higher frequency.
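A quick way to see whether this is happening on a given box is to read the
cpufreq state straight out of sysfs - a sketch, assuming a Linux system with
the cpufreq driver loaded (paths can vary by driver):

```shell
#!/bin/sh
# Print the scaling governor and current frequency for each CPU that
# exposes cpufreq. Prints nothing if the cpufreq driver isn't loaded.
for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
    gov="$cpu/cpufreq/scaling_governor"
    cur="$cpu/cpufreq/scaling_cur_freq"
    [ -r "$gov" ] || continue
    printf '%s: governor=%s' "${cpu##*/}" "$(cat "$gov")"
    [ -r "$cur" ] && printf ' cur_freq=%s kHz' "$(cat "$cur")"
    printf '\n'
done
```

If this shows a governor like "powersave" or "schedutil" and frequencies well
below the rated clock on an otherwise idle machine, clock ramp-up latency is
in play.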


Here's an example:
I bound the server and psql to the same CPU core (nothing else is allowed to
use that core) and ran the following:

\o /dev/null
SELECT 1; SELECT 1; SELECT 1; SELECT pg_sleep(0.1); SELECT 1; SELECT 1; SELECT 1;
Time: 0.181 ms
Time: 0.085 ms
Time: 0.071 ms
Time: 100.474 ms
Time: 0.153 ms
Time: 0.077 ms
Time: 0.069 ms

You can see how the first query timing was slower, the next two were faster,
and then after the pg_sleep() it's slow again.


# tell the CPU to optimize for performance not power
cpupower frequency-set --governor performance

# disable going to lower power states
cpupower idle-set -D0

# disable turbo mode for consistent performance
echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo

Now the timings are:
Time: 0.038 ms
Time: 0.028 ms
Time: 0.025 ms
Time: 1000.262 ms (00:01.000)
Time: 0.027 ms
Time: 0.024 ms
Time: 0.023 ms

Look, fast and reasonably consistent timings.

Switching back:
Time: 0.155 ms
Time: 0.123 ms
Time: 0.074 ms
Time: 1001.235 ms (00:01.001)
Time: 0.120 ms
Time: 0.077 ms
Time: 0.068 ms


The perverse thing is that this often means that *reducing* the number of
instructions executed yields *worse* behaviour under non-sustained load,
because from the CPU's point of view there is less need to increase the clock
speed.


To show how much of a difference that can make, I ran pgbench with a single
client on one core, and the server on another (so the CPU is idle in between):
numactl --physcpubind 11 pgbench -n -M prepared -P1 -S -c 1 -T10

With power optimized configuration:
latency average = 0.035 ms
latency stddev = 0.002 ms
initial connection time = 5.255 ms
tps = 28434.334672 (without initial connection time)

With performance optimized configuration:
latency average = 0.025 ms
latency stddev = 0.001 ms
initial connection time = 3.544 ms
tps = 40079.995935 (without initial connection time)

That's a whopping 1.4x in throughput!


Now, the same thing, except that I used a custom workload where pgbench
transactions are executed in a pipelined fashion, 100 read-only transactions
in one script execution:
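The exact script isn't shown here, but a pipelined read-only workload of that
shape could look roughly like this - a sketch using pgbench's \startpipeline
and \endpipeline metacommands (available since PostgreSQL 14, and requiring
-M prepared or -M extended); the file name pipeline.sql is just illustrative:

```
\set aid random(1, 100000 * :scale)
\startpipeline
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
-- ...the same SELECT repeated until 100 transactions run per script execution
\endpipeline
```

Run with something like:
numactl --physcpubind 11 pgbench -n -M prepared -P1 -f pipeline.sql -c 1 -T10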

With power optimized configuration:
latency average = 1.055 ms
latency stddev = 0.125 ms
initial connection time = 6.915 ms
tps = 947.985286 (without initial connection time)

(this means we actually executed 94798.5286 read-only pgbench transactions/s)

With performance optimized configuration:
latency average = 1.376 ms
latency stddev = 0.083 ms
initial connection time = 3.759 ms
tps = 726.849018 (without initial connection time)

Suddenly the super-duper performance optimized settings are worse (but note
that stddev is down)! I suspect the problem is that now because we disabled
idle states, the cpu ends up clocking *lower*, due to power usage.

If I just change the relevant *cores* to the performance optimized
configuration:

cpupower -c 10,11 idle-set -D0; cpupower -c 10,11 frequency-set --governor performance

latency average = 0.940 ms
latency stddev = 0.061 ms
initial connection time = 3.311 ms
tps = 1063.719116 (without initial connection time)

It wins again.


Now, realistically you'd never use -D0 (i.e. disabling all idle states, not
just the deeper ones) - the power differential is quite big and, as shown
here, it can hurt performance as well.
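To pick a sensible latency cutoff for -D, it helps to look at the exit
latencies of the idle states the hardware actually advertises - a sketch
reading the Linux cpuidle sysfs tree (cpupower idle-info reports the same
information):

```shell
#!/bin/sh
# List each idle state of cpu0 with its exit latency (in microseconds) and
# whether it is currently disabled. Assumes Linux with the cpuidle
# framework; prints nothing if /sys/.../cpuidle is absent.
for st in /sys/devices/system/cpu/cpu0/cpuidle/state[0-9]*; do
    [ -d "$st" ] || continue
    printf '%s: latency=%sus disabled=%s\n' \
        "$(cat "$st/name")" "$(cat "$st/latency")" "$(cat "$st/disable")"
done
```

A -D cutoff of 5, as used below, keeps the shallow states (typically latency
<= 5us) enabled while disabling the deep, slow-to-exit ones.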

On an idle system, looking at the cpu power usage with:
powerstat -D -R 5 1000

  Time    User  Nice   Sys  Idle    IO  Run Ctxt/s  IRQ/s Fork Exec Exit  Watts  pkg-0   dram  pkg-1
09:45:03   0.6   0.0   0.2  99.2   0.0    1   2861   2823    0    0    0  46.84  24.82   3.68  18.33
09:45:08   1.0   0.0   0.1  99.0   0.0    2   2565   1602    0    0    0  54.78  28.92   4.74  21.12
09:45:13   0.8   0.0   0.3  98.9   0.0    1   3769   2322    0    0    0  55.65  29.43   4.72  21.50
09:45:18   0.8   0.0   0.1  99.1   0.0    3   2513   1479    0    0    0  51.95  27.47   4.23  20.24
09:45:23   0.6   0.0   0.1  99.3   0.0    1   2282   1448    0    0    0  49.44  26.12   3.91  19.41
09:45:28   0.8   0.0   0.1  99.1   0.0    2   2422   1465    0    0    0  51.79  27.33   4.27  20.19
09:45:33   0.9   0.0   0.1  99.0   0.0    2   2358   1566    0    0    0  55.05  29.03   4.73  21.29
09:45:38   0.6   0.0   0.1  99.4   0.0    1   2976   4207    0    0    0  54.38  29.08   4.02  21.28
09:45:43   1.2   0.0   0.2  98.6   0.0    2   3988  37670    0    0    0 125.51  64.15   7.97  53.39
09:45:48   0.6   0.0   0.2  99.2   0.0    2   3263  40000    0    0    0 126.31  63.84   7.97  54.50
09:45:53   0.4   0.0   0.0  99.6   0.0    1   2333  39716    0    0    0 125.94  63.64   7.90  54.40
09:45:58   0.3   0.0   0.0  99.6   0.0    1   2783  39795    0    0    0 125.93  63.58   7.90  54.44
09:46:03   0.6   0.0   0.2  99.2   0.0    1   3081  24910    0    0    0  93.55  47.69   6.10  39.75
09:46:08   0.5   0.0   0.1  99.4   0.0    1   2623   1356    0    0    0  42.65  22.59   3.18  16.89

(interesting columns are Watts, pkg-0, dram, pkg-1)

Initially this was with cpupower idle-set -E, then I ran cpupower idle-set
-D0, and then cpupower idle-set -E - you can easily see where.


If I use a more sensible -D5 system-wide and run the pgbench from earlier, I
get the performance *and* sensible power usage:

latency average = 0.946 ms
latency stddev = 0.063 ms
initial connection time = 3.482 ms
tps = 1057.413946 (without initial connection time)

09:50:36   0.4   0.0   0.1  99.5   0.0    1   2593   1206    0    0    0  42.63  22.47   3.18  16.97


To get back to your benchmark: Because you're measuring a highly contended
system, where most of the time postgres will just wait for
  a) row-level locks to be released
  b) WAL writes to finish (you're clearly not using a very fast disk)
the CPU has plenty of time to clock down.


Benchmarking sucks. Modern hardware realities suck.

Greetings,

Andres Freund

