On 28.01.2025 23:50, Ilia Evdokimov wrote:
If anyone has the capability to run this benchmark on machines with
more CPUs or with different queries, it would be nice. I’d appreciate
any suggestions or feedback.
I wanted to share some additional benchmarks I ran on an r8g.48xlarge
(192 vCPUs, 1,536 GiB of memory) configured with 16GB of
shared_buffers. I have also attached the benchmark.sh script used to
generate the output.
The benchmark runs the select-only pgbench workload, so we have a
single heavily contended pg_stat_statements entry, which is the
worst case.
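For reference, here is a minimal sketch of the per-run driver; the
attached benchmark.sh may differ, and it assumes the GUC from the patch
is named pg_stat_statements.sample_rate and can be changed with a
reload:

# sweep the sampling rates tested below (assumed GUC name from the patch)
for rate in 1 0.75 0.5 0.25 0; do
    psql -c "ALTER SYSTEM SET pg_stat_statements.sample_rate = $rate"
    psql -c "SELECT pg_reload_conf()"
    psql -c "SELECT pg_stat_statements_reset()"
    pgbench -c192 -j20 -S -Mprepared -T120 --progress 10
done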
The test shows that the spinlock (SpinDelay waits)
becomes an issue at high connection counts and will
get worse on larger machines. Going from sample_rate = 1
to .75 nearly doubles throughput (~484k to ~909k tps) and cuts
the SpinDelay samples by more than half; but this is on a single
contended entry, and most workloads will likely not see this type
of improvement. As expected, I also could not observe
this kind of difference on smaller machines (i.e. 32 vCPUs).
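The wait counts below presumably come from polling pg_stat_activity
while pgbench runs; a minimal sampler along those lines (an assumption
on my part, the attached benchmark.sh may collect them differently):

# sample active backends' wait events ~10x/second over the 120s run,
# mapping a NULL wait_event to "CPU", then aggregate the samples
for i in $(seq 1 1200); do
    psql -Atc "SELECT coalesce(wait_event, 'CPU')
               FROM pg_stat_activity
               WHERE state = 'active' AND pid <> pg_backend_pid()"
    sleep 0.1
done | sort | uniq -c | sort -rn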
## init
pgbench -i -s500
### 192 connections
pgbench -c192 -j20 -S -Mprepared -T120 --progress 10
sample_rate = 1
tps = 484338.769799 (without initial connection time)
waits
-----
11107 SpinDelay
9568 CPU
929 ClientRead
13 DataFileRead
3 BufferMapping
sample_rate = .75
tps = 909547.562124 (without initial connection time)
waits
-----
12079 CPU
4781 SpinDelay
2100 ClientRead
sample_rate = .5
tps = 1028594.555273 (without initial connection time)
waits
-----
13253 CPU
3378 ClientRead
174 SpinDelay
sample_rate = .25
tps = 1019507.126313 (without initial connection time)
waits
-----
13397 CPU
3423 ClientRead
sample_rate = 0
tps = 1015425.288538 (without initial connection time)
waits
-----
13106 CPU
3502 ClientRead
### 32 connections
pgbench -c32 -j20 -S -Mprepared -T120 --progress 10
sample_rate = 1
tps = 620667.049565 (without initial connection time)
waits
-----
1782 CPU
560 ClientRead
sample_rate = .75
tps = 620663.131347 (without initial connection time)
waits
-----
1736 CPU
554 ClientRead
sample_rate = .5
tps = 624094.688239 (without initial connection time)
waits
-----
1741 CPU
648 ClientRead
sample_rate = .25
tps = 628638.538204 (without initial connection time)
waits
-----
1702 CPU
576 ClientRead
sample_rate = 0
tps = 630483.464912 (without initial connection time)
waits
-----
1638 CPU
574 ClientRead
Regards,
Sami
Thank you so much for benchmarking this on a pretty large machine with
a large number of CPUs. The results look fantastic, and I truly
appreciate your effort.
BTW, I realized that the 'sampling' test needs to be added not only to
the Makefile but also to meson.build. I've included that in the v14
patch.
--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.
In my opinion, if we can't observe a spinlock bottleneck on 32 CPUs, we
should determine the CPU count at which it appears. That would help us
understand the scale of the problem. Does this make sense, or are there
really no real workloads where the same query runs on more than 32 CPUs,
and we've been trying to solve a non-existent problem?
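One way to pin that down would be to keep sample_rate = 1, sweep the
client count, and watch when SpinDelay waits first show up. A rough
sketch reusing Sami's commands (the particular client counts are
arbitrary; run the wait-event sampler from above alongside each run):

# raise the client count until SpinDelay samples start to appear
for c in 32 48 64 96 128 160 192; do
    echo "clients=$c"
    pgbench -c$c -j20 -S -Mprepared -T120 | grep '^tps'
done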
--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.