On 28.01.2025 23:50, Ilia Evdokimov wrote:
> If anyone has the capability to run this benchmark on machines with more
> CPUs or with different queries, it would be nice. I’d appreciate any
> suggestions or feedback.
I wanted to share some additional benchmarks I ran as well
on a r8g.48xlarge (192 vCPUs, 1,536 GiB of memory) configured
with 16GB of shared_buffers. I also attached the benchmark.sh
script used to generate the output.
The benchmark runs the select-only pgbench workload, so we
have a single heavily contended entry, which is the worst case.
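
For reference, a minimal sketch of the kind of driver loop
benchmark.sh implements (the attached script is authoritative; the
GUC name pg_stat_statements.sample_rate is taken from the patch under
discussion, and the reset/reload steps here are my assumptions):

# sketch only -- assumes pg_stat_statements is in shared_preload_libraries
for rate in 1 0.75 0.5 0.25 0; do
    psql -c "ALTER SYSTEM SET pg_stat_statements.sample_rate = $rate"
    psql -c "SELECT pg_reload_conf()"
    psql -c "SELECT pg_stat_statements_reset()"
    pgbench -c192 -j20 -S -Mprepared -T120 --progress 10
done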

The test shows that the spinlock (SpinDelay waits) becomes a
bottleneck at high connection counts and will only get worse on
larger machines. Going from sample_rate = 1 to .75 shows a 60%
improvement, but that is on a single contended entry; most workloads
will likely not see an improvement of this size. As expected, I also
could not observe this kind of difference on smaller machines
(i.e., 32 vCPUs).
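
The "waits" listings below are wait-event counts. A sketch of one way
such counts can be gathered (this assumes periodic sampling of
pg_stat_activity, with a NULL wait_event counted as CPU; the attached
benchmark.sh has the exact method):

# hypothetical collection loop; PGBENCH_PID is the backgrounded pgbench
pgbench -c192 -j20 -S -Mprepared -T120 & PGBENCH_PID=$!
while kill -0 "$PGBENCH_PID" 2>/dev/null; do
    psql -Atqc "SELECT coalesce(wait_event, 'CPU')
                FROM pg_stat_activity
                WHERE backend_type = 'client backend'" >> waits.raw
    sleep 0.1
done
sort waits.raw | uniq -c | sort -rn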

## init
pgbench -i -s500

### 192 connections
pgbench -c192 -j20 -S -Mprepared -T120 --progress 10

sample_rate = 1
tps = 484338.769799 (without initial connection time)
waits
-----
   11107  SpinDelay
    9568  CPU
     929  ClientRead
      13  DataFileRead
       3  BufferMapping

sample_rate = .75
tps = 909547.562124 (without initial connection time)
waits
-----
   12079  CPU
    4781  SpinDelay
    2100  ClientRead

sample_rate = .5
tps = 1028594.555273 (without initial connection time)
waits
-----
   13253  CPU
    3378  ClientRead
     174  SpinDelay

sample_rate = .25
tps = 1019507.126313 (without initial connection time)
waits
-----
   13397  CPU
    3423  ClientRead

sample_rate = 0
tps = 1015425.288538 (without initial connection time)
waits
-----
   13106  CPU
    3502  ClientRead

### 32 connections
pgbench -c32 -j20 -S -Mprepared -T120 --progress 10

sample_rate = 1
tps = 620667.049565 (without initial connection time)
waits
-----
    1782  CPU
     560  ClientRead

sample_rate = .75
tps = 620663.131347 (without initial connection time)
waits
-----
    1736  CPU
     554  ClientRead

sample_rate = .5
tps = 624094.688239 (without initial connection time)
waits
-----
    1741  CPU
     648  ClientRead

sample_rate = .25
tps = 628638.538204 (without initial connection time)
waits
-----
    1702  CPU
     576  ClientRead

sample_rate = 0
tps = 630483.464912 (without initial connection time)
waits
-----
    1638  CPU
     574  ClientRead

Regards,

Sami


Thank you so much for benchmarking this on a machine with such a large number of CPUs. The results look fantastic, and I truly appreciate your effort.

BTW, I realized that the 'sampling' test needs to be added not only to the Makefile but also to meson.build. I've included that in the v14 patch.
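
For anyone following along, the change amounts to listing the new
test next to the existing ones in contrib/pg_stat_statements/meson.build,
roughly like this (sketch only; surrounding entries elided):

tests += {
  'name': 'pg_stat_statements',
  # ...
  'regress': {
    'sql': [
      # ... existing test files ...
      'sampling',
    ],
  },
}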

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.


In my opinion, if we can't observe the spinlock bottleneck on 32 CPUs, we should determine the CPU count at which it appears. That would help us understand the scale of the problem. Does this make sense, or are there really no real workloads where the same query runs on more than 32 CPUs, and we've been trying to solve a non-existent problem?
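
One way to probe for that threshold (a sketch, assuming a Linux host
where the server can be restarted per step and the postmaster's
children inherit its CPU affinity):

for cpus in 32 48 64 96 128 192; do
    taskset -c 0-$((cpus-1)) pg_ctl -D "$PGDATA" restart
    pgbench -c"$cpus" -j20 -S -Mprepared -T120
done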

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.


