I am seeking advice. For now I hope for a suggestion about changes from 17beta1 to 17beta2 that might cause the problem -- assuming there is a problem, and not a mistake in my testing.
One of the sysbench microbenchmarks that I run does a table scan with a WHERE clause that filters out all rows. The WHERE clause is there to reduce network IO. While running it on a server with 16 real cores, 12 concurrent queries and a cached database, the query takes ~5% more time on 17beta2 than on 17beta1 or 16.3. Alas, this is a Google Cloud server and perf doesn't work there. On the small servers I have at home I can reproduce the problem without concurrent queries, and there 17beta2 is 5% to 10% slower.

The SQL statement for the scan microbenchmark is (%s is the table name):

  SELECT * from %s WHERE LENGTH(c) < 0

I will call my small home servers SER4 and PN53. They are described here:
https://smalldatum.blogspot.com/2022/10/small-servers-for-performance-testing-v4.html

The SER4 is a SER 4700u from Beelink and the PN53 is an ASUS ExpertCenter PN53. Both use an 8-core AMD CPU with SMT disabled and run Ubuntu 22.04. The SER4 has an older, slower CPU than the PN53.

In all cases I compile from source using a configure command line like:

  ./configure --prefix=$pfx --enable-debug CFLAGS="-O2 -fno-omit-frame-pointer"

I used perf to get flamegraphs during the scan microbenchmark and they are archived here:
https://github.com/mdcallag/mytools/tree/master/bench/bugs/pg17beta2/24Jul5.sysbench.scan

For both SER4 and PN53 the time to finish the scan microbenchmark is ~10% longer in 17beta2 than it was in 17beta1 and 16.3. On the PN53 the query takes ~20 seconds with 16.3 and 17beta1 vs ~22.5 seconds for 17beta2 when the table has 60M rows.

From the SVG files for SER4 and 17beta2 I see ~2X more time in slot_getsomeattrs_int vs 17beta1 or 16.3, with all of that time spent in its child, tts_buffer_heap_getsomeattrs. That function is defined in src/backend/executor/execTuples.c and that file has not changed from 17beta1 to 17beta2. I don't keep up with individual commits to Postgres, so I won't guess at the root cause.

But the SVG files for PN53 don't show the same problem:
- for 16.3 I see 85.24% in ExecInterpExpr vs 11.64% in SeqNext
- for 17beta1 I see 82.82% in ExecInterpExpr vs 14.51% in SeqNext
- for 17beta2 I see 85.03% in ExecInterpExpr vs 12.31% in SeqNext
- for 17beta1 and 17beta2 the flamegraphs show time spent handling page faults during SeqNext, and that isn't visible on the 16.3 flamegraph

And for PN53, looking at slot_getsomeattrs_int (a child of ExecInterpExpr):
- for 16.3 I see 6.99% in slot_getsomeattrs_int
- for 17beta1 I see 4.29% in slot_getsomeattrs_int
- for 17beta2 I see 3.99% in slot_getsomeattrs_int

So at this point I am confused. I am repeating the test with a slightly larger table, while trying to keep the table small enough to fit in the Postgres buffer pool. I also have results from tables that are much larger than memory, and even in that case the problem can be reproduced.
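In case it helps anyone trying to reproduce or bisect this, here is roughly the perf + FlameGraph recipe behind the SVG files above (the exact flags may differ; $PID is the backend running the scan, and stackcollapse-perf.pl / flamegraph.pl are from Brendan Gregg's FlameGraph repo, assumed to be on the PATH):

  perf record -F 99 -g -p $PID -- sleep 60
  perf script | stackcollapse-perf.pl | flamegraph.pl > scan.svg

And, assuming the usual REL_17_BETA1 and REL_17_BETA2 tags in the Postgres git repo, one way to confirm that execTuples.c is unchanged and to list the commits between the betas that touch the executor or the heap access paths:

  git diff REL_17_BETA1..REL_17_BETA2 -- src/backend/executor/execTuples.c
  git log --oneline REL_17_BETA1..REL_17_BETA2 -- src/backend/executor src/backend/access/heap

-- 
Mark Callaghan
mdcal...@gmail.com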