On Sat, Dec 6, 2025 at 3:04 PM Peter Geoghegan <[email protected]> wrote:
> My best guess is that the benefits I see come from eliminating a
> dependent load. Without the second patch applied, I see this
> disassembly for _bt_checkkeys:
>
> mov    rax,QWORD PTR [rdi+0x38]   ; Load scan->opaque
> mov    r15d,DWORD PTR [rax+0x70]  ; Load so->dir
>
> A version with the second patch applied still loads a pointer passed
> by the _bt_checkkeys caller (_bt_readpage), but doesn't have to chase
> another pointer to get to it. Maybe this significantly ameliorates
> execution port pressure in the cases where I see a speedup?
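
To make the dependent-load point concrete: the second load's address
depends on the first load's result, so it can't even issue until the
first load completes. Passing the direction down as an argument
replaces that chain with a plain parameter read. Roughly (struct and
call shapes simplified here, not the actual code):

    /* before: two chained loads on every _bt_checkkeys call */
    BTScanOpaque so = (BTScanOpaque) scan->opaque;  /* load #1 */
    ScanDirection dir = so->dir;          /* load #2 waits on #1 */

    /* after: _bt_readpage already has dir, and just passes it down */
    _bt_checkkeys(scan, dir, ...);        /* remaining args elided */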

Following profiling with perf, I found a way to further speed up the
queries that the second patch already helped with: if _bt_readpage
takes a local copy of scan->ignore_killed_tuples when first called,
and then uses that local copy within its per-tuple loop (instead of
reading scan->ignore_killed_tuples directly), I get an additional 1%
speedup over what I reported earlier today. In other words, the
range/BETWEEN pgbench variant I summarized earlier goes from being
about 4.5% faster than master to about 5.5% faster. Testing has also
shown that the ignore_killed_tuples enhancement doesn't significantly
change the picture with other types of queries (such as the default
pgbench SELECT).

In short, this ignore_killed_tuples change makes the second patch
from v1 more effective, seemingly by further ameliorating the same
bottleneck. Apparently, accessing scan->ignore_killed_tuples created
another load-use hazard in the same tight inner loop (the per-tuple
_bt_readpage loop). That matters with these queries, where we do very
little work per tuple (_bt_readpage's pstate.startikey optimization
is as effective as possible here), while each test query returns
quite a few tuples (2,000 of them).
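
(In case it's not obvious why the compiler can't simply hoist the
load itself: scan escapes into helper calls made from the loop body,
so the compiler can't prove that nothing writes to
scan->ignore_killed_tuples mid-loop, and conservatively reloads it on
every iteration. A minimal sketch of the pattern, with the real
per-tuple work elided:

    bool    ignore_killed_tuples = scan->ignore_killed_tuples;

    for (offnum = minoff; offnum <= maxoff;
         offnum = OffsetNumberNext(offnum))
    {
        ItemId      iid = PageGetItemId(page, offnum);

        /* tests a register, instead of reloading through scan */
        if (ignore_killed_tuples && ItemIdIsDead(iid))
            continue;

        /* ... the rest of the per-tuple loop ... */
    }
)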

Since this ignore_killed_tuples change is also very simple, and seems
like an easy win, I think that it can be committed as part of the
second patch, without having to wait for too much more performance
validation.

Attached are two text files showing pgbench output/summary info,
generated by my test script (both are from runs that took place
within the last 2 hours). One file (v1-only.out) just confirms what I
reported earlier on, with an unmodified v1 patchset. The other file
(v1-plus-ignore_killed_tuples-change.out) shows detailed results for
the v1 patchset with the ignore_killed_tuples change also applied,
using the same pgbench config/workload; it has the full details
behind my "~5.5% faster than master" claim.

The pgbench script used for this is as follows:

\set aid random_exponential(1, 100000 * :scale, 3.0)
\set endrange :aid + 2000
SELECT abalance FROM pgbench_accounts WHERE aid between :aid AND :endrange;
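
For anyone wanting to reproduce this, an invocation along these lines
should work (the script file name and the duration/client counts are
placeholders here, not the exact settings from my runs):

pgbench -n -M prepared -f range_between.sql -T 60 -c 16 -j 16

pgbench sets :scale automatically (from the pgbench_branches row
count), so the script runs unmodified against any standard pgbench
database.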

I'm deliberately not attaching a new v2 for this ignore_killed_tuples
change right now: the first patch is a few hundred KBs, and I don't
want this email to get held up in moderation. Instead, I've attached
the ignore_killed_tuples change as its own patch (with a .txt
extension, just to avoid confusing CFTester).

-- 
Peter Geoghegan
From 19ec9bd82d6cd6369c3413e10234c17fd9973df4 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <[email protected]>
Date: Sat, 6 Dec 2025 16:28:19 -0500
Subject: [PATCH 3/3] Use ignore_killed_tuples local variable

---
 src/backend/access/nbtree/nbtreadpage.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/src/backend/access/nbtree/nbtreadpage.c b/src/backend/access/nbtree/nbtreadpage.c
index 540d172cc..7f9c66b8b 100644
--- a/src/backend/access/nbtree/nbtreadpage.c
+++ b/src/backend/access/nbtree/nbtreadpage.c
@@ -141,7 +141,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
        OffsetNumber minoff;
        OffsetNumber maxoff;
        BTReadPageState pstate;
-       bool            arrayKeys;
+       bool            arrayKeys,
+                               ignore_killed_tuples = scan->ignore_killed_tuples;
        int                     itemIndex,
                                indnatts;
 
@@ -246,7 +247,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
                         * If the scan specifies not to return killed tuples, then we
                         * treat a killed tuple as not passing the qual
                         */
-                       if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
+                       if (ignore_killed_tuples && ItemIdIsDead(iid))
                        {
                                offnum = OffsetNumberNext(offnum);
                                continue;
@@ -404,7 +405,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum,
                         * uselessly advancing to the page to the left.  This is similar
                         * to the high key optimization used by forward scans.
                         */
-                       if (scan->ignore_killed_tuples && ItemIdIsDead(iid))
+                       if (ignore_killed_tuples && ItemIdIsDead(iid))
                        {
                                if (offnum > minoff)
                                {
-- 
2.51.0

Attachment: v1-plus-ignore_killed_tuples-change.out
Description: Binary data

Attachment: v1-only.out
Description: Binary data
