Hi,
In a recent thread ([1]) I found a performance regression with the
following statement:
DO $do$
BEGIN
  FOR i IN 1 .. 10000 LOOP
    BEGIN
      EXECUTE $cf$CREATE OR REPLACE FUNCTION foo() RETURNS VOID
        LANGUAGE plpgsql AS $f$BEGIN frakbar; END;$f$;$cf$;
    EXCEPTION WHEN others THEN
    END;
  END LOOP;
END;$do$;
13:     1617.798 ms
14-dev: 34088.505 ms
The time in 14 is spent mostly below:
- 94.58% 0.01% postgres postgres [.] CreateFunction
- 94.57% CreateFunction
- 94.49% ProcedureCreate
- 90.95% record_object_address_dependencies
- 90.93% recordMultipleDependencies
- 89.65% isObjectPinned
- 89.12% systable_getnext
- 89.06% index_getnext_slot
- 56.13% index_fetch_heap
- 54.82% table_index_fetch_tuple
+ 53.79% heapam_index_fetch_tuple
0.07% heap_hot_search_buffer
0.01% ReleaseAndReadBuffer
0.01% LockBuffer
0.08% heapam_index_fetch_tuple
After a bit of debugging I figured out that the direct cause is commit
623a9ba79b. The problem is that subtransaction abort does not increment
ShmemVariableCache->xactCompletionCount. That's trivial to remedy (we
already hold ProcArrayLock during XidCacheRemoveRunningXids()).
What happens is that heap_hot_search_buffer() correctly recognizes the
aborted subtransaction's rows as dead, but they are not recognized as
"surely dead". That then leads to O(iterations^2) index->heap lookups,
because the index entries for rows from previous iterations are never
marked killed.
I initially was a bit worried that this could be a correctness issue as
well. The more I thought about it the more confused I got. A
transaction's subtransaction abort should not actually change the
contents of a snapshot, right?
Snapshot
GetSnapshotData(Snapshot snapshot)
...
	/*
	 * We don't include our own XIDs (if any) in the snapshot. It
	 * needs to be included in the xmin computation, but we did so
	 * outside the loop.
	 */
	if (pgxactoff == mypgxactoff)
		continue;
The sole reason for the behavioural difference is that the cached
snapshot's xmax is *lower* than a new snapshot's would be, because
GetSnapshotData() initializes xmax as
ShmemVariableCache->latestCompletedXid - which
XidCacheRemoveRunningXids() incremented, without incrementing
ShmemVariableCache->xactCompletionCount.
Which then causes HeapTupleSatisfiesMVCC to go down
if (!HeapTupleHeaderXminCommitted(tuple))
	...
	else if (XidInMVCCSnapshot(HeapTupleHeaderGetRawXmin(tuple), snapshot))
		return false;
	else if (TransactionIdDidCommit(HeapTupleHeaderGetRawXmin(tuple)))
		SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
					HeapTupleHeaderGetRawXmin(tuple));
	else
	{
		/* it must have aborted or crashed */
		SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
					InvalidTransactionId);
		return false;
	}
the "return false" for XidInMVCCSnapshot() rather than the
SetHintBits(HEAP_XMIN_INVALID) path. Which then in turn causes
HeapTupleIsSurelyDead() to not recognize the rows as surely dead.
bool
XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
{
	uint32		i;

	/*
	 * Make a quick range check to eliminate most XIDs without looking at the
	 * xip arrays.  Note that this is OK even if we convert a subxact XID to
	 * its parent below, because a subxact with XID < xmin has surely also got
	 * a parent with XID < xmin, while one with XID >= xmax must belong to a
	 * parent that was not yet committed at the time of this snapshot.
	 */

	/* Any xid < xmin is not in-progress */
	if (TransactionIdPrecedes(xid, snapshot->xmin))
		return false;
	/* Any xid >= xmax is in-progress */
	if (TransactionIdFollowsOrEquals(xid, snapshot->xmax))
		return true;
I *think* this issue doesn't lead to actually wrong query results. For
HeapTupleSatisfiesMVCC purposes there's not much of a difference between
an aborted transaction and one that's "in progress" according to the
snapshot (that's required - we don't check for aborts for xids in the
snapshot).
It is a bit disappointing that - as far as I could find - there are no
tests verifying that kill_prior_tuple actually works. I guess that lack,
combined with the fact that query results don't differ, explains why
nobody noticed the issue in the last ~9 months.
See the attached fix. I did include a test that verifies that the
kill_prior_tuples optimization actually prevents the index from growing,
when subtransactions are involved. I think it should be stable, even
with concurrent activity. But I'd welcome a look.
I don't think that's why the issue exists, but I very much hate the
XidCache* name. Makes it sound much less important than it is...
Greetings,
Andres Freund
[1] https://postgr.es/m/20210317055718.v6qs3ltzrformqoa%40alap3.anarazel.de
From 1e1259df48373934be43049d8aa4dad328e96b96 Mon Sep 17 00:00:00 2001
From: Andres Freund <[email protected]>
Date: Mon, 5 Apr 2021 21:21:14 -0700
Subject: [PATCH] snapshot caching vs subtransactions.
---
src/backend/storage/ipc/procarray.c | 8 ++++++
src/test/regress/expected/mvcc.out | 42 +++++++++++++++++++++++++++
src/test/regress/parallel_schedule | 2 +-
src/test/regress/serial_schedule | 1 +
src/test/regress/sql/mvcc.sql | 44 +++++++++++++++++++++++++++++
5 files changed, 96 insertions(+), 1 deletion(-)
create mode 100644 src/test/regress/expected/mvcc.out
create mode 100644 src/test/regress/sql/mvcc.sql
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index e113a85aed4..bf776286de0 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -1210,6 +1210,11 @@ ProcArrayApplyRecoveryInfo(RunningTransactions running)
*/
MaintainLatestCompletedXidRecovery(running->latestCompletedXid);
+ /*
+ * NB: No need to increment ShmemVariableCache->xactCompletionCount here,
+ * nobody can see it yet.
+ */
+
LWLockRelease(ProcArrayLock);
/* ShmemVariableCache->nextXid must be beyond any observed xid. */
@@ -3915,6 +3920,9 @@ XidCacheRemoveRunningXids(TransactionId xid,
/* Also advance global latestCompletedXid while holding the lock */
MaintainLatestCompletedXid(latestXid);
+ /* ... and xactCompletionCount */
+ ShmemVariableCache->xactCompletionCount++;
+
LWLockRelease(ProcArrayLock);
}
diff --git a/src/test/regress/expected/mvcc.out b/src/test/regress/expected/mvcc.out
new file mode 100644
index 00000000000..16ed4ddf2d8
--- /dev/null
+++ b/src/test/regress/expected/mvcc.out
@@ -0,0 +1,42 @@
+--
+-- Verify that index scans encountering dead rows produced by an
+-- aborted subtransaction of the current transaction can utilize the
+-- kill_prior_tuple optimization
+--
+-- NB: The table size is currently *not* expected to stay the same, we
+-- don't have logic to trigger opportunistic pruning in cases like
+-- this.
+BEGIN;
+SET LOCAL enable_seqscan = false;
+SET LOCAL enable_indexonlyscan = false;
+SET LOCAL enable_bitmapscan = false;
+-- Can't easily use a unique index, since dead tuples can be found
+-- independent of the kill_prior_tuples optimization.
+CREATE TABLE clean_aborted_self(key int, data text);
+CREATE INDEX clean_aborted_self_key ON clean_aborted_self(key);
+INSERT INTO clean_aborted_self (key, data) VALUES (-1, 'just to allocate metapage');
+-- save index size from before the changes, for comparison
+SELECT pg_relation_size('clean_aborted_self_key') AS clean_aborted_self_key_before \gset
+DO $$
+BEGIN
+ -- iterate often enough to see index growth even on larger-than-default page sizes
+ FOR i IN 1..100 LOOP
+ BEGIN
+ -- perform index scan over all the inserted keys to get them to be seen as dead
+ IF EXISTS(SELECT * FROM clean_aborted_self WHERE key > 0 AND key < 100) THEN
+ RAISE data_corrupted USING MESSAGE = 'these rows should not exist';
+ END IF;
+ INSERT INTO clean_aborted_self SELECT g.i, 'rolling back in a sec' FROM generate_series(1, 100) g(i);
+ -- just some error that's not normally thrown
+ RAISE reading_sql_data_not_permitted USING MESSAGE = 'round and round again';
+ EXCEPTION WHEN reading_sql_data_not_permitted THEN END;
+ END LOOP;
+END;$$;
+-- show sizes only if they differ
+SELECT :clean_aborted_self_key_before AS size_before, pg_relation_size('clean_aborted_self_key') size_after
+WHERE :clean_aborted_self_key_before != pg_relation_size('clean_aborted_self_key');
+ size_before | size_after
+-------------+------------
+(0 rows)
+
+ROLLBACK;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 2e898390892..a0913008577 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -29,7 +29,7 @@ test: strings numerology point lseg line box path polygon circle date time timet
# geometry depends on point, lseg, box, path, polygon and circle
# horology depends on interval, timetz, timestamp, timestamptz
# ----------
-test: geometry horology regex type_sanity opr_sanity misc_sanity comments expressions unicode xid
+test: geometry horology regex type_sanity opr_sanity misc_sanity comments expressions unicode xid mvcc
# ----------
# These four each depend on the previous one
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index a46f3d01789..56448476015 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -11,6 +11,7 @@ test: int4
test: int8
test: oid
test: xid
+test: mvcc
test: float4
test: float8
test: bit
diff --git a/src/test/regress/sql/mvcc.sql b/src/test/regress/sql/mvcc.sql
new file mode 100644
index 00000000000..b22a86dc5e5
--- /dev/null
+++ b/src/test/regress/sql/mvcc.sql
@@ -0,0 +1,44 @@
+--
+-- Verify that index scans encountering dead rows produced by an
+-- aborted subtransaction of the current transaction can utilize the
+-- kill_prior_tuple optimization
+--
+-- NB: The table size is currently *not* expected to stay the same, we
+-- don't have logic to trigger opportunistic pruning in cases like
+-- this.
+BEGIN;
+
+SET LOCAL enable_seqscan = false;
+SET LOCAL enable_indexonlyscan = false;
+SET LOCAL enable_bitmapscan = false;
+
+-- Can't easily use a unique index, since dead tuples can be found
+-- independent of the kill_prior_tuples optimization.
+CREATE TABLE clean_aborted_self(key int, data text);
+CREATE INDEX clean_aborted_self_key ON clean_aborted_self(key);
+INSERT INTO clean_aborted_self (key, data) VALUES (-1, 'just to allocate metapage');
+
+-- save index size from before the changes, for comparison
+SELECT pg_relation_size('clean_aborted_self_key') AS clean_aborted_self_key_before \gset
+
+DO $$
+BEGIN
+ -- iterate often enough to see index growth even on larger-than-default page sizes
+ FOR i IN 1..100 LOOP
+ BEGIN
+ -- perform index scan over all the inserted keys to get them to be seen as dead
+ IF EXISTS(SELECT * FROM clean_aborted_self WHERE key > 0 AND key < 100) THEN
+ RAISE data_corrupted USING MESSAGE = 'these rows should not exist';
+ END IF;
+ INSERT INTO clean_aborted_self SELECT g.i, 'rolling back in a sec' FROM generate_series(1, 100) g(i);
+ -- just some error that's not normally thrown
+ RAISE reading_sql_data_not_permitted USING MESSAGE = 'round and round again';
+ EXCEPTION WHEN reading_sql_data_not_permitted THEN END;
+ END LOOP;
+END;$$;
+
+-- show sizes only if they differ
+SELECT :clean_aborted_self_key_before AS size_before, pg_relation_size('clean_aborted_self_key') size_after
+WHERE :clean_aborted_self_key_before != pg_relation_size('clean_aborted_self_key');
+
+ROLLBACK;
--
2.31.0.121.g9198c13e34