Hi,
In a recent thread ([1]) I found a performance regression with the
following statement:
DO $do$
BEGIN
  FOR i IN 1 .. 10000 LOOP
    BEGIN
      EXECUTE $cf$CREATE OR REPLACE FUNCTION foo() RETURNS VOID
        LANGUAGE plpgsql AS $f$BEGIN frakbar; END;$f$;$cf$;
    EXCEPTION WHEN others THEN
    END;
  END LOOP;
END;$do$;
13:     1617.798 ms
14-dev: 34088.505 ms
The time in 14 is spent mostly below:
- 94.58% 0.01% postgres postgres [.] CreateFunction
- 94.57% CreateFunction
- 94.49% ProcedureCreate
- 90.95% record_object_address_dependencies
- 90.93% recordMultipleDependencies
- 89.65% isObjectPinned
- 89.12% systable_getnext
- 89.06% index_getnext_slot
- 56.13% index_fetch_heap
- 54.82% table_index_fetch_tuple
+ 53.79% heapam_index_fetch_tuple
0.07% heap_hot_search_buffer
0.01% ReleaseAndReadBuffer
0.01% LockBuffer
0.08% heapam_index_fetch_tuple
After a bit of debugging I figured out that the direct cause is commit
623a9ba79b. The problem is that subtransaction abort does not increment
ShmemVariableCache->xactCompletionCount. That's trivial to remedy (we
already hold ProcArrayLock during XidCacheRemoveRunningXids()).
What happens is that heap_hot_search_buffer() correctly recognizes the
aborted subtransaction's rows as dead, but they are not recognized as
"surely dead". That then leads to O(iterations^2) index->heap lookups,
because the index entries for rows from previous iterations are never
marked killed.
I initially was a bit worried that this could be a correctness issue as
well. The more I thought about it the more confused I got. A
transaction's subtransaction abort should not actually change the
contents of a snapshot, right?
Snapshot
GetSnapshotData(Snapshot snapshot)
...
	/*
	 * We don't include our own XIDs (if any) in the snapshot. It
	 * needs to be included in the xmin computation, but we did so
	 * outside the loop.
	 */
	if (pgxactoff == mypgxactoff)
		continue;
The sole reason for the behavioural difference is that the cached
snapshot's xmax is *lower* than a new snapshot's would be, because
GetSnapshotData() initializes xmax as
ShmemVariableCache->latestCompletedXid - which
XidCacheRemoveRunningXids() incremented, without incrementing
ShmemVariableCache->xactCompletionCount.
Which then causes HeapTupleSatisfiesMVCC to go down
if (!HeapTupleHeaderXminCommitted(tuple))
	...
	else if (XidInMVCCSnapshot(HeapTupleHeaderGetRawXmin(tuple), snapshot))
		return false;
	else if (TransactionIdDidCommit(HeapTupleHeaderGetRawXmin(tuple)))
		SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
					HeapTupleHeaderGetRawXmin(tuple));
	else
	{
		/* it must have aborted or crashed */
		SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
					InvalidTransactionId);
		return false;
	}
the "return false" for XidInMVCCSnapshot() rather than the
SetHintBits(HEAP_XMIN_INVALID) path. Which then in turn causes
HeapTupleIsSurelyDead() to not recognize the rows as surely dead.
bool
XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
{
	uint32		i;

	/*
	 * Make a quick range check to eliminate most XIDs without looking at the
	 * xip arrays.  Note that this is OK even if we convert a subxact XID to
	 * its parent below, because a subxact with XID < xmin has surely also got
	 * a parent with XID < xmin, while one with XID >= xmax must belong to a
	 * parent that was not yet committed at the time of this snapshot.
	 */

	/* Any xid < xmin is not in-progress */
	if (TransactionIdPrecedes(xid, snapshot->xmin))
		return false;
	/* Any xid >= xmax is in-progress */
	if (TransactionIdFollowsOrEquals(xid, snapshot->xmax))
		return true;
I *think* this issue doesn't lead to actually wrong query results. For
HeapTupleSatisfiesMVCC purposes there's not much of a difference between
an aborted transaction and one that's "in progress" according to the
snapshot (that's required - we don't check for aborts for xids in the
snapshot).
It is a bit disappointing that - as far as I could find - there are no
tests verifying that kill_prior_tuple actually works. I guess that lack,
combined with the fact that query results don't differ, explains why
nobody noticed the issue in the last ~9 months.
See the attached fix. I did include a test that verifies that the
kill_prior_tuples optimization actually prevents the index from growing,
when subtransactions are involved. I think it should be stable, even
with concurrent activity. But I'd welcome a look.
I don't think that's why the issue exists, but I very much hate the
XidCache* name. Makes it sound much less important than it is...
Greetings,
Andres Freund
[1] https://postgr.es/m/20210317055718.v6qs3ltzrformqoa%40alap3.anarazel.de
From 1e1259df48373934be43049d8aa4dad328e96b96 Mon Sep 17 00:00:00 2001
From: Andres Freund <[email protected]>
Date: Mon, 5 Apr 2021 21:21:14 -0700
Subject: [PATCH] snapshot caching vs subtransactions.
---
src/backend/storage/ipc/procarray.c | 8 ++++++
src/test/regress/expected/mvcc.out | 42 +++++++++++++++++++++++++++
src/test/regress/parallel_schedule | 2 +-
src/test/regress/serial_schedule | 1 +
src/test/regress/sql/mvcc.sql | 44 +++++++++++++++++++++++++++++
5 files changed, 96 insertions(+), 1 deletion(-)
create mode 100644 src/test/regress/expected/mvcc.out
create mode 100644 src/test/regress/sql/mvcc.sql
diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c
index e113a85aed4..bf776286de0 100644
--- a/src/backend/storage/ipc/procarray.c
+++ b/src/backend/storage/ipc/procarray.c
@@ -1210,6 +1210,11 @@ ProcArrayApplyRecoveryInfo(RunningTransactions running)
*/
MaintainLatestCompletedXidRecovery(running->latestCompletedXid);
+ /*
+ * NB: No need to increment ShmemVariableCache->xactCompletionCount here,
+ * nobody can see it yet.
+ */
+
LWLockRelease(ProcArrayLock);
/* ShmemVariableCache->nextXid must be beyond any observed xid. */
@@ -3915,6 +3920,9 @@ XidCacheRemoveRunningXids(TransactionId xid,
/* Also advance global latestCompletedXid while holding the lock */
MaintainLatestCompletedXid(latestXid);
+ /* ... and xactCompletionCount */
+ ShmemVariableCache->xactCompletionCount++;
+
LWLockRelease(ProcArrayLock);
}
diff --git a/src/test/regress/expected/mvcc.out b/src/test/regress/expected/mvcc.out
new file mode 100644
index 00000000000..16ed4ddf2d8
--- /dev/null
+++ b/src/test/regress/expected/mvcc.out
@@ -0,0 +1,42 @@
+--
+-- Verify that index scans encountering dead rows produced by an
+-- aborted subtransaction of the current transaction can utilize the
+-- kill_prior_tuple optimization
+--
+-- NB: The table size is currently *not* expected to stay the same, we
+-- don't have logic to trigger opportunistic pruning in cases like
+-- this.
+BEGIN;
+SET LOCAL enable_seqscan = false;
+SET LOCAL enable_indexonlyscan = false;
+SET LOCAL enable_bitmapscan = false;
+-- Can't easily use a unique index, since dead tuples can be found
+-- independent of the kill_prior_tuples optimization.
+CREATE TABLE clean_aborted_self(key int, data text);
+CREATE INDEX clean_aborted_self_key ON clean_aborted_self(key);
+INSERT INTO clean_aborted_self (key, data) VALUES (-1, 'just to allocate metapage');
+-- save index size from before the changes, for comparison
+SELECT pg_relation_size('clean_aborted_self_key') AS clean_aborted_self_key_before \gset
+DO $$
+BEGIN
+ -- iterate often enough to see index growth even on larger-than-default page sizes
+ FOR i IN 1..100 LOOP
+ BEGIN
+ -- perform index scan over all the inserted keys to get them to be seen as dead
+ IF EXISTS(SELECT * FROM clean_aborted_self WHERE key > 0 AND key < 100) THEN
+ RAISE data_corrupted USING MESSAGE = 'these rows should not exist';
+ END IF;
+ INSERT INTO clean_aborted_self SELECT g.i, 'rolling back in a sec' FROM generate_series(1, 100) g(i);
+ -- just some error that's not normally thrown
+ RAISE reading_sql_data_not_permitted USING MESSAGE = 'round and round again';
+ EXCEPTION WHEN reading_sql_data_not_permitted THEN END;
+ END LOOP;
+END;$$;
+-- show sizes only if they differ
+SELECT :clean_aborted_self_key_before AS size_before, pg_relation_size('clean_aborted_self_key') size_after
+WHERE :clean_aborted_self_key_before != pg_relation_size('clean_aborted_self_key');
+ size_before | size_after
+-------------+------------
+(0 rows)
+
+ROLLBACK;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 2e898390892..a0913008577 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -29,7 +29,7 @@ test: strings numerology point lseg line box path polygon circle date time timet
# geometry depends on point, lseg, box, path, polygon and circle
# horology depends on interval, timetz, timestamp, timestamptz
# ----------
-test: geometry horology regex type_sanity opr_sanity misc_sanity comments expressions unicode xid
+test: geometry horology regex type_sanity opr_sanity misc_sanity comments expressions unicode xid mvcc
# ----------
# These four each depend on the previous one
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index a46f3d01789..56448476015 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -11,6 +11,7 @@ test: int4
test: int8
test: oid
test: xid
+test: mvcc
test: float4
test: float8
test: bit
diff --git a/src/test/regress/sql/mvcc.sql b/src/test/regress/sql/mvcc.sql
new file mode 100644
index 00000000000..b22a86dc5e5
--- /dev/null
+++ b/src/test/regress/sql/mvcc.sql
@@ -0,0 +1,44 @@
+--
+-- Verify that index scans encountering dead rows produced by an
+-- aborted subtransaction of the current transaction can utilize the
+-- kill_prior_tuple optimization
+--
+-- NB: The table size is currently *not* expected to stay the same, we
+-- don't have logic to trigger opportunistic pruning in cases like
+-- this.
+BEGIN;
+
+SET LOCAL enable_seqscan = false;
+SET LOCAL enable_indexonlyscan = false;
+SET LOCAL enable_bitmapscan = false;
+
+-- Can't easily use a unique index, since dead tuples can be found
+-- independent of the kill_prior_tuples optimization.
+CREATE TABLE clean_aborted_self(key int, data text);
+CREATE INDEX clean_aborted_self_key ON clean_aborted_self(key);
+INSERT INTO clean_aborted_self (key, data) VALUES (-1, 'just to allocate metapage');
+
+-- save index size from before the changes, for comparison
+SELECT pg_relation_size('clean_aborted_self_key') AS clean_aborted_self_key_before \gset
+
+DO $$
+BEGIN
+ -- iterate often enough to see index growth even on larger-than-default page sizes
+ FOR i IN 1..100 LOOP
+ BEGIN
+ -- perform index scan over all the inserted keys to get them to be seen as dead
+ IF EXISTS(SELECT * FROM clean_aborted_self WHERE key > 0 AND key < 100) THEN
+ RAISE data_corrupted USING MESSAGE = 'these rows should not exist';
+ END IF;
+ INSERT INTO clean_aborted_self SELECT g.i, 'rolling back in a sec' FROM generate_series(1, 100) g(i);
+ -- just some error that's not normally thrown
+ RAISE reading_sql_data_not_permitted USING MESSAGE = 'round and round again';
+ EXCEPTION WHEN reading_sql_data_not_permitted THEN END;
+ END LOOP;
+END;$$;
+
+-- show sizes only if they differ
+SELECT :clean_aborted_self_key_before AS size_before, pg_relation_size('clean_aborted_self_key') size_after
+WHERE :clean_aborted_self_key_before != pg_relation_size('clean_aborted_self_key');
+
+ROLLBACK;
--
2.31.0.121.g9198c13e34