Hi,

If vacuum fails to remove a tuple whose xmax is older than VacuumCutoffs->OldestXmin but younger than GlobalVisState->maybe_needed, it will ERROR out while determining whether or not to freeze the tuple, with "cannot freeze committed xmax".
In back branches starting with 14, failing to remove tuples older than OldestXmin during pruning caused vacuum to loop infinitely in lazy_scan_prune(), as investigated on this [1] thread. On master, after 1ccc1e05ae removed the retry loop in lazy_scan_prune() and stopped comparing tuples to OldestXmin, the hang can no longer happen, but we can still attempt to freeze dead tuples visibly killed before OldestXmin -- resulting in an ERROR.

Pruning may fail to remove dead tuples with xmax before OldestXmin if the tuples are not considered removable by GlobalVisState. For vacuum, the GlobalVisState is initially calculated at the beginning of vacuuming the relation -- at the same time and with the same value as VacuumCutoffs->OldestXmin.

A backend's GlobalVisState may be updated again when it is accessed, if a new snapshot is taken or if something causes ComputeXidHorizons() to be called. This can happen, for example, at the end of a round of index vacuuming when GetOldestNonRemovableTransactionId() is called. Normally this results in GlobalVisState's horizon moving forward -- potentially allowing more dead tuples to be removed. However, if a disconnected standby with a running transaction older than VacuumCutoffs->OldestXmin reconnects to the primary after vacuum initially calculates GlobalVisState and OldestXmin but before GlobalVisState is updated, the value of GlobalVisState->maybe_needed can go backwards.

If this happens in the middle of vacuum's first pruning and freezing pass, pruning/freezing can then encounter a tuple whose xmax is younger than GlobalVisState->maybe_needed and older than VacuumCutoffs->OldestXmin. heap_prune_satisfies_vacuum() would deem the tuple HEAPTUPLE_RECENTLY_DEAD and would not remove it, but heap_pre_freeze_checks() would ERROR out with "cannot freeze committed xmax". That check exists to avoid freezing dead tuples.

We can fix this by always removing tuples considered dead before VacuumCutoffs->OldestXmin. This is okay even if a reconnected standby has a transaction that sees that tuple as alive: the standby will simply wait to replay the removal until it would be correct to do so, or recovery conflict handling will cancel the transaction that sees the tuple as alive and allow replay to continue.

Attached is the suggested fix for master plus a repro. I wrote it as a recovery suite TAP test, but I am _not_ proposing we add it to the ongoing test suite. It is, amongst other things, definitely prone to flaking. I also had to use loads of data to force two index vacuuming passes now that we have TIDStore, so it is a slow test.

If you want to run the repro with meson, you'll have to add 't/099_vacuum_hang.pl' to src/test/recovery/meson.build and then run it with:

  meson test postgresql:recovery / recovery/099_vacuum_hang

If you use autotools, you can run it with:

  make check PROVE_TESTS="t/099_vacuum_hang.pl"

The repro forces a round of index vacuuming after the standby reconnects and before pruning a dead tuple whose xmax is older than OldestXmin. At the end of that round of index vacuuming, _bt_pendingfsm_finalize() calls GetOldestNonRemovableTransactionId(), thereby updating the backend's GlobalVisState and moving maybe_needed backwards. Vacuum's first pass then continues with pruning and finds our later inserted and updated tuple HEAPTUPLE_RECENTLY_DEAD when compared to maybe_needed but HEAPTUPLE_DEAD when compared to OldestXmin.
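To make the problematic window concrete, here is a small, self-contained C sketch -- not PostgreSQL code, just an illustration that borrows the circular XID comparison rule used by TransactionIdPrecedes(); the XID values and the xid32/xid_precedes names are made up for the example. It shows how a committed xmax can be "not removable" relative to the moved-back maybe_needed while still preceding OldestXmin, which is the contradictory classification described above:

#include <stdint.h>
#include <stdio.h>
#include <stdbool.h>

typedef uint32_t xid32;		/* stand-in for an ordinary TransactionId */

/* Circular comparison of normal XIDs, same rule as TransactionIdPrecedes() */
static bool
xid_precedes(xid32 a, xid32 b)
{
	return (int32_t) (a - b) < 0;
}

int
main(void)
{
	/* Made-up XIDs chosen only to illustrate the ordering */
	xid32	oldest_xmin = 1000;	/* VacuumCutoffs->OldestXmin, computed once at start of vacuum */
	xid32	maybe_needed = 990;	/* GlobalVisState->maybe_needed, moved backwards after the standby reconnected */
	xid32	tuple_xmax = 995;	/* committed xmax of our dead tuple */

	/* Simplified pruning visibility test: deleter older than maybe_needed? */
	bool	removable_per_vistest = xid_precedes(tuple_xmax, maybe_needed);

	/* Freeze-path test: deleter older than OldestXmin? */
	bool	precedes_oldest_xmin = xid_precedes(tuple_xmax, oldest_xmin);

	printf("removable per GlobalVisState: %s\n", removable_per_vistest ? "yes" : "no");	/* no  */
	printf("xmax precedes OldestXmin:     %s\n", precedes_oldest_xmin ? "yes" : "no");	/* yes */

	/*
	 * Pruning therefore keeps the tuple as HEAPTUPLE_RECENTLY_DEAD, while the
	 * freeze checks see a committed xmax older than OldestXmin on a tuple
	 * they expected pruning to have removed: "cannot freeze committed xmax".
	 */
	return 0;
}

The attached fix closes this window by treating any tuple whose xmax precedes OldestXmin as HEAPTUPLE_DEAD during vacuum's pruning, regardless of what GlobalVisState says.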
I make sure that the standby reconnects between vacuum_get_cutoffs() and pruning because I have a cursor on the page keeping VACUUM FREEZE from getting a cleanup lock. See the repro for step-by-step explanations of how it works.

I have a modified version of this that repros the infinite loop on 14-16 with substantially less data. See it here [2]. Also, the repro attached to this mail won't work on 14 and 15 because of changes to background_psql.

- Melanie

[1] https://postgr.es/m/20240415173913.4zyyrwaftujxthf2%40awork3.anarazel.de#1b216b7768b5bd577a3d3d51bd5aadee
[2] https://www.postgresql.org/message-id/CAAKRu_Y_NJzF4-8gzTTeaOuUL3CcGoXPjXcAHbTTygT8AyVqag%40mail.gmail.com

From fb74a07d98d19147556fca1dc911a22d50a496e5 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplage...@gmail.com>
Date: Thu, 20 Jun 2024 16:38:30 -0400
Subject: [PATCH v1 1/2] Repro for vacuum ERROR out trying to freeze old dead
 tuple

This repro is not stable enough to be added as a test to the regression
suite. It is for demonstration purposes only.
---
 src/test/recovery/t/099_vacuum_hang.pl | 267 +++++++++++++++++++++++++
 1 file changed, 267 insertions(+)
 create mode 100644 src/test/recovery/t/099_vacuum_hang.pl

diff --git a/src/test/recovery/t/099_vacuum_hang.pl b/src/test/recovery/t/099_vacuum_hang.pl
new file mode 100644
index 00000000000..ed491dad432
--- /dev/null
+++ b/src/test/recovery/t/099_vacuum_hang.pl
@@ -0,0 +1,267 @@
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+use IPC::Run qw(pump);
+
+# Set up nodes
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 'physical');
+
+my $tablespace1 = "test_vacuum_hang_tblspc";
+
+$node_primary->append_conf(
+	'postgresql.conf', qq[
+hot_standby_feedback = on
+log_recovery_conflict_waits = true
+log_statement='all'
+log_connections=true
+log_lock_waits = true
+autovacuum = off
+maintenance_work_mem = 1024
+]);
+$node_primary->start;
+
+my $backup_name = 'my_backup';
+
+$node_primary->backup($backup_name);
+my $node_replica = PostgreSQL::Test::Cluster->new('standby');
+$node_replica->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+
+$node_replica->start;
+
+my $test_db = "test_db";
+$node_primary->safe_psql('postgres', "CREATE DATABASE $test_db");
+
+my $orig_conninfo = $node_primary->connstr();
+
+my $table1 = "test_vacuum_hang_table";
+my $index1 = "test_vacuum_hang_index";
+my $col1 = "col1";
+
+my $psql_timeout = IPC::Run::timer(15);
+
+# Long-running Primary Session A
+my $psql_primaryA =
+	$node_primary->background_psql($test_db, on_error_stop => 0);
+
+$psql_primaryA->query_safe("set application_name=A;");
+
+# Long-running Primary Session B
+my $psql_primaryB =
+	$node_primary->background_psql($test_db, on_error_stop => 0);
+
+$psql_primaryB->query_safe("set application_name=B;");
+
+# Long-running Replica Session A
+my $psql_replicaA =
+	$node_replica->background_psql($test_db, on_error_stop => 0);
+
+# Insert one tuple with value 1 which we can use to make sure the cursor has
+# successfully pinned and locked the buffer.
+#
+# Then insert and update enough rows that we force at least one round of index
+# vacuuming before getting to a dead tuple which was killed after the standby
+# is disconnected.
+#
+# Multiple index vacuuming passes are required to repro because after the
+# standby reconnects to the primary, our backend's GlobalVisStates will not
+# have been updated with the new horizon until an update is forced.
+#
+# _bt_pendingfsm_finalize() calls GetOldestNonRemovableTransactionId() at the
+# end of a round of index vacuuming, updating the backend's GlobalVisState
+# and, in our case, moving maybe_needed backwards.
+#
+# Then vacuum's first pass will continue and pruning will find our later
+# inserted and updated tuple HEAPTUPLE_RECENTLY_DEAD when compared to
+# maybe_needed but HEAPTUPLE_DEAD when compared to OldestXmin.
+$node_primary->safe_psql($test_db, qq[
+	CREATE TABLE ${table1}(${col1} int) with (autovacuum_enabled=false, fillfactor=10);
+	INSERT INTO $table1 VALUES (1);
+	INSERT INTO $table1 SELECT i & 3 FROM generate_series(2, 400000) i;
+	CREATE INDEX ${index1} on ${table1}(${col1});
+	UPDATE $table1 SET $col1 = 0 WHERE $col1 = 3;
+	UPDATE $table1 SET $col1 = 3 WHERE $col1 = 0;
+	UPDATE $table1 SET $col1 = 0 WHERE $col1 = 3;
+	UPDATE $table1 SET $col1 = 3 WHERE $col1 = 0;
+	UPDATE $table1 SET $col1 = 0 WHERE $col1 = 3;
+	UPDATE $table1 SET $col1 = 0 WHERE $col1 = 3;
+]);
+
+my $primary_lsn = $node_primary->lsn('flush');
+$node_primary->wait_for_catchup($node_replica, 'replay', $primary_lsn);
+
+my $walreceiver_pid = $node_replica->safe_psql($test_db, qq[
+	select pid from pg_stat_activity where backend_type = 'walreceiver';]);
+
+# Set primary_conninfo to something invalid on the replica and reload the
+# config. This will prevent the standby from reconnecting once the connection
+# is terminated. Then terminate the wal receiver. When a new WAL receiver
+# process starts up, it has to use the primary_conninfo to connect to the
+# primary and it will be unable to do so.
+$node_replica->safe_psql($test_db, qq[
+	ALTER SYSTEM SET primary_conninfo = '';
+	SELECT pg_reload_conf();
+	SELECT pg_terminate_backend($walreceiver_pid)]);
+
+# Ensure the WAL receiver is no longer active on replica.
+$node_replica->poll_query_until($test_db, qq[
+	select exists (select * from pg_stat_activity where pid = $walreceiver_pid);] , 'f');
+
+# DECLARE and use a cursor on standby, causing the block of the relation to be
+# pinned and locked in a buffer on the replica. It is important that this is
+# after termination of the WAL receiver so that the primary does not know about
+# the cursor and it can't hold back the horizon on the primary.
+my $replica_cursor1 = "test_vacuum_hang_cursor_replica";
+
+# FETCH FORWARD should return a 1. That's how we know the cursor has a pin and
+# lock.
+$psql_replicaA->query_until(
+	qr/^1$/m,
+	qq[
+	BEGIN;
+	DECLARE $replica_cursor1 CURSOR FOR SELECT $col1 FROM $table1;
+	FETCH FORWARD FROM $replica_cursor1;
+	]
+	);
+
+# Now insert and update a tuple which will be visible to the vacuum on the
+# primary but which will have xmax newer than the standby that was recently
+# disconnected.
+my $res = $psql_primaryA->query_safe(
+	qq[
+	INSERT INTO $table1 VALUES (99);
+	UPDATE $table1 SET $col1 = 100 WHERE $col1 = 99;
+	SELECT 'after_update';
+	]
+	);
+
+## Make sure the UPDATE finished
+like($res, qr/^after_update$/m, "UPDATE occurred on primary session A");
+
+# Open a cursor on the primary whose lock will keep VACUUM from getting a
+# cleanup lock on the first page of the relation. We want VACUUM to be able to
+# start, calculate initial values for OldestXmin and GlobalVisState and then be
+# unable to proceed with pruning our dead tuples. This will allow us to
+# reconnect the standby and push the horizon back before we start actual
+# pruning and vacuuming.
+my $primary_cursor1 = "test_vacuum_hang_cursor_primary";
+
+# FETCH FORWARD should return a 1. That's how we know the cursor has a pin and
+# lock.
+$res = $psql_primaryB->query_until(
+	qr/^1$/m,
+	qq[
+	BEGIN;
+	DECLARE $primary_cursor1 CURSOR FOR SELECT $col1 FROM $table1;
+	FETCH FORWARD FROM $primary_cursor1;
+	]
+	);
+
+my $vac_count_before = $node_primary->safe_psql($test_db,
+	qq[
+	SELECT vacuum_count
+	FROM pg_stat_all_tables WHERE relname = '${table1}';
+	]);
+
+# Now start a VACUUM FREEZE on the primary. It will be unable to get a cleanup
+# lock and start pruning, so it will hang. It will call
+# vacuum_get_cutoffs() and establish values of OldestXmin and
+# GlobalVisState which are newer than all of our dead tuples.
+$psql_primaryA->{stdin} .= qq[
+	VACUUM FREEZE $table1;
+	\\echo VACUUM
+	];
+
+$psql_primaryA->{run}->pump_nb();
+
+# Make sure that the VACUUM has already called vacuum_get_cutoffs() and is just
+# waiting on the cleanup lock to start vacuuming. pg_stat_progress_vacuum will
+# have been updated to phase 'scanning heap'. We don't want the standby to
+# re-establish a connection to the primary and push the horizon back until
+# we've saved initial values in GlobalVisState and calculated OldestXmin.
+# If for some reason this doesn't work, I found that throwing in a sleep helps.
+$node_primary->poll_query_until($test_db,
+	qq[ SELECT count(*) >= 1
+	FROM pg_stat_progress_vacuum
+	WHERE phase = 'scanning heap' AND
+	'${table1}'::regclass::oid = relid;
+	] ,
+	't');
+
+# Ensure the WAL receiver is still not active on the replica.
+$node_replica->poll_query_until($test_db, qq[
+	select exists (select * from pg_stat_wal_receiver);] , 'f');
+
+# Allow the WAL receiver connection to re-establish.
+$node_replica->safe_psql(
+	$test_db, qq[
+	ALTER SYSTEM SET primary_conninfo = '$orig_conninfo';
+	SELECT pg_reload_conf();
+	]);
+
+$node_replica->poll_query_until($test_db, qq[
+	select exists (select * from pg_stat_wal_receiver);] , 't');
+
+# Once the WAL sender is shown on the primary, the replica should have
+# connected with the primary and pushed the horizon backward. Primary Session A
+# won't see that until the VACUUM FREEZE proceeds and does its first round of
+# index vacuuming.
+$node_primary->poll_query_until($test_db, qq[
+	select exists (select * from pg_stat_replication);] , 't');
+
+# Commit the cursor so that the VACUUM can proceed.
+$psql_primaryB->query_until(
+	qr/^commit$/m,
+	qq[
+	COMMIT;
+	\\echo commit
+	]
+	);
+
+$psql_primaryA->{run}->pump_nb();
+
+# VACUUM proceeds with pruning and does a visibility check on each tuple. It
+# will find our final dead tuple non-removable (HEAPTUPLE_RECENTLY_DEAD) since
+# its xmax is after the new value of maybe_needed. Without the fix, after
+# pruning, in lazy_scan_prune(), vacuum does another visibility check, this
+# time with HeapTupleSatisfiesVacuum() which compares dead_after to OldestXmin.
+# It will find the tuple HEAPTUPLE_DEAD since its xmax precedes OldestXmin.
+# This will cause the infinite loop.
+pump $psql_primaryA->{run} until ($psql_primaryA->{stdout} =~ /VACUUM/ || $psql_timeout->is_expired);
+
+ok(!$psql_timeout->is_expired);
+
+my $vac_count_after = $node_primary->safe_psql($test_db,
+	qq[
+	SELECT vacuum_count
+	FROM pg_stat_all_tables WHERE relname = '${table1}';
+	]);
+
+# Ensure that the VACUUM FREEZE completed successfully
+cmp_ok($vac_count_after, '>', $vac_count_before);
+
+# Commit the original cursor transaction on the replica so it can catch up. It
+# will end up replaying the VACUUM and removing the tuple too.
+$psql_replicaA->query_safe( qq[ COMMIT; ] );
+
+$psql_replicaA->{run}->pump_nb();
+
+$primary_lsn = $node_primary->lsn('flush');
+
+# Make sure something causes us to flush
+$node_primary->safe_psql($test_db, "INSERT INTO $table1 VALUES (1);");
+$node_primary->wait_for_catchup($node_replica, 'replay', $primary_lsn);
+
+## Shut down psqls
+$psql_primaryA->quit;
+$psql_primaryB->quit;
+$psql_replicaA->quit;
+
+$node_replica->stop();
+$node_primary->stop();
+
+done_testing();
--
2.34.1

From 03e98bcbd1e30815f2219f0d21da0da1758fe0a0 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplage...@gmail.com>
Date: Thu, 20 Jun 2024 17:21:11 -0400
Subject: [PATCH v1 2/2] Ensure vacuum removes all dead tuples older than
 OldestXmin

If vacuum fails to remove a tuple with xmax older than
VacuumCutoffs->OldestXmin and younger than GlobalVisState->maybe_needed,
it will ERROR out when determining whether or not to freeze the tuple
with "cannot freeze committed xmax". Fix this by having vacuum always
remove tuples older than OldestXmin.

Further explanation:

Pruning uses the GlobalVisState (via GlobalVisTestIsRemovableXid()) to
determine which dead tuples are removable. Pruning initiated by vacuum
also considers whether or not tuples should be frozen. Tuples which are
HEAPTUPLE_RECENTLY_DEAD may be frozen if their xmax is older than
VacuumCutoffs->OldestXmin -- which is determined at the beginning of
vacuuming the relation.

For vacuum, the GlobalVisState is updated at the beginning of vacuuming
the relation -- at the same time and with the same value as
VacuumCutoffs->OldestXmin.

A backend's GlobalVisState may be updated when it is accessed if a new
snapshot is taken or if something caused ComputeXidHorizons() to be
called. This can happen during vacuum at the end of a round of index
vacuuming when GetOldestNonRemovableTransactionId() is called.

If a disconnected standby with a running transaction older than
VacuumCutoffs->OldestXmin reconnects to the primary after vacuum
initially calculates GlobalVisState and OldestXmin but before
GlobalVisState is updated, the value of GlobalVisState->maybe_needed
could go backwards.

If this happens in the middle of vacuum's first pruning and freezing
pass, it is possible that pruning/freezing could then encounter a tuple
whose xmax is younger than GlobalVisState->maybe_needed and older than
VacuumCutoffs->OldestXmin. heap_prune_satisfies_vacuum() would deem the
tuple HEAPTUPLE_RECENTLY_DEAD and would not remove it. But the
heap_pre_freeze_checks() would ERROR out with "cannot freeze committed
xmax". This check is to avoid ever freezing dead tuples.

In back branches starting with 14, failing to remove tuples older than
OldestXmin during pruning caused vacuum to infinitely loop in
lazy_scan_prune(), as investigated on this [1] thread. After 1ccc1e05ae
removed the retry loop in lazy_scan_prune() and stopped comparing tuples
to OldestXmin, the hang could no longer happen, but we could still
attempt to freeze dead tuples with xmax older than OldestXmin --
resulting in an ERROR.

Fix this by always removing dead tuples with xmax older than
VacuumCutoffs->OldestXmin. This is okay because the standby won't replay
the tuple removal until that tuple is removable on the standby. Thus,
the worst that can happen is a recovery conflict.

[1] https://postgr.es/m/20240415173913.4zyyrwaftujxthf2%40awork3.anarazel.de#1b216b7768b5bd577a3d3d51bd5aadee
---
 src/backend/access/heap/pruneheap.c  | 22 +++++++++++++++++++++-
 src/backend/access/heap/vacuumlazy.c | 14 +++++++-------
 2 files changed, 28 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 3cdfc5b7f1b..72e5fe6dad8 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -325,6 +325,8 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
  *
  * cutoffs contains the freeze cutoffs, established by VACUUM at the beginning
  * of vacuuming the relation. Required if HEAP_PRUNE_FREEZE option is set.
+ * cutoffs->OldestXmin is also used to determine if dead tuples are
+ * HEAPTUPLE_RECENTLY_DEAD or HEAPTUPLE_DEAD.
  *
  * presult contains output parameters needed by callers, such as the number of
  * tuples removed and the offsets of dead items on the page after pruning.
@@ -922,8 +924,26 @@ heap_prune_satisfies_vacuum(PruneState *prstate, HeapTuple tup, Buffer buffer)
 	if (res != HEAPTUPLE_RECENTLY_DEAD)
 		return res;
 
+	/*
+	 * For VACUUM, we must be sure to prune tuples whose xmax is older than
+	 * OldestXmin -- a visibility cutoff determined at the beginning of
+	 * vacuuming the relation. OldestXmin is used to decide whether or not to
+	 * freeze a tuple, and we should not freeze dead tuples.
+	 */
+	if (prstate->cutoffs &&
+		TransactionIdIsValid(prstate->cutoffs->OldestXmin) &&
+		NormalTransactionIdPrecedes(dead_after, prstate->cutoffs->OldestXmin))
+		return HEAPTUPLE_DEAD;
+
+	/*
+	 * HOT pruning uses the GlobalVisState to determine if the tuple is dead.
+	 * And for vacuum, even if the tuple's xmax is not older than OldestXmin,
+	 * GlobalVisTestIsRemovableXid() could find the row dead if the
+	 * GlobalVisState has been updated since the beginning of vacuuming the
+	 * relation.
+	 */
 	if (GlobalVisTestIsRemovableXid(prstate->vistest, dead_after))
-		res = HEAPTUPLE_DEAD;
+		return HEAPTUPLE_DEAD;
 
 	return res;
 }
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 3f88cf1e8ef..abb47ae5960 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -438,13 +438,13 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 * as an upper bound on the XIDs stored in the pages we'll actually scan
 	 * (NewRelfrozenXid tracking must never be allowed to miss unfrozen XIDs).
 	 *
-	 * Next acquire vistest, a related cutoff that's used in pruning. We
-	 * expect vistest will always make heap_page_prune_and_freeze() remove any
-	 * deleted tuple whose xmax is < OldestXmin. lazy_scan_prune must never
-	 * become confused about whether a tuple should be frozen or removed. (In
-	 * the future we might want to teach lazy_scan_prune to recompute vistest
-	 * from time to time, to increase the number of dead tuples it can prune
-	 * away.)
+	 * Next acquire vistest, a related cutoff that's used in pruning. We use
+	 * vistest in combination with OldestXmin to ensure that
+	 * heap_page_prune_and_freeze() always removes any deleted tuple whose
+	 * xmax is < OldestXmin. lazy_scan_prune must never become confused about
+	 * whether a tuple should be frozen or removed. (In the future we might
+	 * want to teach lazy_scan_prune to recompute vistest from time to time,
+	 * to increase the number of dead tuples it can prune away.)
 	 */
 	vacrel->aggressive = vacuum_get_cutoffs(rel, params, &vacrel->cutoffs);
 	vacrel->rel_pages = orig_rel_pages = RelationGetNumberOfBlocks(rel);
--
2.34.1