Hi,

If vacuum fails to remove a tuple whose xmax is older than VacuumCutoffs->OldestXmin but younger than GlobalVisState->maybe_needed, it will ERROR out while determining whether or not to freeze the tuple, with "cannot freeze committed xmax".
In back branches starting with 14, failing to remove tuples older than OldestXmin during pruning caused vacuum to loop infinitely in lazy_scan_prune(), as investigated on this [1] thread. On master, after 1ccc1e05ae removed the retry loop in lazy_scan_prune() and stopped comparing tuples to OldestXmin, the hang can no longer happen, but we can still attempt to freeze dead tuples visibly killed before OldestXmin -- resulting in an ERROR.

Pruning may fail to remove dead tuples with xmax before OldestXmin if the tuples are not considered removable by GlobalVisState. For vacuum, the GlobalVisState is initially calculated at the beginning of vacuuming the relation -- at the same time and with the same value as VacuumCutoffs->OldestXmin.

A backend's GlobalVisState may be updated again when it is accessed, if a new snapshot is taken or if something causes ComputeXidHorizons() to be called. This can happen, for example, at the end of a round of index vacuuming when GetOldestNonRemovableTransactionId() is called. Normally this results in GlobalVisState's horizon moving forward -- potentially allowing more dead tuples to be removed. However, if a disconnected standby with a running transaction older than VacuumCutoffs->OldestXmin reconnects to the primary after vacuum initially calculates GlobalVisState and OldestXmin but before GlobalVisState is updated, the value of GlobalVisState->maybe_needed can go backwards.

If this happens in the middle of vacuum's first pruning and freezing pass, pruning/freezing can then encounter a tuple whose xmax is younger than GlobalVisState->maybe_needed and older than VacuumCutoffs->OldestXmin. heap_prune_satisfies_vacuum() would deem the tuple HEAPTUPLE_RECENTLY_DEAD and would not remove it, but heap_pre_freeze_checks() would ERROR out with "cannot freeze committed xmax". That check exists to avoid freezing dead tuples.

We can fix this by always removing tuples considered dead before VacuumCutoffs->OldestXmin. This is okay even if a reconnected standby has a transaction that sees that tuple as alive: the standby will simply wait to replay the removal until it would be correct to do so, or recovery conflict handling will cancel the transaction that sees the tuple as alive and allow replay to continue.

Attached is the suggested fix for master plus a repro. I wrote it as a recovery suite TAP test, but I am _not_ proposing we add it to the ongoing test suite. It is, amongst other things, definitely prone to flaking. I also had to use loads of data to force two index vacuuming passes now that we have TIDStore, so it is a slow test.

If you want to run the repro with meson, you'll have to add 't/099_vacuum_hang.pl' to src/test/recovery/meson.build and then run it with:

  meson test postgresql:recovery / recovery/099_vacuum_hang

If you use autotools, you can run it with:

  make check PROVE_TESTS="t/099_vacuum_hang.pl"

The repro forces a round of index vacuuming after the standby reconnects and before pruning a dead tuple whose xmax is older than OldestXmin. At the end of that round of index vacuuming, _bt_pendingfsm_finalize() calls GetOldestNonRemovableTransactionId(), thereby updating the backend's GlobalVisState and moving maybe_needed backwards. Vacuum's first pass then continues with pruning and finds our later inserted and updated tuple HEAPTUPLE_RECENTLY_DEAD when compared to maybe_needed but HEAPTUPLE_DEAD when compared to OldestXmin.
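To make the problematic window concrete, here is a small, self-contained C sketch -- not PostgreSQL code, just an illustration that borrows the circular XID comparison rule used by TransactionIdPrecedes(); the XID values and the xid32/xid_precedes names are made up for the example. It shows how a committed xmax can be "not removable" relative to the moved-back maybe_needed while still preceding OldestXmin, which is the contradictory classification described above:

#include <stdint.h>
#include <stdio.h>
#include <stdbool.h>

typedef uint32_t xid32;		/* stand-in for an ordinary TransactionId */

/* Circular comparison of normal XIDs, same rule as TransactionIdPrecedes() */
static bool
xid_precedes(xid32 a, xid32 b)
{
	return (int32_t) (a - b) < 0;
}

int
main(void)
{
	/* Made-up XIDs chosen only to illustrate the ordering */
	xid32	oldest_xmin = 1000;	/* VacuumCutoffs->OldestXmin, computed once at start of vacuum */
	xid32	maybe_needed = 990;	/* GlobalVisState->maybe_needed, moved backwards after the standby reconnected */
	xid32	tuple_xmax = 995;	/* committed xmax of our dead tuple */

	/* Simplified pruning visibility test: deleter older than maybe_needed? */
	bool	removable_per_vistest = xid_precedes(tuple_xmax, maybe_needed);

	/* Freeze-path test: deleter older than OldestXmin? */
	bool	precedes_oldest_xmin = xid_precedes(tuple_xmax, oldest_xmin);

	printf("removable per GlobalVisState: %s\n", removable_per_vistest ? "yes" : "no");	/* no  */
	printf("xmax precedes OldestXmin:     %s\n", precedes_oldest_xmin ? "yes" : "no");	/* yes */

	/*
	 * Pruning therefore keeps the tuple as HEAPTUPLE_RECENTLY_DEAD, while the
	 * freeze checks see a committed xmax older than OldestXmin on a tuple
	 * they expected pruning to have removed: "cannot freeze committed xmax".
	 */
	return 0;
}

The attached fix closes this window by treating any tuple whose xmax precedes OldestXmin as HEAPTUPLE_DEAD during vacuum's pruning, regardless of what GlobalVisState says.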
I make sure that the standby reconnects between vacuum_get_cutoffs() and pruning because I have a cursor on the page keeping VACUUM FREEZE from getting a cleanup lock. See the repro for step-by-step explanations of how it works.

I have a modified version of this that repros the infinite loop on 14-16 with substantially less data. See it here [2]. Also, the repro attached to this mail won't work on 14 and 15 because of changes to background_psql.

- Melanie

[1] https://postgr.es/m/20240415173913.4zyyrwaftujxthf2%40awork3.anarazel.de#1b216b7768b5bd577a3d3d51bd5aadee
[2] https://www.postgresql.org/message-id/CAAKRu_Y_NJzF4-8gzTTeaOuUL3CcGoXPjXcAHbTTygT8AyVqag%40mail.gmail.com

From fb74a07d98d19147556fca1dc911a22d50a496e5 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplage...@gmail.com>
Date: Thu, 20 Jun 2024 16:38:30 -0400
Subject: [PATCH v1 1/2] Repro for vacuum ERROR out trying to freeze old dead
 tuple

This repro is not stable enough to be added as a test to the regression
suite. It is for demonstration purposes only.
---
 src/test/recovery/t/099_vacuum_hang.pl | 267 +++++++++++++++++++++++++
 1 file changed, 267 insertions(+)
 create mode 100644 src/test/recovery/t/099_vacuum_hang.pl

diff --git a/src/test/recovery/t/099_vacuum_hang.pl b/src/test/recovery/t/099_vacuum_hang.pl
new file mode 100644
index 00000000000..ed491dad432
--- /dev/null
+++ b/src/test/recovery/t/099_vacuum_hang.pl
@@ -0,0 +1,267 @@
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+use IPC::Run qw(pump);
+
+# Set up nodes
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 'physical');
+
+my $tablespace1 = "test_vacuum_hang_tblspc";
+
+$node_primary->append_conf(
+	'postgresql.conf', qq[
+hot_standby_feedback = on
+log_recovery_conflict_waits = true
+log_statement='all'
+log_connections=true
+log_lock_waits = true
+autovacuum = off
+maintenance_work_mem = 1024
+]);
+$node_primary->start;
+
+my $backup_name = 'my_backup';
+
+$node_primary->backup($backup_name);
+my $node_replica = PostgreSQL::Test::Cluster->new('standby');
+$node_replica->init_from_backup($node_primary, $backup_name,
+	has_streaming => 1);
+
+$node_replica->start;
+
+my $test_db = "test_db";
+$node_primary->safe_psql('postgres', "CREATE DATABASE $test_db");
+
+my $orig_conninfo = $node_primary->connstr();
+
+my $table1 = "test_vacuum_hang_table";
+my $index1 = "test_vacuum_hang_index";
+my $col1 = "col1";
+
+my $psql_timeout = IPC::Run::timer(15);
+
+# Long-running Primary Session A
+my $psql_primaryA =
+	$node_primary->background_psql($test_db, on_error_stop => 0);
+
+$psql_primaryA->query_safe("set application_name=A;");
+
+# Long-running Primary Session B
+my $psql_primaryB =
+	$node_primary->background_psql($test_db, on_error_stop => 0);
+
+$psql_primaryB->query_safe("set application_name=B;");
+
+# Long-running Replica Session A
+my $psql_replicaA =
+	$node_replica->background_psql($test_db, on_error_stop => 0);
+
+# Insert one tuple with value 1 which we can use to make sure the cursor has
+# successfully pinned and locked the buffer.
+#
+# Then insert and update enough rows that we force at least one round of index
+# vacuuming before getting to a dead tuple which was killed after the standby
+# is disconnected.
+#
+# Multiple index vacuuming passes are required to repro because after the
+# standby reconnects to the primary, our backend's GlobalVisStates will not
+# have been updated with the new horizon until an update is forced.
+#
+# _bt_pendingfsm_finalize() calls GetOldestNonRemovableTransactionId() at the
+# end of a round of index vacuuming, updating the backend's GlobalVisState
+# and, in our case, moving maybe_needed backwards.
+#
+# Then vacuum's first pass will continue and pruning will find our later
+# inserted and updated tuple HEAPTUPLE_RECENTLY_DEAD when compared to
+# maybe_needed but HEAPTUPLE_DEAD when compared to OldestXmin.
+$node_primary->safe_psql($test_db, qq[
+	CREATE TABLE ${table1}(${col1} int) with (autovacuum_enabled=false, fillfactor=10);
+	INSERT INTO $table1 VALUES (1);
+	INSERT INTO $table1 SELECT i & 3 FROM generate_series(2, 400000) i;
+	CREATE INDEX ${index1} on ${table1}(${col1});
+	UPDATE $table1 SET $col1 = 0 WHERE $col1 = 3;
+	UPDATE $table1 SET $col1 = 3 WHERE $col1 = 0;
+	UPDATE $table1 SET $col1 = 0 WHERE $col1 = 3;
+	UPDATE $table1 SET $col1 = 3 WHERE $col1 = 0;
+	UPDATE $table1 SET $col1 = 0 WHERE $col1 = 3;
+	UPDATE $table1 SET $col1 = 0 WHERE $col1 = 3;
+]);
+
+my $primary_lsn = $node_primary->lsn('flush');
+$node_primary->wait_for_catchup($node_replica, 'replay', $primary_lsn);
+
+my $walreceiver_pid = $node_replica->safe_psql($test_db, qq[
+	select pid from pg_stat_activity where backend_type = 'walreceiver';]);
+
+# Set primary_conninfo to something invalid on the replica and reload the
+# config. This will prevent the standby from reconnecting once the connection
+# is terminated. Then terminate the wal receiver. When a new WAL receiver
+# process starts up, it has to use the primary_conninfo to connect to the
+# primary and it will be unable to do so.
+$node_replica->safe_psql($test_db, qq[
+	ALTER SYSTEM SET primary_conninfo = '';
+	SELECT pg_reload_conf();
+	SELECT pg_terminate_backend($walreceiver_pid)]);
+
+# Ensure the WAL receiver is no longer active on replica.
+$node_replica->poll_query_until($test_db, qq[
+	select exists (select * from pg_stat_activity where pid = $walreceiver_pid);] , 'f');
+
+# DECLARE and use a cursor on standby, causing the block of the relation to be
+# pinned and locked in a buffer on the replica. It is important that this is
+# after termination of the WAL receiver so that the primary does not know about
+# the cursor and it can't hold back the horizon on the primary.
+my $replica_cursor1 = "test_vacuum_hang_cursor_replica";
+
+# FETCH FORWARD should return a 1. That's how we know the cursor has a pin and
+# lock.
+$psql_replicaA->query_until(
+	qr/^1$/m,
+	qq[
+	BEGIN;
+	DECLARE $replica_cursor1 CURSOR FOR SELECT $col1 FROM $table1;
+	FETCH FORWARD FROM $replica_cursor1;
+	]
+	);
+
+# Now insert and update a tuple which will be visible to the vacuum on the
+# primary but which will have xmax newer than the standby that was recently
+# disconnected.
+my $res = $psql_primaryA->query_safe(
+	qq[
+	INSERT INTO $table1 VALUES (99);
+	UPDATE $table1 SET $col1 = 100 WHERE $col1 = 99;
+	SELECT 'after_update';
+	]
+	);
+
+## Make sure the UPDATE finished
+like($res, qr/^after_update$/m, "UPDATE occurred on primary session A");
+
+# Open a cursor on the primary whose lock will keep VACUUM from getting a
+# cleanup lock on the first page of the relation. We want VACUUM to be able to
+# start, calculate initial values for OldestXmin and GlobalVisState and then be
+# unable to proceed with pruning our dead tuples. This will allow us to
+# reconnect the standby and push the horizon back before we start actual
+# pruning and vacuuming.
+my $primary_cursor1 = "test_vacuum_hang_cursor_primary";
+
+# FETCH FORWARD should return a 1. That's how we know the cursor has a pin and
+# lock.
+$res = $psql_primaryB->query_until(
+	qr/^1$/m,
+	qq[
+	BEGIN;
+	DECLARE $primary_cursor1 CURSOR FOR SELECT $col1 FROM $table1;
+	FETCH FORWARD FROM $primary_cursor1;
+	]
+	);
+
+my $vac_count_before = $node_primary->safe_psql($test_db,
+	qq[
+	SELECT vacuum_count
+	FROM pg_stat_all_tables WHERE relname = '${table1}';
+	]);
+
+# Now start a VACUUM FREEZE on the primary. It will be unable to get a cleanup
+# lock and start pruning, so it will hang. It will call
+# vacuum_get_cutoffs() and establish values of OldestXmin and
+# GlobalVisState which are newer than all of our dead tuples.
+$psql_primaryA->{stdin} .= qq[
+	VACUUM FREEZE $table1;
+	\\echo VACUUM
+	];
+
+$psql_primaryA->{run}->pump_nb();
+
+# Make sure that the VACUUM has already called vacuum_get_cutoffs() and is just
+# waiting on the cleanup lock to start vacuuming. pg_stat_progress_vacuum will
+# have been updated to phase 'scanning heap'. We don't want the standby to
+# re-establish a connection to the primary and push the horizon back until
+# we've saved initial values in GlobalVisState and calculated OldestXmin.
+# If for some reason this doesn't work, I found that throwing in a sleep helps.
+$node_primary->poll_query_until($test_db,
+	qq[ SELECT count(*) >= 1
+	FROM pg_stat_progress_vacuum
+	WHERE phase = 'scanning heap' AND
+	'${table1}'::regclass::oid = relid;
+	] ,
+	't');
+
+# Ensure the WAL receiver is still not active on the replica.
+$node_replica->poll_query_until($test_db, qq[
+	select exists (select * from pg_stat_wal_receiver);] , 'f');
+
+# Allow the WAL receiver connection to re-establish.
+$node_replica->safe_psql(
+	$test_db, qq[
+	ALTER SYSTEM SET primary_conninfo = '$orig_conninfo';
+	SELECT pg_reload_conf();
+	]);
+
+$node_replica->poll_query_until($test_db, qq[
+	select exists (select * from pg_stat_wal_receiver);] , 't');
+
+# Once the WAL sender is shown on the primary, the replica should have
+# connected with the primary and pushed the horizon backward. Primary Session A
+# won't see that until the VACUUM FREEZE proceeds and does its first round of
+# index vacuuming.
+$node_primary->poll_query_until($test_db, qq[
+	select exists (select * from pg_stat_replication);] , 't');
+
+# Commit the cursor so that the VACUUM can proceed.
+$psql_primaryB->query_until(
+	qr/^commit$/m,
+	qq[
+	COMMIT;
+	\\echo commit
+	]
+	);
+
+$psql_primaryA->{run}->pump_nb();
+
+# VACUUM proceeds with pruning and does a visibility check on each tuple. It
+# will find our final dead tuple non-removable (HEAPTUPLE_RECENTLY_DEAD) since
+# its xmax is after the new value of maybe_needed. Without the fix, after
+# pruning, in lazy_scan_prune(), vacuum does another visibility check, this
+# time with HeapTupleSatisfiesVacuum() which compares dead_after to OldestXmin.
+# It will find the tuple HEAPTUPLE_DEAD since its xmax precedes OldestXmin.
+# This will cause the infinite loop.
+pump $psql_primaryA->{run} until ($psql_primaryA->{stdout} =~ /VACUUM/ || $psql_timeout->is_expired);
+
+ok(!$psql_timeout->is_expired);
+
+my $vac_count_after = $node_primary->safe_psql($test_db,
+	qq[
+	SELECT vacuum_count
+	FROM pg_stat_all_tables WHERE relname = '${table1}';
+	]);
+
+# Ensure that the VACUUM FREEZE completed successfully
+cmp_ok($vac_count_after, '>', $vac_count_before);
+
+# Commit the original cursor transaction on the replica so it can catch up. It
+# will end up replaying the VACUUM and removing the tuple too.
+$psql_replicaA->query_safe( qq[ COMMIT; ] );
+
+$psql_replicaA->{run}->pump_nb();
+
+$primary_lsn = $node_primary->lsn('flush');
+
+# Make sure something causes us to flush
+$node_primary->safe_psql($test_db, "INSERT INTO $table1 VALUES (1);");
+$node_primary->wait_for_catchup($node_replica, 'replay', $primary_lsn);
+
+## Shut down psqls
+$psql_primaryA->quit;
+$psql_primaryB->quit;
+$psql_replicaA->quit;
+
+$node_replica->stop();
+$node_primary->stop();
+
+done_testing();
--
2.34.1

From 03e98bcbd1e30815f2219f0d21da0da1758fe0a0 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplage...@gmail.com>
Date: Thu, 20 Jun 2024 17:21:11 -0400
Subject: [PATCH v1 2/2] Ensure vacuum removes all dead tuples older than
 OldestXmin

If vacuum fails to remove a tuple with xmax older than
VacuumCutoffs->OldestXmin and younger than GlobalVisState->maybe_needed,
it will ERROR out when determining whether or not to freeze the tuple
with "cannot freeze committed xmax". Fix this by having vacuum always
remove tuples older than OldestXmin.

Further explanation:

Pruning uses the GlobalVisState (via GlobalVisTestIsRemovableXid()) to
determine which dead tuples are removable. Pruning initiated by vacuum
also considers whether or not tuples should be frozen. Tuples which are
HEAPTUPLE_RECENTLY_DEAD may be frozen if their xmax is older than
VacuumCutoffs->OldestXmin -- which is determined at the beginning of
vacuuming the relation.

For vacuum, the GlobalVisState is updated at the beginning of vacuuming
the relation -- at the same time and with the same value as
VacuumCutoffs->OldestXmin.

A backend's GlobalVisState may be updated when it is accessed if a new
snapshot is taken or if something caused ComputeXidHorizons() to be
called. This can happen during vacuum at the end of a round of index
vacuuming when GetOldestNonRemovableTransactionId() is called.

If a disconnected standby with a running transaction older than
VacuumCutoffs->OldestXmin reconnects to the primary after vacuum
initially calculates GlobalVisState and OldestXmin but before
GlobalVisState is updated, the value of GlobalVisState->maybe_needed
could go backwards.

If this happens in the middle of vacuum's first pruning and freezing
pass, it is possible that pruning/freezing could then encounter a tuple
whose xmax is younger than GlobalVisState->maybe_needed and older than
VacuumCutoffs->OldestXmin. heap_prune_satisfies_vacuum() would deem the
tuple HEAPTUPLE_RECENTLY_DEAD and would not remove it. But the
heap_pre_freeze_checks() would ERROR out with "cannot freeze committed
xmax". This check is to avoid ever freezing dead tuples.

In back branches starting with 14, failing to remove tuples older than
OldestXmin during pruning caused vacuum to infinitely loop in
lazy_scan_prune(), as investigated on this [1] thread. After 1ccc1e05ae
removed the retry loop in lazy_scan_prune() and stopped comparing tuples
to OldestXmin, the hang could no longer happen, but we could still
attempt to freeze dead tuples with xmax older than OldestXmin --
resulting in an ERROR.

Fix this by always removing dead tuples with xmax older than
VacuumCutoffs->OldestXmin. This is okay because the standby won't replay
the tuple removal until that tuple is removable on the standby. Thus,
the worst that can happen is a recovery conflict.

[1] https://postgr.es/m/20240415173913.4zyyrwaftujxthf2%40awork3.anarazel.de#1b216b7768b5bd577a3d3d51bd5aadee
---
 src/backend/access/heap/pruneheap.c  | 22 +++++++++++++++++++++-
 src/backend/access/heap/vacuumlazy.c | 14 +++++++-------
 2 files changed, 28 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 3cdfc5b7f1b..72e5fe6dad8 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -325,6 +325,8 @@ heap_page_prune_opt(Relation relation, Buffer buffer)
  *
  * cutoffs contains the freeze cutoffs, established by VACUUM at the beginning
  * of vacuuming the relation. Required if HEAP_PRUNE_FREEZE option is set.
+ * cutoffs->OldestXmin is also used to determine if dead tuples are
+ * HEAPTUPLE_RECENTLY_DEAD or HEAPTUPLE_DEAD.
  *
  * presult contains output parameters needed by callers, such as the number of
  * tuples removed and the offsets of dead items on the page after pruning.
@@ -922,8 +924,26 @@ heap_prune_satisfies_vacuum(PruneState *prstate, HeapTuple tup, Buffer buffer)
 	if (res != HEAPTUPLE_RECENTLY_DEAD)
 		return res;
 
+	/*
+	 * For VACUUM, we must be sure to prune tuples whose xmax is older than
+	 * OldestXmin -- a visibility cutoff determined at the beginning of
+	 * vacuuming the relation. OldestXmin is used to decide whether or not to
+	 * freeze a tuple, and we should not freeze dead tuples.
+	 */
+	if (prstate->cutoffs &&
+		TransactionIdIsValid(prstate->cutoffs->OldestXmin) &&
+		NormalTransactionIdPrecedes(dead_after, prstate->cutoffs->OldestXmin))
+		return HEAPTUPLE_DEAD;
+
+	/*
+	 * HOT pruning uses the GlobalVisState to determine if the tuple is dead.
+	 * And for vacuum, even if the tuple's xmax is not older than OldestXmin,
+	 * GlobalVisTestIsRemovableXid() could find the row dead if the
+	 * GlobalVisState has been updated since the beginning of vacuuming the
+	 * relation.
+	 */
 	if (GlobalVisTestIsRemovableXid(prstate->vistest, dead_after))
-		res = HEAPTUPLE_DEAD;
+		return HEAPTUPLE_DEAD;
 
 	return res;
 }
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 3f88cf1e8ef..abb47ae5960 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -438,13 +438,13 @@ heap_vacuum_rel(Relation rel, VacuumParams *params,
 	 * as an upper bound on the XIDs stored in the pages we'll actually scan
 	 * (NewRelfrozenXid tracking must never be allowed to miss unfrozen XIDs).
 	 *
-	 * Next acquire vistest, a related cutoff that's used in pruning. We
-	 * expect vistest will always make heap_page_prune_and_freeze() remove any
-	 * deleted tuple whose xmax is < OldestXmin. lazy_scan_prune must never
-	 * become confused about whether a tuple should be frozen or removed. (In
-	 * the future we might want to teach lazy_scan_prune to recompute vistest
-	 * from time to time, to increase the number of dead tuples it can prune
-	 * away.)
+	 * Next acquire vistest, a related cutoff that's used in pruning. We use
+	 * vistest in combination with OldestXmin to ensure that
+	 * heap_page_prune_and_freeze() always removes any deleted tuple whose
+	 * xmax is < OldestXmin. lazy_scan_prune must never become confused about
+	 * whether a tuple should be frozen or removed. (In the future we might
+	 * want to teach lazy_scan_prune to recompute vistest from time to time,
+	 * to increase the number of dead tuples it can prune away.)
 	 */
 	vacrel->aggressive = vacuum_get_cutoffs(rel, params, &vacrel->cutoffs);
 	vacrel->rel_pages = orig_rel_pages = RelationGetNumberOfBlocks(rel);
--
2.34.1