> > That's a good point. I agree these new cases are very close to what we
> > already have on HEAD. I can rework this to avoid duplication by
> > grouping them with the existing tests.
>
> Thanks. I am not sure if I will be able to follow up the work of this
> thread for this commit fest, so I'll probably consider more
> improvements in this area once v20 opens for business around the end
> of June.

I went ahead and reworked this to avoid duplication by grouping the new coverage into the existing tests that already carry most of the setup.

For the missing-redo path, as discussed above, I added the backup_label variant to t/050_redo_segment_missing.pl. 050 already has the injection-point setup needed to split redo and checkpoint records across WAL segments, so reusing it avoids duplicating that orchestration elsewhere.

For the missing-checkpoint path, I added the backup_label variant to t/042_low_level_backup.pl. 042 already owns the low-level-backup + backup_label setup, so it seemed like the natural place.

052 remains focused on the no-backup_label checkpoint-missing path.

So the intent was to keep each new case close to the existing test that already has most of the required setup, and minimize duplication.

Kindly review and share your feedback.

Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
Microsoft

From cf61831e64b97c3fc5a318426fe637488aaf1606 Mon Sep 17 00:00:00 2001
From: Nitin Jadhav <[email protected]>
Date: Fri, 26 Jun 2026 06:12:09 +0000
Subject: [PATCH] test: cover backup_label missing-checkpoint and missing-redo path

Extend recovery TAP coverage for backup_label-driven startup failures.

In t/042_low_level_backup.pl, add a backup_label case where the
checkpoint segment is missing and verify startup fails with the expected
FATAL.

In t/050_redo_segment_missing.pl, add a backup_label case for a missing
redo location referenced by a checkpoint record, using a deterministic
checkpoint/redo segment split and low-level backup setup.

Keep no-backup_label checkpoint-missing coverage in
t/052_checkpoint_segment_missing.pl, and place backup_label-specific
coverage in t/042_low_level_backup.pl and t/050_redo_segment_missing.pl
to avoid duplicating scope.
---
 src/test/recovery/t/042_low_level_backup.pl   |  22 ++
 .../recovery/t/050_redo_segment_missing.pl    | 208 +++++++++++++-----
 2 files changed, 181 insertions(+), 49 deletions(-)

diff --git a/src/test/recovery/t/042_low_level_backup.pl b/src/test/recovery/t/042_low_level_backup.pl
index df4ae029fe6..0dfb3473204 100644
--- a/src/test/recovery/t/042_low_level_backup.pl
+++ b/src/test/recovery/t/042_low_level_backup.pl
@@ -79,6 +79,10 @@ copy($node_primary->data_dir . '/global/pg_control',
 my $stop_segment_name = $node_primary->safe_psql('postgres',
 	'SELECT pg_walfile_name(pg_current_wal_lsn())');
 
+# Save the segment holding the latest checkpoint record from pg_control.
+my $checkpoint_segment_name = $node_primary->safe_psql('postgres',
+	'SELECT pg_walfile_name(checkpoint_lsn) FROM pg_control_checkpoint()');
+
 # Stop backup and get backup_label, the last segment is archived.
 my $backup_label =
   $psql->query_safe("select labelfile from pg_backup_stop()");
@@ -141,4 +145,22 @@ is($node_replica->safe_psql('postgres', $canary_query),
 ok($node_replica->log_contains('starting backup recovery with redo LSN'),
 	'verify backup recovery performed with backup_label');
 
+# Recover with backup_label and with the checkpoint segment removed.
+# Startup must end with FATAL on missing required checkpoint record.
+$node_replica = PostgreSQL::Test::Cluster->new('replica_missing_checkpoint');
+$node_replica->init_from_backup($node_primary, $backup_name);
+$node_replica->append_conf('postgresql.conf', "archive_mode = off");
+
+if (-e $node_replica->data_dir . "/pg_wal/$checkpoint_segment_name")
+{
+	unlink($node_replica->data_dir . "/pg_wal/$checkpoint_segment_name")
+	  or BAIL_OUT("unable to unlink $checkpoint_segment_name");
+}
+
+is($node_replica->start(fail_ok => 1), 0,
+	'startup fails when checkpoint record WAL is missing');
+
+ok($node_replica->log_contains('FATAL: could not locate required checkpoint record at'),
+	'ends with FATAL for missing required checkpoint record');
+
 done_testing();
diff --git a/src/test/recovery/t/050_redo_segment_missing.pl b/src/test/recovery/t/050_redo_segment_missing.pl
index e07ff0c72fe..29ea369eaef 100644
--- a/src/test/recovery/t/050_redo_segment_missing.pl
+++ b/src/test/recovery/t/050_redo_segment_missing.pl
@@ -6,6 +6,7 @@
 
 use strict;
 use warnings FATAL => 'all';
+use File::Copy qw(copy);
 use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
@@ -29,64 +30,78 @@ if (!$node->check_extension('injection_points'))
 }
 $node->safe_psql('postgres', q(CREATE EXTENSION injection_points));
 
-# Note that this uses two injection points based on waits, not one. This
-# may look strange, but this works as a workaround to enforce all memory
-# allocations to happen outside the critical section of the checkpoint
-# required for this test.
-# First, "create-checkpoint-initial" is run outside the critical section
-# section, and is used as a way to initialize the shared memory required
-# for the wait machinery with its DSM registry.
-# Then, "create-checkpoint-run" is loaded outside the critical section of
-# a checkpoint to allocate any memory required by the library load, and
-# its callback is run inside the critical section.
-$node->safe_psql('postgres',
-	q{select injection_points_attach('create-checkpoint-initial', 'wait')});
-$node->safe_psql('postgres',
-	q{select injection_points_attach('create-checkpoint-run', 'wait')});
-
-# Start a psql session to run the checkpoint in the background and make
-# the test wait on the injection point so the checkpoint stops just after
-# it starts.
-my $checkpoint = $node->background_psql('postgres');
-$checkpoint->query_until(
-	qr/starting_checkpoint/,
-	q(\echo starting_checkpoint
+sub run_split_checkpoint
+{
+	my ($node) = @_;
+
+	# Note that this uses two injection points based on waits, not one. This
+	# may look strange, but this works as a workaround to enforce all memory
+	# allocations to happen outside the critical section of the checkpoint
+	# required for this test.
+	# First, "create-checkpoint-initial" is run outside the critical section,
+	# and is used as a way to initialize the shared memory required for the wait
+	# machinery with its DSM registry.
+	# Then, "create-checkpoint-run" is loaded outside the critical section of a
+	# checkpoint to allocate any memory required by the library load, and its
+	# callback is run inside the critical section.
+	$node->safe_psql('postgres',
+		q{select injection_points_attach('create-checkpoint-initial', 'wait')});
+	$node->safe_psql('postgres',
+		q{select injection_points_attach('create-checkpoint-run', 'wait')});
+
+	my $checkpoint = $node->background_psql('postgres');
+	$checkpoint->query_until(
+		qr/starting_checkpoint/,
+		q(\echo starting_checkpoint
 checkpoint;
 ));
 
-# Wait for the initial point to finish, the checkpointer is still
-# outside its critical section. Then release to reach the second
-# point.
-$node->wait_for_event('checkpointer', 'create-checkpoint-initial');
-$node->safe_psql('postgres',
-	q{select injection_points_wakeup('create-checkpoint-initial')});
+	# Wait for the initial point to finish, the checkpointer is still outside
+	# its critical section. Then release to reach the second point.
+	$node->wait_for_event('checkpointer', 'create-checkpoint-initial');
+	$node->safe_psql('postgres',
+		q{select injection_points_wakeup('create-checkpoint-initial')});
+
+	# Wait until the checkpoint has reached the second injection point. We are
+	# now in the middle of a checkpoint running, after the redo record has been
+	# logged.
+	$node->wait_for_event('checkpointer', 'create-checkpoint-run');
 
-# Wait until the checkpoint has reached the second injection point.
-# We are now in the middle of a checkpoint running, after the redo
-# record has been logged.
-$node->wait_for_event('checkpointer', 'create-checkpoint-run');
+	# Split redo and checkpoint records across WAL segments.
+	$node->safe_psql('postgres', 'SELECT pg_switch_wal()');
 
-# Switch the WAL segment, ensuring that the redo record will be included
-# in a different segment than the checkpoint record.
-$node->safe_psql('postgres', 'SELECT pg_switch_wal()');
+	# Continue the checkpoint and wait for its completion.
+	my $log_offset = -s $node->logfile;
+	$node->safe_psql('postgres',
+		q{select injection_points_wakeup('create-checkpoint-run')});
+	$node->wait_for_log(qr/checkpoint complete/, $log_offset);
 
-# Continue the checkpoint and wait for its completion.
-my $log_offset = -s $node->logfile;
-$node->safe_psql('postgres',
-	q{select injection_points_wakeup('create-checkpoint-run')});
-$node->wait_for_log(qr/checkpoint complete/, $log_offset);
+	$checkpoint->quit;
+}
 
-$checkpoint->quit;
+sub checkpoint_wal_info
+{
+	my ($node) = @_;
+
+	# Capture the redo/checkpoint LSNs and their segment names from pg_control.
+	my $redo_lsn = $node->safe_psql('postgres',
+		"SELECT redo_lsn FROM pg_control_checkpoint()");
+	my $redo_walfile_name =
+	  $node->safe_psql('postgres', "SELECT pg_walfile_name('$redo_lsn')");
+	my $checkpoint_lsn = $node->safe_psql('postgres',
+		"SELECT checkpoint_lsn FROM pg_control_checkpoint()");
+	my $checkpoint_walfile_name =
+	  $node->safe_psql('postgres', "SELECT pg_walfile_name('$checkpoint_lsn')");
+
+	return ($redo_lsn, $redo_walfile_name, $checkpoint_lsn,
+		$checkpoint_walfile_name);
+}
+
+run_split_checkpoint($node);
 
 # Retrieve the WAL file names for the redo record and checkpoint record.
-my $redo_lsn = $node->safe_psql('postgres',
-	"SELECT redo_lsn FROM pg_control_checkpoint()");
-my $redo_walfile_name =
-  $node->safe_psql('postgres', "SELECT pg_walfile_name('$redo_lsn')");
-my $checkpoint_lsn = $node->safe_psql('postgres',
-	"SELECT checkpoint_lsn FROM pg_control_checkpoint()");
-my $checkpoint_walfile_name =
-  $node->safe_psql('postgres', "SELECT pg_walfile_name('$checkpoint_lsn')");
+my ($redo_lsn, $redo_walfile_name, $checkpoint_lsn,
+	$checkpoint_walfile_name) = checkpoint_wal_info($node);
 
 # Redo record and checkpoint record should be on different segments.
 isnt($redo_walfile_name, $checkpoint_walfile_name,
@@ -114,4 +129,99 @@ ok( $logfile =~
 	  qr/FATAL: .* could not find redo location .* referenced by checkpoint record at .*/,
 	"ends with FATAL because it could not find redo location");
 
+# Repeat in backup_label-present path by creating a low-level backup from a
+# dedicated node and forcing redo to different segments during CHECKPOINT
+# while backup is in progress.
+my $node_bkp = PostgreSQL::Test::Cluster->new('testnode_backup');
+$node_bkp->init(has_archiving => 1, allows_streaming => 1);
+$node_bkp->append_conf('postgresql.conf', 'log_checkpoints = on');
+$node_bkp->start;
+$node_bkp->safe_psql('postgres', q(CREATE EXTENSION injection_points));
+
+my $backup_name = 'backup_redo_missing';
+my $psql = $node_bkp->background_psql('postgres');
+
+$psql->query_safe("SET client_min_messages TO WARNING");
+$psql->set_query_timer_restart;
+$psql->query_safe("select pg_backup_start('test label')");
+
+# Force backup start WAL location and follow-up checkpoint to land on different
+# segments for deterministic missing-redo behavior.
+$node_bkp->safe_psql('postgres', 'SELECT pg_switch_wal()');
+
+my $backup_dir = $node_bkp->backup_dir . '/' . $backup_name;
+PostgreSQL::Test::RecursiveCopy::copypath($node_bkp->data_dir, $backup_dir);
+
+# Remove runtime files and pg_control from the copied backup. pg_control is
+# copied again later after the forced checkpoint to keep recovery deterministic.
+unlink("$backup_dir/postmaster.pid")
+  or BAIL_OUT("unable to unlink $backup_dir/postmaster.pid");
+unlink("$backup_dir/postmaster.opts")
+  or BAIL_OUT("unable to unlink $backup_dir/postmaster.opts");
+unlink("$backup_dir/global/pg_control")
+  or BAIL_OUT("unable to unlink $backup_dir/global/pg_control");
+
+run_split_checkpoint($node_bkp);
+
+copy($node_bkp->data_dir . '/global/pg_control',
+	"$backup_dir/global/pg_control")
+  or BAIL_OUT("unable to copy global/pg_control");
+
+my $backup_label =
+  $psql->query_safe("select labelfile from pg_backup_stop()");
+$psql->quit;
+
+my ($redo_lsn_bkp, $redo_walfile_name_bkp, $checkpoint_lsn_bkp,
+	$checkpoint_walfile_name_bkp) = checkpoint_wal_info($node_bkp);
+
+isnt($redo_walfile_name_bkp, $checkpoint_walfile_name_bkp,
+	'redo and checkpoint records on different segments with backup_label');
+
+# Rewrite backup_label so recovery follows the deterministic LSN pair created
+# above rather than whatever pg_backup_stop() reported originally.
+$backup_label =~ s/^START WAL LOCATION: .*$/START WAL LOCATION: $redo_lsn_bkp (file $redo_walfile_name_bkp)/m;
+$backup_label =~ s/^CHECKPOINT LOCATION: .*$/CHECKPOINT LOCATION: $checkpoint_lsn_bkp/m;
+
+open my $fh, '>>', "$backup_dir/backup_label"
+  or die "could not open backup_label";
+binmode $fh;
+print $fh $backup_label;
+close $fh;
+
+# Restore the prepared low-level backup into a separate node that will be used
+# only to exercise startup failure.
+my $node_bkp_reco = PostgreSQL::Test::Cluster->new('testnode_backup_reco');
+$node_bkp_reco->init_from_backup($node_bkp, $backup_name);
+$node_bkp_reco->append_conf('postgresql.conf', "archive_mode = off");
+
+# Keep checkpoint WAL available to ensure startup reaches redo verification.
+my $checkpoint_src = $node_bkp->data_dir . "/pg_wal/$checkpoint_walfile_name_bkp";
+$checkpoint_src = $node_bkp->archive_dir . "/$checkpoint_walfile_name_bkp"
+  unless -e $checkpoint_src;
+
+copy($checkpoint_src,
+	$node_bkp_reco->data_dir . "/pg_wal/$checkpoint_walfile_name_bkp")
+  or die "could not copy checkpoint WAL file: $!";
+
+# Remove only redo WAL to force the missing-redo failure while leaving the
+# checkpoint segment available.
+if (-e $node_bkp_reco->data_dir . "/pg_wal/$redo_walfile_name_bkp")
+{
+	unlink $node_bkp_reco->data_dir . "/pg_wal/$redo_walfile_name_bkp"
+	  or die "could not remove local WAL file: $!";
+}
+
+run_log(
+	[
+		'pg_ctl',
+		'--pgdata' => $node_bkp_reco->data_dir,
+		'--log' => $node_bkp_reco->logfile,
+		'start',
+	]);
+
+my $logfile_bkp = slurp_file($node_bkp_reco->logfile());
+ok( $logfile_bkp =~
+	  qr/FATAL: .* could not find redo location .* referenced by checkpoint record at .*/,
+	"ends with FATAL because it could not find redo location with backup_label");
+
 done_testing();
-- 
2.43.0
