On Mon, Jun 30, 2025 at 05:51:51PM +0530, vignesh C wrote:
> Thanks, Kuroda-san. I’ve prepared a similar test that doesn’t rely on
> injection points. The issue reproduced consistently across all
> branches up to PG13. You can use the attached
> 049_slot_get_changes_wait_continously_pg17.pl script (found in the
> 049_slot_get_changes_wait_continously_pg17.zip file) to verify this.
> Just copy the script to src/test/recovery and run the test to observe
> the problem.

+# Abruptly stop the server (1 second should be enough for the checkpoint
+# to finish; it would be better).
+$node->stop('immediate');

While v3-0001 is good enough to reproduce the original issue after a
few dozen runs here, we cannot reintroduce the test without the proper
wait_for_log() logic making sure that the checkpoint is completed
before we stop the server.

I like the addition of an extra pg_logical_emit_message() in test 046
anyway, down to v17, for all the branches.  Even if the reproduction is
sporadic, we have seen it pretty quickly in the CI and in the buildfarm,
so it would not go unnoticed for long if we mess up this stuff again.

We are lucky enough that the existing test does not fail in v17, TBH,
so let's tackle all the issues on this thread step by step:
1) Let's add the header check fix.
2) Let's reintroduce a fixed version of 046 on HEAD and on v18, with
an extra pg_logical_emit_message() that travels across WAL pages
forcing a reconstruction of the record and the extra header check.
3) Let's fix recovery test 046 currently in v17, for the checkpoint
wait logic and the extra pg_logical_emit_message().
4) Let's improve the stability of 047 everywhere for the checkpoint
wait logic, as already done by Alexander.

I have been doing some tests with the patch vs HEAD (thanks for the
script, it has saved some time), and I can summarize my results by
using a message size of 819200000, hence worth 100k pages of WAL or
so.  Then, taking 20 runs and discarding the three highest and three
lowest numbers to reduce the variance, I am getting an average runtime
when re-assembling the record of:
- 695473.09us for HEAD
- 695679.18us for the patch.
So I am not going to stress more on this point.
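
As a quick sanity check of the numbers: 819200000 bytes divided by the
default XLOG_BLCKSZ of 8192 is indeed 100000 pages.  The trimmed average
used above can be sketched as follows (just an illustration of the
methodology, not the benchmarking script itself):

```python
def trimmed_mean(samples_us, trim=3):
    """Average after dropping the `trim` highest and `trim` lowest
    values, to cut down on run-to-run variance."""
    if len(samples_us) <= 2 * trim:
        raise ValueError("not enough samples to trim")
    kept = sorted(samples_us)[trim:-trim]
    return sum(kept) / len(kept)

# With 20 runs, 14 samples remain after trimming three from each end.
```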

Attached is the result of all that for HEAD.  There is also one extra
patch dedicated to v17, where I have checked that the extra
pg_logical_emit_message() is enough to reproduce the problem without
the patch, and that the problem is fixed with the patch.  The patch
set for v17 is simpler of course, as test 046 still exists on
REL_17_STABLE.  Note that I have moved this extra contrecord to be
generated before the checkpoint is completed, leading to the same
result.

With all that said, I'll move on with this stuff once the embargo for
v18 beta2 is lifted and the tag is pushed.  That should happen in 24h
or so, I guess.
--
Michael
From 4c920861f07e0f4e6e21e24e5f6a9060054c3232 Mon Sep 17 00:00:00 2001
From: Michael Paquier <mich...@paquier.xyz>
Date: Tue, 15 Jul 2025 12:01:01 +0900
Subject: [PATCH v4 1/3] Fix infinite wait when reading partially-written WAL
 record

If a crash occurs while writing a WAL record that spans multiple pages,
the recovery process marks the page with the
XLP_FIRST_IS_OVERWRITE_CONTRECORD flag.  However, logical decoding
currently attempts to read the full WAL record based on its expected
size before checking this flag, which can lead to an infinite wait if
the remaining data is never written (e.g., no activity after crash).

This patch updates the logic to first read the page header and check for
the XLP_FIRST_IS_OVERWRITE_CONTRECORD flag before attempting to
reconstruct the full WAL record.  If the flag is set, decoding correctly
identifies the record as incomplete and avoids waiting for WAL data that
will never arrive.
---
 src/backend/access/transam/xlogreader.c | 20 +++++++++++++++-----
 1 file changed, 15 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index ac1f801b1eb0..2e5361ff4fdd 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -723,11 +723,12 @@ restart:
 			/* Calculate pointer to beginning of next page */
 			targetPagePtr += XLOG_BLCKSZ;
 
-			/* Wait for the next page to become available */
-			readOff = ReadPageInternal(state, targetPagePtr,
-									   Min(total_len - gotlen + SizeOfXLogShortPHD,
-										   XLOG_BLCKSZ));
-
+			/*
+			 * Read the page header before processing the record data, to be
+			 * able to handle the case where the previous record ended as
+			 * being a partial one.
+			 */
+			readOff = ReadPageInternal(state, targetPagePtr, SizeOfXLogShortPHD);
 			if (readOff == XLREAD_WOULDBLOCK)
 				return XLREAD_WOULDBLOCK;
 			else if (readOff < 0)
@@ -776,6 +777,15 @@ restart:
 				goto err;
 			}
 
+			/* Wait for the next page to become available */
+			readOff = ReadPageInternal(state, targetPagePtr,
+									   Min(total_len - gotlen + SizeOfXLogShortPHD,
+										   XLOG_BLCKSZ));
+			if (readOff == XLREAD_WOULDBLOCK)
+				return XLREAD_WOULDBLOCK;
+			else if (readOff < 0)
+				goto err;
+
 			/* Append the continuation from this page to the buffer */
 			pageHeaderSize = XLogPageHeaderSize(pageHeader);
 
-- 
2.50.0

From 68fab6348a60320ff4dda943d69dc3388a5754e2 Mon Sep 17 00:00:00 2001
From: Michael Paquier <mich...@paquier.xyz>
Date: Tue, 15 Jul 2025 12:02:08 +0900
Subject: [PATCH v4 2/3] Improve stability of recovery test
 047_checkpoint_physical_slot

Author: Alexander Korotkov <akorot...@postgresql.org>
Backpatch-through: 17
---
 src/test/recovery/t/047_checkpoint_physical_slot.pl | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/src/test/recovery/t/047_checkpoint_physical_slot.pl b/src/test/recovery/t/047_checkpoint_physical_slot.pl
index a1332b5d44cb..9e98383e30ea 100644
--- a/src/test/recovery/t/047_checkpoint_physical_slot.pl
+++ b/src/test/recovery/t/047_checkpoint_physical_slot.pl
@@ -94,9 +94,11 @@ $node->safe_psql('postgres',
 	q{select pg_replication_slot_advance('slot_physical', pg_current_wal_lsn())}
 );
 
-# Continue the checkpoint.
+# Continue the checkpoint and wait for its completion.
+my $log_offset = -s $node->logfile;
 $node->safe_psql('postgres',
 	q{select injection_points_wakeup('checkpoint-before-old-wal-removal')});
+$node->wait_for_log(qr/checkpoint complete/, $log_offset);
 
 my $restart_lsn_old = $node->safe_psql('postgres',
 	q{select restart_lsn from pg_replication_slots where slot_name = 'slot_physical'}
@@ -104,8 +106,7 @@ my $restart_lsn_old = $node->safe_psql('postgres',
 chomp($restart_lsn_old);
 note("restart lsn before stop: $restart_lsn_old");
 
-# Abruptly stop the server (1 second should be enough for the checkpoint
-# to finish; it would be better).
+# Abruptly stop the server.
 $node->stop('immediate');
 
 $node->start;
-- 
2.50.0

From 4584664c18bb63a604c6d20eb22831ec6e59a49a Mon Sep 17 00:00:00 2001
From: Michael Paquier <mich...@paquier.xyz>
Date: Tue, 15 Jul 2025 12:07:50 +0900
Subject: [PATCH v4 3/3] Reintroduce test 046_checkpoint_logical_slot

This patch is only for HEAD and v18, where the test has been removed.
The improvements to the checkpoint completion wait and the extra
pg_logical_emit_message() call, which forces the creation of a WAL
record that spans at least two pages, need to be backpatched down to
v17 to improve the stability and coverage of the original test.

Author: Alexander Korotkov <akorot...@postgresql.org>
Co-authored-by: Hayato Kuroda <kuroda.hay...@fujitsu.com>
---
 src/test/recovery/meson.build                 |   1 +
 .../recovery/t/046_checkpoint_logical_slot.pl | 142 ++++++++++++++++++
 2 files changed, 143 insertions(+)
 create mode 100644 src/test/recovery/t/046_checkpoint_logical_slot.pl

diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 6e78ff1a030b..52993c32dbba 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -54,6 +54,7 @@ tests += {
       't/043_no_contrecord_switch.pl',
       't/044_invalidate_inactive_slots.pl',
       't/045_archive_restartpoint.pl',
+      't/046_checkpoint_logical_slot.pl',
       't/047_checkpoint_physical_slot.pl',
       't/048_vacuum_horizon_floor.pl'
     ],
diff --git a/src/test/recovery/t/046_checkpoint_logical_slot.pl b/src/test/recovery/t/046_checkpoint_logical_slot.pl
new file mode 100644
index 000000000000..bdfca3ddf826
--- /dev/null
+++ b/src/test/recovery/t/046_checkpoint_logical_slot.pl
@@ -0,0 +1,142 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+#
+# This test verifies the case when the logical slot is advanced during
+# checkpoint. The test checks that the logical slot's restart_lsn still refers
+# to an existing WAL segment after an immediate restart.
+
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+
+use Test::More;
+
+if ($ENV{enable_injection_points} ne 'yes')
+{
+	plan skip_all => 'Injection points not supported by this build';
+}
+
+my ($node, $result);
+
+$node = PostgreSQL::Test::Cluster->new('mike');
+$node->init;
+$node->append_conf('postgresql.conf', "wal_level = 'logical'");
+$node->start;
+
+# Check if the extension injection_points is available, as it may be
+# possible that this script is run with installcheck, where the module
+# would not be installed by default.
+if (!$node->check_extension('injection_points'))
+{
+	plan skip_all => 'Extension injection_points not installed';
+}
+
+$node->safe_psql('postgres', q(CREATE EXTENSION injection_points));
+
+# Create the two slots we'll need.
+$node->safe_psql('postgres',
+	q{select pg_create_logical_replication_slot('slot_logical', 'test_decoding')}
+);
+$node->safe_psql('postgres',
+	q{select pg_create_physical_replication_slot('slot_physical', true)});
+
+# Advance both slots to the current position just to have everything "valid".
+$node->safe_psql('postgres',
+	q{select count(*) from pg_logical_slot_get_changes('slot_logical', null, null)}
+);
+$node->safe_psql('postgres',
+	q{select pg_replication_slot_advance('slot_physical', pg_current_wal_lsn())}
+);
+
+# Run checkpoint to flush current state to disk and set a baseline.
+$node->safe_psql('postgres', q{checkpoint});
+
+# Generate some transactions to get RUNNING_XACTS.
+my $xacts = $node->background_psql('postgres');
+$xacts->query_until(
+	qr/run_xacts/,
+	q(\echo run_xacts
+SELECT 1 \watch 0.1
+\q
+));
+
+$node->advance_wal(20);
+
+# Run another checkpoint to set a new restore LSN.
+$node->safe_psql('postgres', q{checkpoint});
+
+$node->advance_wal(20);
+
+# Run another checkpoint, this time in the background, and make it wait
+# on the injection point, so that the checkpoint stops right before
+# removing old WAL segments.
+note('starting checkpoint');
+
+my $checkpoint = $node->background_psql('postgres');
+$checkpoint->query_safe(
+	q(select injection_points_attach('checkpoint-before-old-wal-removal','wait'))
+);
+$checkpoint->query_until(
+	qr/starting_checkpoint/,
+	q(\echo starting_checkpoint
+checkpoint;
+\q
+));
+
+# Wait until the checkpoint stops right before removing WAL segments.
+note('waiting for injection_point');
+$node->wait_for_event('checkpointer', 'checkpoint-before-old-wal-removal');
+note('injection_point is reached');
+
+# Try to advance the logical slot, but make it stop when it moves to the next
+# WAL segment (this has to happen in the background, too).
+my $logical = $node->background_psql('postgres');
+$logical->query_safe(
+	q{select injection_points_attach('logical-replication-slot-advance-segment','wait');}
+);
+$logical->query_until(
+	qr/get_changes/,
+	q(
+\echo get_changes
+select count(*) from pg_logical_slot_get_changes('slot_logical', null, null) \watch 1
+\q
+));
+
+# Wait until the slot's restart_lsn points to the next WAL segment.
+note('waiting for injection_point');
+$node->wait_for_event('client backend',
+	'logical-replication-slot-advance-segment');
+note('injection_point is reached');
+
+# OK, we're in the right situation: time to advance the physical slot, which
+# recalculates the required LSN, and then unblock the checkpoint, which
+# removes the WAL still needed by the logical slot.
+$node->safe_psql('postgres',
+	q{select pg_replication_slot_advance('slot_physical', pg_current_wal_lsn())}
+);
+
+# Generate a long WAL record, spanning at least two pages for the follow-up
+# post-recovery check.
+$node->safe_psql('postgres',
+	q{select pg_logical_emit_message(false, '', repeat('123456789', 1000))});
+
+# Continue the checkpoint and wait for its completion.
+my $log_offset = -s $node->logfile;
+$node->safe_psql('postgres',
+	q{select injection_points_wakeup('checkpoint-before-old-wal-removal')});
+$node->wait_for_log(qr/checkpoint complete/, $log_offset);
+
+# Abruptly stop the server.
+$node->stop('immediate');
+
+$node->start;
+
+eval {
+	$node->safe_psql('postgres',
+		q{select count(*) from pg_logical_slot_get_changes('slot_logical', null, null);}
+	);
+};
+is($@, '', "Logical slot still valid");
+
+done_testing();
-- 
2.50.0

From c09e67a8e074a3235499bcf891487fdc1293072d Mon Sep 17 00:00:00 2001
From: Michael Paquier <mich...@paquier.xyz>
Date: Tue, 15 Jul 2025 12:25:37 +0900
Subject: [PATCH v4] Improve recovery test 046_checkpoint_logical_slot

This commit improves 046_checkpoint_logical_slot on two aspects:
- Add one pg_logical_emit_message() call, to force the creation of a
record that spans two pages.
- Make the logic wait for the checkpoint completion.

Author: Alexander Korotkov <akorot...@postgresql.org>
Co-authored-by: Hayato Kuroda <kuroda.hay...@fujitsu.com>
---
 src/test/recovery/t/046_checkpoint_logical_slot.pl | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/src/test/recovery/t/046_checkpoint_logical_slot.pl b/src/test/recovery/t/046_checkpoint_logical_slot.pl
index 0468d4609e43..d2cf1cb4464e 100644
--- a/src/test/recovery/t/046_checkpoint_logical_slot.pl
+++ b/src/test/recovery/t/046_checkpoint_logical_slot.pl
@@ -119,12 +119,18 @@ $node->safe_psql('postgres',
 	q{select pg_replication_slot_advance('slot_physical', pg_current_wal_lsn())}
 );
 
+# Generate a long WAL record, spanning at least two pages for the follow-up
+# post-recovery check.
+$node->safe_psql('postgres',
+	q{select pg_logical_emit_message(false, '', repeat('123456789', 1000))});
+
 # Continue the checkpoint.
+my $log_offset = -s $node->logfile;
 $node->safe_psql('postgres',
 	q{select injection_points_wakeup('checkpoint-before-old-wal-removal')});
+$node->wait_for_log(qr/checkpoint complete/, $log_offset);
 
-# Abruptly stop the server (1 second should be enough for the checkpoint
-# to finish; it would be better).
+# Abruptly stop the server.
 $node->stop('immediate');
 
 $node->start;
-- 
2.50.0
