Hi,

On Sat, Feb 10, 2024 at 05:02:27PM -0800, Noah Misch wrote:
> Coincidentally, one of my buildfarm animals hanged several weeks in a
> different test, 035_standby_logical_decoding.pl.  A LOG_SNAPSHOT_INTERVAL_MS
> reduction was part of making it reproducible:
> 
> On Fri, Feb 02, 2024 at 04:01:45PM -0800, Noah Misch wrote:
> > On Fri, Feb 02, 2024 at 02:30:03PM -0800, Noah Misch wrote:
> > > On Fri, Feb 02, 2024 at 05:07:14PM -0500, Tom Lane wrote:
> > > > If you look at the buildfarm's failures page and filter down to
> > > > just subscriptionCheck failures, what you find is that all of the
> > > > last 6 such failures are in 031_column_list.pl:
> 
> > > https://www.postgresql.org/message-id/flat/16d6d9cc-f97d-0b34-be65-425183ed3721%40gmail.com
> > > reported a replacement BgWriterDelay value reproducing it.
> > 
> > Correction: the recipe changes LOG_SNAPSHOT_INTERVAL_MS in addition to
> > BgWriterDelay.
> 
> I'm reusing this thread just in case there's overlap with the
> 031_column_list.pl cause and fix.  The 035_standby_logical_decoding.pl hang is
> a race condition arising from an event sequence like this:
> 
> - Test script sends CREATE SUBSCRIPTION to subscriber, which loses the CPU.
> - Test script calls pg_log_standby_snapshot() on primary.  Emits 
> XLOG_RUNNING_XACTS.
> - checkpoint_timeout makes a primary checkpoint finish.  Emits 
> XLOG_RUNNING_XACTS.
> - bgwriter executes LOG_SNAPSHOT_INTERVAL_MS logic.  Emits XLOG_RUNNING_XACTS.
> - CREATE SUBSCRIPTION wakes up and sends CREATE_REPLICATION_SLOT to standby.
> 
> Other test code already has a solution for this, so the attached patches add a
> timeout and copy the existing solution.  I'm also attaching the hack that
> makes it 100% reproducible.

Thanks!

I did a few tests and confirm that the proposed solution fixes the corner case.

standby-slot-test-1-timeout-v1.patch LGTM.

Regarding standby-slot-test-2-race-v1.patch:

> +# See corresponding create_logical_slot_on_standby() code.
> +$node_standby->poll_query_until(
> +     'postgres', qq[
> +             SELECT restart_lsn IS NOT NULL
> +             FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub'
> +     ]) or die "timed out waiting for logical slot to calculate its 
> restart_lsn";
> +

What about creating a sub, say wait_for_restart_lsn_calculation() in Cluster.pm
and then make use of it in create_logical_slot_on_standby() and above? 
(something
like wait_for_restart_lsn_calculation-v1.patch attached).

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
commit 14d988c748f3e500c44d65e073c276e6d8af6156
Author: Bertrand Drouvot <bertranddrouvot...@gmail.com>
Date:   Wed Feb 14 15:21:35 2024 +0000

    wait_for_restart_lsn_calculation

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index e2e70d0dbf..21cf179db1 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -3174,6 +3174,28 @@ $SIG{TERM} = $SIG{INT} = sub {
 
 =pod
 
+=item $node->wait_for_restart_lsn_calculation(self, slot_name)
+
+Create logical replication slot on given standby
+
+=cut
+
+sub wait_for_restart_lsn_calculation
+{
+	my ($self, $slot_name) = @_;
+	my ($stdout, $stderr);
+
+	$self->poll_query_until(
+		'postgres', qq[
+		SELECT restart_lsn IS NOT NULL
+		FROM pg_catalog.pg_replication_slots WHERE slot_name = '$slot_name'
+	])
+	  or die
+	  "timed out waiting for logical slot to calculate its restart_lsn";
+}
+
+=pod
+
 =item $node->create_logical_slot_on_standby(self, primary, slot_name, dbname)
 
 Create logical replication slot on given standby
@@ -3203,13 +3225,7 @@ sub create_logical_slot_on_standby
 	# xl_running_xacts WAL record from the restart_lsn onwards. First wait
 	# until the slot restart_lsn is determined.
 
-	$self->poll_query_until(
-		'postgres', qq[
-		SELECT restart_lsn IS NOT NULL
-		FROM pg_catalog.pg_replication_slots WHERE slot_name = '$slot_name'
-	])
-	  or die
-	  "timed out waiting for logical slot to calculate its restart_lsn";
+	$self->wait_for_restart_lsn_calculation($slot_name);
 
 	# Then arrange for the xl_running_xacts record for which pg_recvlogical is
 	# waiting.
diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
index f7b7fc7f9e..85330720c5 100644
--- a/src/test/recovery/t/035_standby_logical_decoding.pl
+++ b/src/test/recovery/t/035_standby_logical_decoding.pl
@@ -465,12 +465,8 @@ $psql_subscriber{subscriber_stdin} .= "\n";
 
 $psql_subscriber{run}->pump_nb();
 
-# See corresponding create_logical_slot_on_standby() code.
-$node_standby->poll_query_until(
-	'postgres', qq[
-		SELECT restart_lsn IS NOT NULL
-		FROM pg_catalog.pg_replication_slots WHERE slot_name = 'tap_sub'
-	]) or die "timed out waiting for logical slot to calculate its restart_lsn";
+# Wait for restart_lsn calculation
+$node_standby->wait_for_restart_lsn_calculation('tap_sub');
 
 # Speed up the subscription creation
 $node_primary->safe_psql('postgres', "SELECT pg_log_standby_snapshot()");

Reply via email to