Re: 004_timeline_switch TAP test may fail

Sergey Tatarintsev Tue, 16 Jun 2026 17:53:31 -0700

Hi!
Thanks for the review!
v2 patch attached: the comment is changed and count(1) is replaced with
count(*).

Hi Sergey,

Thanks for the report and patch. I think the analysis is right, and the
fix is in the right place.

The gap traces back to commit 7185eddf, which deliberately dropped the
wait_for_catchup() and switched the primary from teardown_node() to a
clean stop(), on the grounds that a clean stop flushes all WAL to both
standbys before exiting. That's true, but only for standbys whose
walsender is *connected* at shutdown time -- and ->start() only waits
for the postmaster to accept connections, not for the standby's
walreceiver to have connected back to the primary. So if a standby
hasn't connected yet when the primary stops, the clean-shutdown flush
skips it, and we're back to the exact "standbys received different
amounts of WAL -> timeline fork on reconnect" failure that 7185eddf was
meant to fix.

Polling pg_stat_replication until both walsenders are present closes
that hole: it re-establishes the precondition the clean-stop design
silently assumed. And connection is enough here -- the walsender
shutdown path sends all WAL up to the shutdown checkpoint regardless of
catchup state -- so there's no need to additionally check
state = 'streaming'.

One small thing: the rest of this file uses count(*), so I'd write
count(*) = 2 rather than count(1) = 2 just for local consistency. And
the comment reads a little better as something like "Wait until both
standbys have connected to the primary", since by this point they've
already started -- what we're waiting for is the connection.

Regards,
Ewan

On Tue, Jun 16, 2026 at 4:01 PM Sergey Tatarintsev
<[email protected]> wrote:

Hi hackers!

I found that commit 7185eddf0522b3146ed1ff6e063e8e129e77c706 left a
small omission in the TAP test 004_timeline_switch:
...
my $node_standby_1 = PostgreSQL::Test::Cluster->new('standby_1');
...
$node_primary->stop;

There is no guarantee that standby_1 and standby_2 have successfully
connected to the primary and started streaming before the primary is
stopped.

I think we must ensure that the primary knows about standby_1 and
standby_2.

--
With best regards,
Sergey Tatarintsev,
PostgresPro



From dd449396060c71c34bcb03e5c1c5de0cb0868da0 Mon Sep 17 00:00:00 2001
From: Sergey Tatarintsev <[email protected]>
Date: Tue, 16 Jun 2026 11:57:39 +0700
Subject: [PATCH] Fix 004_timeline_switch TAP test: wait for standbys starts
 before primary stops

---
 src/test/recovery/t/004_timeline_switch.pl | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/src/test/recovery/t/004_timeline_switch.pl b/src/test/recovery/t/004_timeline_switch.pl
index e0b3851927c..c87f079cef8 100644
--- a/src/test/recovery/t/004_timeline_switch.pl
+++ b/src/test/recovery/t/004_timeline_switch.pl
@@ -30,6 +30,10 @@ $node_standby_2->init_from_backup($node_primary, $backup_name,
 	has_streaming => 1);
 $node_standby_2->start;
 
+# Wait until both standbys have connected to the primary
+$node_primary->poll_query_until('postgres',
+	"SELECT count(*) = 2 FROM pg_stat_replication");
+
 # Create some content on primary
 $node_primary->safe_psql('postgres',
 	"CREATE TABLE tab_int AS SELECT generate_series(1,1000) AS a");
-- 
2.43.0
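
To illustrate what the added wait does, here is a minimal, self-contained
Perl model of a poll-until-true loop. It is only a sketch: poll_until, the
faked connection counts, and the retry parameters are hypothetical
stand-ins, and it neither reproduces the actual poll_query_until
implementation nor talks to a server.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a polling helper: call $check repeatedly until
# it returns true or $max_attempts is exhausted.
# Returns 1 on success, 0 on timeout.
sub poll_until
{
	my ($check, $max_attempts, $sleep_seconds) = @_;
	for my $attempt (1 .. $max_attempts)
	{
		return 1 if $check->();
		# Sub-second sleep between attempts using core Perl only.
		select(undef, undef, undef, $sleep_seconds);
	}
	return 0;
}

# Faked "SELECT count(*) FROM pg_stat_replication" results, one per poll:
# the standbys' walreceivers connect some time after ->start returns.
my @connected_per_poll = (0, 1, 1, 2);
my $polls = 0;
my $both_connected = sub {
	my $n = $connected_per_poll[$polls] // $connected_per_poll[-1];
	$polls++;
	return $n == 2;
};

my $ok = poll_until($both_connected, 10, 0.01);
print "both standbys connected after $polls polls\n" if $ok;
```

In this model the check succeeds on the fourth poll, once both simulated
walsenders are visible; only then would the test go on to stop the
primary. If the real poll_query_until signals a timeout through its
return value in the same way (an assumption worth checking against the
PostgreSQL::Test::Cluster documentation), appending "or die" to the call
in the patch would make a timeout fail loudly.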
