pgsql: Fix rare instability in recovery TAP test 004_timeline_switch

Michael Paquier Wed, 04 Mar 2026 17:06:58 -0800

Fix rare instability in recovery TAP test 004_timeline_switch

This fixes a problem similar to ad8c86d22cbd.  In this case, the test
could fail under the following circumstances:
- The primary is stopped with teardown_node(), meaning that it may not
be able to send all its WAL records to standby_1 and standby_2.
- If standby_2 receives more records than standby_1, attempting to
reconnect standby_2 to the promoted standby_1 would fail because of a
timeline fork.


This race condition is fixed with a simple trick: instead of tearing
down the primary, it is stopped cleanly so that all the WAL records of the
primary are received and flushed by both standby_1 and standby_2.  Once
we do that, there is no need for a wait_for_catchup() before stopping
the node.  The test wants to check that a timeline jump can be achieved
when reconnecting a standby to a promoted standby in the same cluster,
hence an immediate stop of the primary is not required.
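
As a rough illustration of the race and of why a clean stop avoids it,
here is a toy model in Python.  This is not PostgreSQL code: the LSN
values, the Standby class and the can_follow() helper are hypothetical,
and the rule is simplified to "a standby can follow the promoted node
only if it has not received WAL past the point where the new timeline
forked".

```python
from dataclasses import dataclass


@dataclass
class Standby:
    name: str
    received_lsn: int  # last WAL position received from the old primary


def can_follow(promoted: Standby, follower: Standby) -> bool:
    # The new timeline forks at the promoted node's last received position.
    # A follower that is already past that point has diverged.
    return follower.received_lsn <= promoted.received_lsn


PRIMARY_END_LSN = 100  # hypothetical end of WAL on the primary

# Immediate stop: each standby may be missing a different tail of WAL.
standby_1 = Standby("standby_1", received_lsn=80)
standby_2 = Standby("standby_2", received_lsn=90)
immediate_ok = can_follow(standby_1, standby_2)

# Clean stop: both standbys have received and flushed everything.
standby_1_clean = Standby("standby_1", received_lsn=PRIMARY_END_LSN)
standby_2_clean = Standby("standby_2", received_lsn=PRIMARY_END_LSN)
clean_ok = can_follow(standby_1_clean, standby_2_clean)

print(immediate_ok, clean_ok)
```

In this sketch the immediate-stop case fails only when standby_2 happens
to be ahead of standby_1, which matches the rarity of the failure.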

This failure is harder to reach than the previous instability of
009_twophase, but the buildfarm has been able to detect this failure
at least once.  I have tried Alexander Lakhin's test trick with the
bgwriter and very aggressive standby snapshots, but I could not
reproduce it directly.  It is reachable, as the buildfarm has proved.

Backpatch down to all supported branches, as this problem can lead to
spurious failures in the buildfarm.

Discussion: https://postgr.es/m/[email protected]
Backpatch-through: 14

Branch
------
REL_14_STABLE

Details
-------
https://git.postgresql.org/pg/commitdiff/1346794feb776603232e4cd330affe31ab3c4e1e

Modified Files
--------------
src/test/recovery/t/004_timeline_switch.pl | 10 ++++------
1 file changed, 4 insertions(+), 6 deletions(-)
