Fix rare instability in recovery TAP test 004_timeline_switch This fixes a problem similar to ad8c86d22cbd. In this case, the test could fail under the following circumstances: - The primary is stopped with teardown_node(), meaning that it may not be able to send all its WAL records to standby_1 and standby_2. - If standby_2 receives more records than standby_1, attempting to reconnect standby_2 to the promoted standby_1 would fail because of a timeline fork.
This race condition is fixed with a simple trick: instead of tearing down the primary, it is stopped cleanly so as all the WAL records of the primary are received and flushed by both standby_1 and standby_2. Once we do that, there is no need for a wait_for_catchup() before stopping the node. The test wants to check that a timeline jump can be achieved when reconnecting a standby to a promoted standby in the same cluster, hence an immediate stop of the primary is not required. This failure is harder to reach than the previous instability of 009_twophase, still the buildfarm has been able to detect this failure at least once. I have tried Alexander Lakhin's test trick with the bgwriter and very aggressive standby snapshots, but I could not reproduce it directly. It is reachable, as the buildfarm has proved. Backpatch down to all supported branches, and this problem can lead to spurious failures in the buildfarm. Discussion: https://postgr.es/m/[email protected] Backpatch-through: 14 Branch ------ REL_14_STABLE Details ------- https://git.postgresql.org/pg/commitdiff/1346794feb776603232e4cd330affe31ab3c4e1e Modified Files -------------- src/test/recovery/t/004_timeline_switch.pl | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-)
