Hi hackers! We hit an interesting bug(?) in archive recovery during a planned HA switchover: one of the standbys followed the wrong timeline. I think I saw this problem discussed on pgsql-hackers before, but searching didn't turn anything up.
Here's the switchover algorithm we employ:
0. The Old Primary's synchronous_standby_names is set to only the New Primary.
1. The other standbys' primary_conninfo is pointed at the New Primary, so they are now cascade standbys.
2. The Old Primary is told to shut down asynchronously.
3. After 5 seconds the New Primary is promoted.
4. Once the Old Primary has shut down, it is rewound if the shutdown took more than 5 seconds.

Here are the logs from the standby that stayed on the Old Primary's timeline, sorry for the verbosity:

2026-02-18 11:50:01.173 MSK,,,2534,,69957d39.9e6,1,,2026-02-18 11:50:01 MSK,,0,LOG,00000,"started streaming WAL from primary at 2EC2/A3000000 on timeline 9",,,,,,,,,"","walreceiver",,0
2026-02-18 11:52:01.017 MSK,,,2318,,69957d35.90e,7,,2026-02-18 11:49:57 MSK,,0,LOG,00000,"received SIGHUP, reloading configuration files",,,,,,,,,"","postmaster",,0
2026-02-18 11:52:01.018 MSK,,,2318,,69957d35.90e,8,,2026-02-18 11:49:57 MSK,,0,LOG,00000,"parameter ""primary_conninfo"" changed to ""host=rc1a-skip.mdb.yandexcloud.net port=5432 user=repl keepalives_idle=15 keepalives_interval=3 keepalives_count=5 application_name=rc1d_skip_mdb_yandexcloud_n>
2026-02-18 11:52:01.020 MSK,,,2320,,69957d36.910,26,,2026-02-18 11:49:58 MSK,1/0,0,LOG,00000,"WAL receiver process shutdown requested",,,,,,,,,"","startup",,0
2026-02-18 11:52:01.020 MSK,,,2534,,69957d39.9e6,2,,2026-02-18 11:50:01 MSK,,0,FATAL,57P01,"terminating walreceiver process due to administrator command",,,,,,,,,"","walreceiver",,0
2026-02-18 11:52:01.186 MSK,,,4227,,69957db1.1083,1,,2026-02-18 11:52:01 MSK,,0,LOG,00000,"started streaming WAL from primary at 2EC2/A7000000 on timeline 9",,,,,,,,,"","walreceiver",,0
2026-02-18 11:52:12.193 MSK,,,4227,,69957db1.1083,2,,2026-02-18 11:52:01 MSK,,0,LOG,00000,"replication terminated by primary server","End of WAL reached on timeline 9 at 2EC2/A71F0820.",,,,,,,,"","walreceiver",,0
2026-02-18 11:52:12.198 MSK,,,4227,,69957db1.1083,3,,2026-02-18 11:52:01 MSK,,0,LOG,00000,"fetching timeline history file for timeline 10 from primary server",,,,,,,,,"","walreceiver",,0
2026-02-18 11:52:12.207 MSK,,,4227,,69957db1.1083,4,,2026-02-18 11:52:01 MSK,,0,FATAL,57P01,"terminating walreceiver process due to administrator command",,,,,,,,,"","walreceiver",,0
2026-02-18 11:52:12.585 MSK,,,2320,,69957d36.910,27,,2026-02-18 11:49:58 MSK,1/0,0,LOG,00000,"restored log file ""0000000A.history"" from archive",,,,,,,,,"","startup",,0
2026-02-18 11:52:12.588 MSK,,,2320,,69957d36.910,28,,2026-02-18 11:49:58 MSK,1/0,0,LOG,00000,"new target timeline is 10",,,,,,,,,"","startup",,0
2026-02-18 11:52:12.861 MSK,,,2320,,69957d36.910,29,,2026-02-18 11:49:58 MSK,1/0,0,LOG,00000,"restored log file ""0000000900002EC2000000A7"" from archive",,,,,,,,,"","startup",,0
2026-02-18 11:52:13.122 MSK,,,2320,,69957d36.910,30,,2026-02-18 11:49:58 MSK,1/0,0,LOG,00000,"unexpected pageaddr 2EC2/8000000 in log segment 0000000900002EC2000000A8, offset 0",,,,,,,,,"","startup",,0
2026-02-18 11:52:13.178 MSK,,,4382,,69957dbd.111e,1,,2026-02-18 11:52:13 MSK,,0,LOG,00000,"started streaming WAL from primary at 2EC2/A8000000 on timeline 10",,,,,,,,,"","walreceiver",,0
2026-02-18 11:52:13.178 MSK,,,4382,,69957dbd.111e,2,,2026-02-18 11:52:13 MSK,,0,FATAL,08P01,"could not receive data from WAL stream: ERROR: requested starting point 2EC2/A8000000 is ahead of the WAL flush position of this server 2EC2/A71F6AE0",,,,,,,,,"","walreceiver",,0

One interesting part: the walreceiver that was "fetching timeline history file for timeline 10 from primary server" was killed, probably by the startup process reaching the end of the timeline. (That's a separate issue worth looking at, but survivable: the archive is there as a fallback.) We restored the history file successfully from the archive and decided to follow timeline 10. But 0000000_A_00002EC2000000A7 was not archived yet, while 0000000_9_00002EC2000000A7 already was. So we took 0000000900002EC2000000A7 and applied it. Now we are beyond the timeline switchpoint!
The archive contained the new timeline's history file (so recovery correctly identified TL10 as the target), but the WAL segment at the switch-point position had not yet been archived for TL10. The same segment number existed on TL9 (the old primary continued writing after promotion), and recovery applied it:

restored log file "0000000900002EC2000000A7" from archive

That segment carries post-divergence data from the old primary. Applying it is wrong, and the damage is silent: we cannot get to timeline 10 now.

Root cause: XLogFileReadAnyTLI() iterates expectedTLEs newest-first and falls back to older timelines when a segment is not found. The existing beginseg check is meant to prevent this, but it only guards the final timeline's switch point (targetBeginSeg = the target TL's begin_seg). In a three-timeline chain (TL1 -> TL2 -> TL3), an absent TL2 segment at TL2's own switch point is not guarded, because segno < TL3.begin_seg, so TL1's divergent segment is used.

Fix: replace the targetBeginSeg pre-check with a found_eligible flag. The invariant is simple: for any segment, exactly one timeline owns it -- the newest one whose begin_seg <= segno. If that segment is absent, recovery must stop, not fall back. Once the loop identifies the first eligible timeline and fails to open its segment, it breaks rather than continuing to older timelines.

Two TAP tests cover this:
- t/052: the two-timeline case (the original scenario)
- t/053: a three-timeline case that the old targetBeginSeg guard missed

I think the same fix is correct even when there are multiple switch points within one segment.

I recognise this changes existing behaviour. Someone who intentionally archives only a parent timeline's segments and relies on the fallback will now see recovery stall instead of silently proceeding. I think that outcome is strictly better: the current behaviour opens a window for applying divergent data with no error, which is worse than a clear stall waiting for the right segment.
Two related issues I'll address in separate threads later:
- recovery_target_timeline='latest' picks the target by timeline ID. In a partition scenario the numerically highest TL may not be the one with the most data; choosing by end LSN would be safer. Maybe at some point we will even have to stop using incrementing numbers for timelines...
- The startup-process-kills-walreceiver interaction that forced the fallback to the archive in the first place.

Patch attached. WDYT?

Best regards, Andrey Borodin.
0001-Fix-archive-recovery-falling-back-to-wrong-timeline-.patch
Description: Binary data
