> On 20 Feb 2026, at 23:36, Andrey Borodin <[email protected]> wrote:
>
> Basically, it all boils down to simple invariant: "When restoring to
> specific timeline we should not take turns to other timelines."

Here is patch 0002 -- an optimization, independent of 0001.

After walrcv_endstreaming() returns, walreceiver fetches the new
timeline's history file (WalRcvFetchTimeLineHistoryFiles) before
transitioning to WALRCV_WAITING. During this window walreceiver remains
in WALRCV_STREAMING. Startup sleeps in WAIT_EVENT_RECOVERY_WAL_STREAM,
receiving no new data. When it wakes and finds WalRcvStreaming()==true
in the XLOG_FROM_STREAM handler, it kills walreceiver. The new
walreceiver must reconnect and re-request the same switch -- wasteful
but harmless. (In the original report this appears as "terminating
walreceiver process due to administrator command" at 11:52:12.)

Fix: add WALRCV_SWITCHING_TIMELINE. Walreceiver enters it just before
WalRcvFetchTimeLineHistoryFiles(). WalRcvStreaming() returns false for
this state, so startup backs off instead of killing walreceiver.
WakeupRecovery() is called immediately after the transition so startup
exits its indefinite RECOVERY_WAL_STREAM sleep without waiting for
WalRcvWaitForStartPosition().

A guard in RequestXLogStreaming() is also required: because
WALRCV_SWITCHING_TIMELINE is not "streaming", the XLOG_FROM_STREAM
failure path no longer calls XLogShutdownWalRcv() before retrying
archive. When startup cycles back to RequestXLogStreaming(), walreceiver
may still be in WALRCV_SWITCHING_TIMELINE, which would Assert-fail the
STOPPED||WAITING check. The guard returns early in that case.

One concern: WALRCV_SWITCHING_TIMELINE is not protected by
wal_receiver_timeout, which only runs inside the streaming loop. Before
this patch startup's kill provided an implicit bound on the history
fetch; now only TCP-level timeouts apply. I think this warrants a
follow-up, but it is out of scope here.

Test 054 uses an injection point to freeze walreceiver in
WALRCV_SWITCHING_TIMELINE and verifies startup enters
RecoveryRetrieveRetryInterval rather than killing walreceiver.

WDYT?

Best regards, Andrey Borodin.
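
For readers skimming the thread, the interaction between the new state, WalRcvStreaming() and the guard in RequestXLogStreaming() can be modeled roughly as below. This is a standalone toy sketch, not code from the patch: the names ToyWalRcvState, toy_walrcv_streaming() and toy_request_xlog_streaming() are hypothetical, shared memory and locking are omitted, and the startup-timeout handling of the real WalRcvStreaming() is left out. The only state assumed to be new is WALRCV_SWITCHING_TIMELINE.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy copy of the walreceiver states; only SWITCHING_TIMELINE is new. */
typedef enum
{
    WALRCV_STOPPED,
    WALRCV_STARTING,
    WALRCV_STREAMING,
    WALRCV_WAITING,
    WALRCV_RESTARTING,
    WALRCV_STOPPING,
    WALRCV_SWITCHING_TIMELINE   /* fetching the new timeline's history file */
} ToyWalRcvState;

/*
 * Simplified model of WalRcvStreaming(): the new state is deliberately
 * reported as "not streaming", so startup has no reason to kill walreceiver.
 */
static bool
toy_walrcv_streaming(ToyWalRcvState state)
{
    return state == WALRCV_STREAMING ||
           state == WALRCV_STARTING ||
           state == WALRCV_RESTARTING;
}

/*
 * Simplified model of the guard in RequestXLogStreaming(). Returns true if
 * a streaming request was issued, false if it backed off because
 * walreceiver is still busy switching timelines.
 */
static bool
toy_request_xlog_streaming(ToyWalRcvState *state)
{
    /* Guard: without it, the assertion below would fire in this state. */
    if (*state == WALRCV_SWITCHING_TIMELINE)
        return false;

    assert(*state == WALRCV_STOPPED || *state == WALRCV_WAITING);

    /* A stopped walreceiver is launched; a waiting one is restarted. */
    *state = (*state == WALRCV_STOPPED) ? WALRCV_STARTING : WALRCV_RESTARTING;
    return true;
}
```

In this model, a caller that finds walreceiver in WALRCV_SWITCHING_TIMELINE gets false from both functions and leaves the state untouched, which corresponds to startup falling through to its retry interval instead of killing or re-requesting.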

v2-0001-Fix-archive-recovery-falling-back-to-wrong-timeli.patch
v2-0003-Add-test-for-walreceiver-WALRCV_SWITCHING_TIMELIN.patch
v2-0002-walreceiver-add-WALRCV_SWITCHING_TIMELINE-state.patch
