> On 20 Feb 2026, at 23:36, Andrey Borodin <[email protected]> wrote:
> 
> Basically, it all boils down to simple invariant: "When restoring to
> specific timeline we should not take turns to other timelines."

Here is patch 0002 -- an optimization, independent of 0001.

After walrcv_endstreaming() returns, walreceiver fetches the new
timeline's history file (WalRcvFetchTimeLineHistoryFiles) before
transitioning to WALRCV_WAITING.  During this window walreceiver
remains in WALRCV_STREAMING.

Meanwhile the startup process sleeps in WAIT_EVENT_RECOVERY_WAL_STREAM,
receiving no new data.  When it wakes and finds WalRcvStreaming()==true
in the XLOG_FROM_STREAM handler, it kills walreceiver.  The new
walreceiver must then reconnect and re-request the same switch --
wasteful but harmless.  (In the original report this appears as
"terminating walreceiver process due to administrator command" at
11:52:12.)

Fix: add WALRCV_SWITCHING_TIMELINE.  Walreceiver enters it just
before WalRcvFetchTimeLineHistoryFiles().  WalRcvStreaming() returns
false for this state, so startup backs off instead of killing
walreceiver.  WakeupRecovery() is called immediately after the
transition so startup exits its indefinite RECOVERY_WAL_STREAM sleep
without waiting for WalRcvWaitForStartPosition().
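
To make the intended behaviour concrete, here is a standalone toy model
(not the patch code).  The M_-prefixed names and both functions are
made up for illustration.  Treating STARTING and RESTARTING as
"streaming" is my assumption about the existing predicate, and the
startup-timeout handling for STARTING is left out.

```c
#include <stdbool.h>

/* Toy copy of the walreceiver states; hypothetical names, not WalRcvState. */
typedef enum
{
	M_WALRCV_STOPPED,
	M_WALRCV_STARTING,
	M_WALRCV_STREAMING,
	M_WALRCV_WAITING,
	M_WALRCV_SWITCHING_TIMELINE,	/* state proposed by 0002 */
	M_WALRCV_RESTARTING,
	M_WALRCV_STOPPING
} ModelWalRcvState;

/* What startup does after waking up with no new data (illustrative). */
typedef enum
{
	M_STARTUP_KILL_WALRECEIVER,
	M_STARTUP_BACK_OFF
} ModelStartupAction;

/*
 * Model of the "is walreceiver streaming?" predicate.  The new state is
 * deliberately not listed, so it falls through to "false".
 */
bool
model_walrcv_streaming(ModelWalRcvState state)
{
	switch (state)
	{
		case M_WALRCV_STARTING:		/* assumed, timeout check omitted */
		case M_WALRCV_STREAMING:
		case M_WALRCV_RESTARTING:	/* assumed */
			return true;
		default:
			return false;
	}
}

/* Startup kills walreceiver only while the predicate reports streaming. */
ModelStartupAction
model_startup_on_stream_failure(ModelWalRcvState state)
{
	return model_walrcv_streaming(state)
		? M_STARTUP_KILL_WALRECEIVER
		: M_STARTUP_BACK_OFF;
}
```

With walreceiver in M_WALRCV_STREAMING the model picks the kill path,
which is the pre-patch behaviour during the history fetch.  With
M_WALRCV_SWITCHING_TIMELINE it backs off.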

A guard in RequestXLogStreaming() is also required: because
WALRCV_SWITCHING_TIMELINE does not count as "streaming", the
XLOG_FROM_STREAM failure path no longer calls XLogShutdownWalRcv()
before retrying the archive.  When startup cycles back to
RequestXLogStreaming(), walreceiver may still be in
WALRCV_SWITCHING_TIMELINE, which would trip the Assert that the state
is STOPPED||WAITING.  The guard returns early in that case.
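
The shape of the guard, again as a standalone toy model and not the
patch itself.  The G_-prefixed names and guard_request_streaming() are
hypothetical.  The STOPPED -> STARTING and WAITING -> RESTARTING
transitions are my reading of what the existing function does.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy copy of the relevant walreceiver states; hypothetical names. */
typedef enum
{
	G_WALRCV_STOPPED,
	G_WALRCV_STARTING,
	G_WALRCV_WAITING,
	G_WALRCV_SWITCHING_TIMELINE,	/* state proposed by 0002 */
	G_WALRCV_RESTARTING
} GuardWalRcvState;

/*
 * Model of a streaming request from startup.  Returns false when the
 * guard skipped the request, true when a (re)start was requested.
 */
bool
guard_request_streaming(GuardWalRcvState *state)
{
	/*
	 * Guard: walreceiver is still fetching the history file.  Leave it
	 * alone; startup will come back here on its next retry cycle.
	 */
	if (*state == G_WALRCV_SWITCHING_TIMELINE)
		return false;

	/* Without the guard, the new state would reach and trip this check. */
	assert(*state == G_WALRCV_STOPPED || *state == G_WALRCV_WAITING);

	/* Assumed transitions: fresh launch vs. waking a waiting walreceiver. */
	*state = (*state == G_WALRCV_STOPPED)
		? G_WALRCV_STARTING
		: G_WALRCV_RESTARTING;
	return true;
}
```

Returning early keeps the request idempotent from startup's point of
view: a skipped call changes nothing, and the next retry finds
walreceiver in WAITING once the history fetch has finished.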

One concern: WALRCV_SWITCHING_TIMELINE is not covered by
wal_receiver_timeout, which is only enforced inside the streaming loop.
Before this patch, startup's kill provided an implicit bound on the
history fetch; now only TCP-level timeouts apply.  I think this
warrants a follow-up, but it is out of scope here.

Test 054 uses an injection point to freeze walreceiver in
WALRCV_SWITCHING_TIMELINE and verifies startup enters
RecoveryRetrieveRetryInterval rather than killing walreceiver.

WDYT?

Best regards, Andrey Borodin.

Attachment: v2-0001-Fix-archive-recovery-falling-back-to-wrong-timeli.patch
Description: Binary data

Attachment: v2-0003-Add-test-for-walreceiver-WALRCV_SWITCHING_TIMELIN.patch
Description: Binary data

Attachment: v2-0002-walreceiver-add-WALRCV_SWITCHING_TIMELINE-state.patch
Description: Binary data
