Re: BUG: Cascading standby fails to reconnect after falling back to archive recovery

Marco Nenciarini Wed, 18 Mar 2026 01:34:22 -0700

On Wed, Mar 18, 2026 at 2:51 AM Xuneng Zhou <[email protected]> wrote:

> On Tue, Mar 17, 2026 at 8:20 PM Xuneng Zhou <[email protected]> wrote:
> >
> > On Tue, Mar 17, 2026 at 7:56 PM Marco Nenciarini
> > <[email protected]> wrote:
> > >
> > > Thanks for verifying the fix and improving the test, Xuneng.
> > >
> > > The wait_for_event() synchronization is a nice addition — it gives
> > > deterministic proof that the walreceiver actually entered the
> > > upstream-catchup path.  The scoped log window with slurp_file() is
> > > also cleaner than the broad log_contains() I had before.
> > >
>
> After thinking about this more, I’m less satisfied with, and less
> convinced by, polling at wal_retrieve_retry_interval. If the upstream
> stalls for a
> long time, or permanently, the walreceiver can loop indefinitely,
> leaving startup effectively pinned in the streaming path instead of
> switching to other WAL sources. In that case, repeated “ahead of flush
> position” log entries can also become noisy. On the other hand, if the
> upstream catches up quickly, walreceiver still won’t notice until the
> next interval, adding unnecessary latency of up to one full
> wal_retrieve_retry_interval.
>

Good points, Xuneng.

For the log noise: we could emit the first "ahead of flush position"
message at LOG level, then demote subsequent attempts to DEBUG1 until
the condition clears.  That keeps the initial occurrence visible for
diagnostics without flooding the log during a long wait.
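
A rough sketch of that demotion logic, to make the idea concrete.  This
is not the patch: AheadLogState, ahead_log_level(), ahead_log_reset()
and the SKETCH_* constants are hypothetical names standing in for
whatever state and elevel constants the walreceiver would actually use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the server's LOG and DEBUG1 elevels. */
enum
{
	SKETCH_DEBUG1,
	SKETCH_LOG
};

/* Hypothetical per-walreceiver state for one "ahead of flush" episode. */
typedef struct AheadLogState
{
	bool		reported;	/* already emitted once at LOG level? */
} AheadLogState;

/*
 * Pick the level for the next "ahead of flush position" message:
 * LOG for the first occurrence in an episode, DEBUG1 for repeats.
 */
static int
ahead_log_level(AheadLogState *state)
{
	assert(state != NULL);

	if (!state->reported)
	{
		state->reported = true;
		return SKETCH_LOG;
	}
	return SKETCH_DEBUG1;
}

/*
 * Call once the condition clears, so that a later episode is
 * reported at LOG level again.
 */
static void
ahead_log_reset(AheadLogState *state)
{
	assert(state != NULL);
	state->reported = false;
}
```

The caller would pass the returned level to its logging call on each
retry and invoke the reset once the upstream has caught up.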

For the indefinite loop: I agree that unbounded polling is not ideal.
The gap this fix targets is bounded in practice: the startup process
alternates between archive recovery and streaming attempts, so at
each streaming attempt the cascading standby is at most one WAL
segment ahead of the upstream.  If the gap is larger than that,
something more fundamental is wrong and the walreceiver should get out
of the way so the startup process can fall back to other WAL sources.

We could cap the wait with a threshold: if startpoint is more than
one wal_segment_size ahead of the upstream's flush position, skip the
wait and let START_REPLICATION proceed normally (and fail), so the
walreceiver exits and the startup process can switch to archive.
That way we absorb the one-segment gap that arises naturally from
archive recovery, without masking larger problems.
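
As a minimal sketch of the check I have in mind (not the patch itself):
SketchLsn and should_wait_for_upstream() are hypothetical names, with
SketchLsn standing in for a 64-bit WAL position, and the segment size
is passed in explicitly so the snippet stands on its own.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a 64-bit WAL byte position. */
typedef uint64_t SketchLsn;

/*
 * Decide whether the walreceiver should wait for the upstream to catch
 * up before issuing START_REPLICATION.
 *
 * Returns true only when startpoint is ahead of the upstream's flush
 * position by no more than one WAL segment.  A larger gap returns
 * false, so START_REPLICATION proceeds (and fails) and the startup
 * process can switch to another WAL source.
 */
static bool
should_wait_for_upstream(SketchLsn startpoint,
						 SketchLsn upstream_flush,
						 uint64_t wal_segment_size)
{
	/* Upstream already has the requested WAL: nothing to wait for. */
	if (startpoint <= upstream_flush)
		return false;

	/* Safe from underflow: startpoint > upstream_flush here. */
	return (startpoint - upstream_flush) <= wal_segment_size;
}
```

Whether a gap of exactly one segment should count as inside or outside
the bound is a detail; the sketch treats it as inside.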

Thoughts on whether wal_segment_size is the right bound, or if
something else would be more appropriate?

Best regards,
Marco
