Re: BUG: Cascading standby fails to reconnect after falling back to archive recovery

Xuneng Zhou Mon, 16 Mar 2026 18:15:48 -0700

On Tue, Mar 17, 2026 at 9:04 AM Xuneng Zhou <[email protected]> wrote:
>
> Hi,
>
> Thanks for the patch.
>
> On Tue, Mar 17, 2026 at 5:49 AM Marco Nenciarini
> <[email protected]> wrote:
> >
> > Attached is a v2 patch that implements the "handshake clamp" approach
> > Xuneng suggested.  Rather than tracking lastStreamedFlush in
> > process-local state (which doesn't survive a cascade restart, as
> > Fujii-san demonstrated), it uses the WAL flush position already
> > returned by IDENTIFY_SYSTEM.
> >
> > The walreceiver now checks the upstream's flush position before issuing
> > START_REPLICATION.  If the requested startpoint is ahead (on the same
> > timeline), it waits for wal_retrieve_retry_interval and retries.  This
> > works across restarts since it queries the upstream's live position on
> > every connection attempt, and requires no new state variables.
> >
> > When timelines differ, we let START_REPLICATION handle the timeline
> > negotiation as before.
> >
> > The patch includes a TAP test (053_cascade_reconnect.pl) that
> > reproduces the scenario and verifies the fix.
> >
>
> I haven’t looked into it in detail yet, but it looks good overall.
> I’ll test it further and verify that the issue has been resolved.
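
For reference, here is how I read the handshake clamp described above,
as a minimal standalone sketch. It is not taken from the patch:
clamp_decision() and the ClampDecision enum are hypothetical names, and
the typedefs only stand in for PostgreSQL's XLogRecPtr and TimeLineID.
It assumes the upstream's flush position and timeline have already been
obtained from IDENTIFY_SYSTEM.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for PostgreSQL's types, so the sketch compiles on its own. */
typedef uint64_t XLogRecPtr;
typedef uint32_t TimeLineID;

/* Hypothetical result type; the actual patch may structure this differently. */
typedef enum
{
    CLAMP_START_STREAMING,  /* proceed to START_REPLICATION */
    CLAMP_WAIT_AND_RETRY    /* wait wal_retrieve_retry_interval, then retry */
} ClampDecision;

/*
 * Decide what to do before issuing START_REPLICATION, given the requested
 * startpoint and the flush position / timeline reported by the upstream.
 */
static ClampDecision
clamp_decision(XLogRecPtr startpoint, TimeLineID start_tli,
               XLogRecPtr upstream_flush, TimeLineID upstream_tli)
{
    /* Different timelines: leave the negotiation to START_REPLICATION. */
    if (start_tli != upstream_tli)
        return CLAMP_START_STREAMING;

    /* Same timeline, but the upstream has not flushed that far yet. */
    if (startpoint > upstream_flush)
        return CLAMP_WAIT_AND_RETRY;

    return CLAMP_START_STREAMING;
}
```

Because the decision depends only on values fetched on each connection
attempt, nothing has to be remembered across a walreceiver restart.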


One thing I’m not sure about is whether this patch needs a standalone
test file, or whether the test would fit into one of the existing TAP
tests. I looked at several candidates for integration:

001_stream_rep.pl: it already has a primary -> standby -> cascading
standby setup, and it even touches primary_conninfo reload behavior.
But it is already a large mixed-purpose file, and this bug needs a
fairly specific archive-fallback reconnection story. Adding it there
would make that file even less focused.

025_stuck_on_old_timeline.pl: this is the nearest thematic neighbor
since it combines cascading replication and archive/stream
interactions. But it is really about timeline-following after
promotion, not “downstream advances via archive and then must
reconnect to an upstream that is still behind”.

048_vacuum_horizon_floor.pl: it already exercises stopping and
restarting walreceiver via primary_conninfo reload, but it has nothing
to do with archive fallback or cascading reconnect logic.

The failure scenario is specific enough, and the three-node setup plus
archive fallback plus reconnect check seems to be a coherent
reproducer on its own, so keeping it as a standalone file seems
reasonable to me.


--
Best,
Xuneng
