On Tue, Mar 17, 2026 at 9:04 AM Xuneng Zhou <[email protected]> wrote:
>
> Hi,
>
> Thanks for the patch.
>
> On Tue, Mar 17, 2026 at 5:49 AM Marco Nenciarini
> <[email protected]> wrote:
> >
> > Attached is a v2 patch that implements the "handshake clamp" approach
> > Xuneng suggested. Rather than tracking lastStreamedFlush in
> > process-local state (which doesn't survive a cascade restart, as
> > Fujii-san demonstrated), it uses the WAL flush position already
> > returned by IDENTIFY_SYSTEM.
> >
> > The walreceiver now checks the upstream's flush position before issuing
> > START_REPLICATION. If the requested startpoint is ahead (on the same
> > timeline), it waits for wal_retrieve_retry_interval and retries. This
> > works across restarts since it queries the upstream's live position on
> > every connection attempt, and requires no new state variables.
> >
> > When timelines differ, we let START_REPLICATION handle the timeline
> > negotiation as before.
> >
> > The patch includes a TAP test (053_cascade_reconnect.pl) that
> > reproduces the scenario and verifies the fix.
> >
>
> I haven’t looked into it in detail yet, but it looks good overall.
> I’ll test it further and verify that the issue has been resolved.
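
For reference while reviewing, here is a minimal sketch of the clamp
decision as the quoted description states it. This is not the patch's
code (the real change is C in the walreceiver); the function names and
argument layout below are hypothetical and only model the comparison:
same timeline and requested startpoint ahead of the flush position
reported by IDENTIFY_SYSTEM means wait and retry, otherwise proceed.

```python
def parse_lsn(text: str) -> int:
    """Convert a PostgreSQL LSN in 'X/Y' hex form to a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def should_wait_before_start(start_tli: int, startpoint: str,
                             upstream_tli: int, upstream_flush: str) -> bool:
    """Hypothetical model of the handshake clamp.

    Returns True when the caller should sleep for
    wal_retrieve_retry_interval and retry instead of issuing
    START_REPLICATION.
    """
    # Different timelines: leave negotiation to START_REPLICATION.
    if start_tli != upstream_tli:
        return False
    # Same timeline: wait only if we would ask for WAL the upstream
    # has not flushed yet.
    return parse_lsn(startpoint) > parse_lsn(upstream_flush)
```

For example, with both nodes on timeline 1, a startpoint of 0/5000000
against an upstream flush position of 0/4000000 would make the sketch
return True, which corresponds to the "wait and retry" branch.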

One thing I’m not sure about is whether we need to create a standalone
test file for this patch, or if it would fit well within existing TAP
tests. I found several places for integration:

001_stream_rep.pl: it already has a primary -> standby -> cascading
standby setup, and it even touches primary_conninfo reload behavior.
But it is already a large mixed-purpose file, and this bug needs a
fairly specific archive-fallback reconnection story. Adding it there
would make that file even less focused.

025_stuck_on_old_timeline.pl: this is the nearest thematic neighbor
since it combines cascading replication and archive/stream
interactions. But it is really about timeline-following after
promotion, not “downstream advances via archive and then must
reconnect to an upstream that is still behind”.

048_vacuum_horizon_floor.pl: it already exercises stopping and
restarting walreceiver via primary_conninfo reload, but it has nothing
to do with archive fallback or cascading reconnect logic.

The failure scenario is specific enough, and the three-node setup plus
archive fallback plus reconnect check seems to be a coherent
reproducer on its own.

--
Best,
Xuneng
