On Fri, Mar 20, 2026 at 8:52 AM Xuneng Zhou <[email protected]> wrote:
>
> On Thu, Mar 19, 2026 at 5:59 PM Marco Nenciarini
> <[email protected]> wrote:
> >
> > Hi Xuneng,
> >
> > On Wed, Mar 19, 2026 at 10:33 AM Xuneng Zhou <[email protected]> wrote:
> > >
> > > I am not sure about this bound here. It seems to me that the gap could
> > > be several segments due to the upstream lag.
> >
> > The gap is bounded to one segment, but let me explain why more
> > clearly, because I don't think I did a good job of it before.
> >
> > The mismatch exists even when both nodes have replayed exactly the
> > same WAL. The upstream standby never produces WAL itself - it only
> > advances via archive (or streaming, but that's exactly what we're
> > trying to establish here). After both have replayed the same
> > archived file, the cascade's "next position to read" ends up just
> > past the upstream's "last position replayed" (which is what
> > GetStandbyFlushRecPtr reports). That gap is inherently within one
> > segment.
> >
> > This is the core of the bug: the gap never closes on its own. When
> > the next WAL file arrives, both nodes restore it and advance by one
> > full segment, but the same mismatch reappears. They keep advancing
> > and never successfully start streaming.
> >
> > When the upstream is genuinely far behind (a gap larger than one
> > segment), the threshold correctly lets START_REPLICATION fail so the
> > startup process can fall back to archive.
>
> The one-segment bound holds for the case where both nodes have
> replayed exactly the same WAL — the gap comes from
> RequestXLogStreaming truncating recptr to the segment boundary, so
> startpoint is always at the start of the next segment while
> GetStandbyFlushRecPtr returns replayPtr within the current one. I
> think that part of the analysis is correct.
> > After both have replayed the same
> > archived file, the cascade's "next position to read" ends up just
> > past the upstream's "last position replayed" (which is what
> > GetStandbyFlushRecPtr reports).

After taking a closer look, I'm less certain about this, and I'll
investigate further. Could you also explain why you think this is the
case?

> But the gap can legitimately be multiple segments. Consider: the
> upstream standby goes down (or is restarted for maintenance) while the
> primary keeps generating and archiving WAL. The cascade's walreceiver
> loses its connection, startup falls back to archive recovery, and
> restores as many segments as the archive can supply. Meanwhile, the
> upstream's replayPtr is frozen at wherever it was when it went down.
>
> When the upstream comes back and the cascade tries to reconnect, the
> gap can be many segments — bounded only by how much WAL the primary
> archived while the upstream was down. This is a normal operational
> scenario (the reproducer_restart_upstream_portable.sh script exercises
> this), not "something fundamentally wrong."
>
> The question is whether we should fail in this case. As you mentioned,
> if this behavior is intentional, that's fine. Otherwise, it could lead
> to problems.
>
> If there's consensus on this and on the one-segment-gap fix, the
> current TAP test would become non-deterministic.
>
> The test controls the amount of WAL generated (1000 rows of integers —
> a few hundred KB at most), which with the default 16MB
> wal_segment_size almost certainly fits within one segment. And
> pg_switch_wal() ensures one segment boundary is created and archived.
> So in practice, one new segment ends up in the archive that standby_a
> doesn't have.
>
> But there's no explicit mechanism to guarantee this. Several things
> could cause more WAL than expected:
>
> Checkpoints running between step 1 (catchup) and step 4 (INSERT) could
> push the WAL position across a segment boundary.
> Background WAL activity (stats, etc.) adds volume.
>
> If the initial data (1000 rows from CREATE TABLE ... generate_series)
> happened to leave the WAL position near the end of a segment, the new
> INSERT could spill into the next segment.
>
> The test also doesn't verify the gap size — it never checks what
> startpoint or primaryFlushPtr actually are. It only checks the
> outcome: the wait event is hit (WalReceiverUpstreamCatchup) and no
> "ahead of flush position" errors appear in the log. These assertions
> would pass regardless of whether the gap is one segment or five.
>
> So the test works because the workload is small enough to make a
> one-segment gap the most likely outcome, but it's not guaranteed. If a
> threshold at wal_segment_size were added to the patch, there would be
> no test coverage for the multi-segment case — and no guarantee that the
> existing test wouldn't accidentally exercise it on systems with a
> smaller wal_segment_size or heavier background WAL activity.
>
> > > Even with only a one-segment gap, if the upstream server's flush LSN
> > > does not advance, we would remain stuck polling indefinitely.
> >
> > True, but this is consistent with how the walreceiver already behaves
> > when it is connected and streaming: if the upstream stops producing
> > WAL, the walreceiver just sits there waiting on the connection
> > indefinitely. So the polling behavior here is no worse than what
> > already happens in normal operation.
>
> I think the difference is that during normal streaming,
> wal_receiver_timeout will eventually fire and kill the connection,
> whereas the catch-up polling loop has no such timeout.
>
> --
> Best,
> Xuneng

--
Best,
Xuneng
