On Thu, Mar 19, 2026 at 5:59 PM Marco Nenciarini
<[email protected]> wrote:
>
> Hi Xuneng,
>
> On Wed, Mar 19, 2026 at 10:33 AM Xuneng Zhou <[email protected]> wrote:
> >
> > I am not sure about this bound here. It seems to me that the gap could
> > be several segments due to the upstream lag.
>
> The gap is bounded to one segment, but let me explain why more
> clearly because I don't think I did a good job of it before.
>
> The mismatch exists even when both nodes have replayed exactly the
> same WAL.  The upstream standby never produces WAL itself - it only
> advances via archive (or streaming, but that's exactly what we're
> trying to establish here).  After both have replayed the same
> archived file, the cascade's "next position to read" ends up just
> past the upstream's "last position replayed" (which is what
> GetStandbyFlushRecPtr reports).  That gap is inherently within one
> segment.
>
> This is the core of the bug: the gap never closes on its own.  When
> the next WAL file arrives, both nodes restore it and advance by one
> full segment, but the same mismatch reappears.  They keep advancing
> and never successfully start streaming.
>
> When the upstream is genuinely far behind (gap larger than one
> segment), the threshold correctly lets START_REPLICATION fail so the
> startup process can fall back to archive.

The one-segment bound holds for the case where both nodes have
replayed exactly the same WAL — the gap comes from
RequestXLogStreaming truncating recptr to the segment boundary, so
startpoint is always at the start of the next segment while
GetStandbyFlushRecPtr returns replayPtr within the current one. I
think that part of the analysis is correct.
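A standalone sketch of that arithmetic (not PostgreSQL code; SEG_SIZE stands in for wal_segment_size, and SegOffset() mirrors what the XLogSegmentOffset() macro computes) may make the bound concrete:

```c
#include <stdint.h>

/*
 * Minimal sketch of the one-segment bound.  SEG_SIZE stands in for
 * wal_segment_size; SegOffset() mirrors XLogSegmentOffset().
 */
typedef uint64_t XLogRecPtr;

#define SEG_SIZE ((XLogRecPtr) 16 * 1024 * 1024)   /* default 16MB */
#define SegOffset(ptr) ((ptr) & (SEG_SIZE - 1))

/*
 * Once the cascade has fully replayed an archived segment, its next
 * position to read is the start of the following segment.
 */
static XLogRecPtr
next_segment_start(XLogRecPtr ptr)
{
	return (ptr - SegOffset(ptr)) + SEG_SIZE;
}

/*
 * Gap between the cascade's startpoint and the upstream's replayPtr
 * when both have replayed the same WAL: at most SEG_SIZE.
 */
static XLogRecPtr
catchup_gap(XLogRecPtr replayPtr)
{
	return next_segment_start(replayPtr) - replayPtr;
}
```

With replayPtr anywhere inside a segment, catchup_gap() is strictly positive and never exceeds SEG_SIZE, which is the one-segment bound for this specific case.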

But the gap can legitimately be multiple segments. Consider: the
upstream standby goes down (or is restarted for maintenance) while the
primary keeps generating and archiving WAL. The cascade's walreceiver
loses its connection, startup falls back to archive recovery, and
restores as many segments as the archive can supply. Meanwhile the
upstream's replayPtr is frozen at wherever it was when it went down.

When the upstream comes back and the cascade tries to reconnect, the
gap can be many segments — bounded only by how much WAL the primary
archived while the upstream was down. This is a normal operational
scenario (the reproducer_restart_upstream_portable.sh script exercises
this), not "something fundamentally wrong."

The question is whether we should fail in this case. As you mentioned,
if failing here is intentional, that's fine; if not, it means a routine
upstream restart can push the cascade into this failure path.

If there's consensus on that behavior and on the one-segment-gap fix,
the current TAP test would become non-deterministic.

The test controls the amount of WAL generated (1000 rows of integers —
a few hundred KB at most), which with the default 16MB
wal_segment_size almost certainly fits within one segment. And
pg_switch_wal() ensures one segment boundary is created and archived.
So in practice, one new segment ends up in the archive that standby_a
doesn't have.

But there's no explicit mechanism to guarantee this. Several things
could cause more WAL than expected:

- Checkpoints running between step 1 (catchup) and step 4 (INSERT)
  could push the WAL position across a segment boundary.
- Background WAL activity (stats, etc.) adds volume.
- If the initial data (1000 rows from CREATE TABLE ... generate_series)
  happened to leave the WAL position near the end of a segment, the new
  INSERT could spill into the next segment.

The test also doesn't verify the gap size — it never checks what
startpoint or primaryFlushPtr actually are. It only checks the
outcome: the wait event is hit (WalReceiverUpstreamCatchup) and no
"ahead of flush position" errors appear in the log. These assertions
would pass regardless of whether the gap is one segment or five.

So the test works because the workload is small enough to make a
one-segment gap the most likely outcome, but it's not guaranteed. If a
threshold at wal_segment_size were added to the patch, there would be
no test coverage for the multi-segment case — and no guarantee the
existing test wouldn't accidentally exercise it on systems with
smaller wal_segment_size or heavier background WAL activity.
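One way to tighten the test would be to assert on the gap size directly. A hedged sketch of that computation (the function name is illustrative, not part of the patch; SEG_SIZE stands in for wal_segment_size):

```c
#include <stdint.h>

/* Standalone sketch; gap_in_segments() is an illustrative name. */
typedef uint64_t XLogRecPtr;

#define SEG_SIZE ((XLogRecPtr) 16 * 1024 * 1024)

/*
 * Whole segments between the cascade's startpoint and the upstream's
 * flush position: 1 in the benign case the test means to exercise,
 * larger when the upstream lagged across several segments.
 */
static uint64_t
gap_in_segments(XLogRecPtr startpoint, XLogRecPtr primaryFlushPtr)
{
	if (startpoint <= primaryFlushPtr)
		return 0;
	return startpoint / SEG_SIZE - primaryFlushPtr / SEG_SIZE;
}
```

A test asserting gap_in_segments() == 1 before checking the wait event would fail loudly instead of silently exercising the wrong scenario.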


> > Even with only a one-segment gap, if the upstream server's flush LSN
> > does not advance, we would remain stuck polling indefinitely.
>
> True, but this is consistent with how the walreceiver already behaves
> when it is connected and streaming: if the upstream stops producing
> WAL, the walreceiver just sits there waiting on the connection
> indefinitely.  So the polling behavior here is no worse than what
> already happens in normal operation.

I think the difference is that during normal streaming,
wal_receiver_timeout will eventually fire and kill the connection,
whereas the catch-up polling loop has no such timeout.
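A bound analogous to wal_receiver_timeout could be applied to the polling loop. A standalone sketch of such a deadline check (the names are hypothetical, not part of the patch):

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t TimestampMs;

/*
 * Hypothetical deadline check for the catch-up polling loop, mirroring
 * how wal_receiver_timeout bounds a silent streaming connection.
 * A timeout of 0 disables the check, like wal_receiver_timeout = 0.
 */
static bool
catchup_deadline_expired(TimestampMs started, TimestampMs now,
						 TimestampMs timeout_ms)
{
	if (timeout_ms == 0)
		return false;
	return now - started >= timeout_ms;
}
```

On expiry the walreceiver could give up, letting the startup process fall back to archive recovery just as it does when START_REPLICATION fails.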

--
Best,
Xuneng

