Hi Xuneng,

On Wed, Mar 19, 2026 at 10:33 AM Xuneng Zhou <[email protected]> wrote:
>
> I am not sure about this bound here. It seems to me that the gap could
> be several segments due to the upstream lag.

The gap is bounded by one segment.  Let me explain why more clearly,
since I don't think I did a good job of it before.

The mismatch exists even when both nodes have replayed exactly the
same WAL.  The upstream standby never produces WAL itself - it only
advances via archive (or streaming, but that's exactly what we're
trying to establish here).  After both have replayed the same
archived file, the cascade's "next position to read" ends up just
past the upstream's "last position replayed" (which is what
GetStandbyFlushRecPtr reports).  That gap is inherently within one
segment.

This is the core of the bug: the gap never closes on its own.  When
the next WAL file arrives, both nodes restore it and advance by one
full segment, but the same mismatch reappears.  They keep advancing
and never successfully start streaming.
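
To make the arithmetic concrete, here is a toy model of that loop.  It
is only a sketch: LSNs are plain integers, the segment size is assumed
to be the default 16MB, the starting positions are made up, and none
of the names come from the patch.  The real gap need not be the same
size after every file; the model only shows that advancing both nodes
by a full segment cannot close it.

```python
# Toy model, not PostgreSQL code: LSNs are plain integers and the
# segment size is assumed to be the default 16MB.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def simulate(upstream_replayed, cascade_next_read, files_restored):
    """Advance both nodes by one full segment per restored archive file
    and return the gap (cascade minus upstream) seen after each file."""
    gaps = []
    for _ in range(files_restored):
        upstream_replayed += WAL_SEGMENT_SIZE
        cascade_next_read += WAL_SEGMENT_SIZE
        gaps.append(cascade_next_read - upstream_replayed)
    return gaps


# Hypothetical positions: the upstream's last replayed record ends
# 4096 bytes before a segment boundary, and the cascade wants to read
# from that boundary.
boundary = 3 * WAL_SEGMENT_SIZE
gaps = simulate(boundary - 4096, boundary, files_restored=5)
print(gaps)
```

In this model every entry in gaps is positive and smaller than one
segment, no matter how many files are restored.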

When the upstream is genuinely far behind (gap larger than one
segment), the threshold correctly lets START_REPLICATION fail so the
startup process can fall back to archive.
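
Roughly, the decision looks like the sketch below.  The function name,
the return values, and the exact boundary comparison are hypothetical
and only illustrate the rule described above; this is not the patch's
code.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumed default segment size


def start_replication_action(requested_start, upstream_flush,
                             seg_size=WAL_SEGMENT_SIZE):
    """Hypothetical helper: decide what to do with a requested start LSN."""
    if requested_start <= upstream_flush:
        # The upstream already has the requested position.
        return "stream"
    if requested_start - upstream_flush <= seg_size:
        # Ahead of the upstream, but within one segment: keep waiting
        # for the upstream to catch up.
        return "wait"
    # Upstream is genuinely far behind: let START_REPLICATION fail so
    # the startup process can fall back to archive.
    return "fail"
```

For example, with 16MB segments a request 4096 bytes ahead of the
upstream's flush position returns "wait", while a request two
segments ahead returns "fail".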

> Even with only a one-segment gap, if the upstream server's flush LSN
> does not advance, we would remain stuck polling indefinitely.

True, but this is consistent with how the walreceiver already behaves
when it is connected and streaming: if the upstream stops producing
WAL, the walreceiver just sits there waiting on the connection
indefinitely.  So the polling behavior here is no worse than what
already happens in normal operation.

Best regards,
Marco
