Hi Xuneng,

On Wed, Mar 19, 2026 at 10:33 AM Xuneng Zhou <[email protected]> wrote:
>
> I am not sure about this bound here. It seems to me that the gap could
> be several segments due to the upstream lag.

The gap is bounded to one segment, but let me explain why more clearly
because I don't think I did a good job of it before.

The mismatch exists even when both nodes have replayed exactly the same
WAL. The upstream standby never produces WAL itself - it only advances
via archive (or streaming, but that's exactly what we're trying to
establish here). After both have replayed the same archived file, the
cascade's "next position to read" ends up just past the upstream's
"last position replayed" (which is what GetStandbyFlushRecPtr reports).
That gap is inherently within one segment.

This is the core of the bug: the gap never closes on its own. When the
next WAL file arrives, both nodes restore it and advance by one full
segment, but the same mismatch reappears. They keep advancing and never
successfully start streaming.

When the upstream is genuinely far behind (gap larger than one segment),
the threshold correctly lets START_REPLICATION fail so the startup
process can fall back to archive.

> Even with only a one-segment gap, if the upstream server's flush LSN
> does not advance, we would remain stuck polling indefinitely.

True, but this is consistent with how the walreceiver already behaves
when it is connected and streaming: if the upstream stops producing WAL,
the walreceiver just sits there waiting on the connection indefinitely.
So the polling behavior here is no worse than what already happens in
normal operation.

Best regards,
Marco
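
P.S. In case it makes the one-segment bound easier to picture, here is a
rough standalone sketch of the three-way decision described above. It is
not the patch code: decide_start, StartDecision and the Lsn typedef are
hypothetical names made up for illustration, LSNs are treated as plain
64-bit byte positions, and whether the boundary case should compare with
"<" or "<=" is an assumption here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for an LSN: a plain 64-bit byte position. */
typedef uint64_t Lsn;

typedef enum
{
    START_STREAMING,      /* upstream already has the requested position */
    WAIT_FOR_UPSTREAM,    /* gap within one segment: keep polling */
    FALL_BACK_TO_ARCHIVE  /* upstream is genuinely behind: let it fail */
} StartDecision;

/*
 * request_start:  the cascade's "next position to read"
 * upstream_flush: the upstream's "last position replayed"
 * seg_size:       WAL segment size in bytes (assumed threshold)
 */
static StartDecision
decide_start(Lsn request_start, Lsn upstream_flush, uint64_t seg_size)
{
    assert(seg_size > 0);

    /* No gap: streaming can start right away. */
    if (request_start <= upstream_flush)
        return START_STREAMING;

    /* Small gap: the same-WAL mismatch, which never closes on its own. */
    if (request_start - upstream_flush <= seg_size)
        return WAIT_FOR_UPSTREAM;

    /* Larger gap: the upstream lags, so archive recovery should resume. */
    return FALL_BACK_TO_ARCHIVE;
}
```

With a 16MB segment size, a request at 0x3000000 against an upstream
position of 0x2FFFF00 lands in the middle case, while a request at
0x5000000 against the same upstream position falls through to the
archive fallback.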
