On Thu, Mar 19, 2026 at 5:59 PM Marco Nenciarini <[email protected]> wrote:
>
> Hi Xuneng,
>
> On Wed, Mar 19, 2026 at 10:33 AM Xuneng Zhou <[email protected]> wrote:
> >
> > I am not sure about this bound here. It seems to me that the gap could
> > be several segments due to the upstream lag.
>
> The gap is bounded to one segment, but let me explain why more
> clearly because I don't think I did a good job of it before.
>
> The mismatch exists even when both nodes have replayed exactly the
> same WAL. The upstream standby never produces WAL itself - it only
> advances via archive (or streaming, but that's exactly what we're
> trying to establish here). After both have replayed the same
> archived file, the cascade's "next position to read" ends up just
> past the upstream's "last position replayed" (which is what
> GetStandbyFlushRecPtr reports). That gap is inherently within one
> segment.
>
> This is the core of the bug: the gap never closes on its own. When
> the next WAL file arrives, both nodes restore it and advance by one
> full segment, but the same mismatch reappears. They keep advancing
> and never successfully start streaming.
>
> When the upstream is genuinely far behind (gap larger than one
> segment), the threshold correctly lets START_REPLICATION fail so the
> startup process can fall back to archive.
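To make the arithmetic behind that bound concrete, here is a rough sketch of
the caught-up case (plain Python, not the actual RequestXLogStreaming code;
the LSN value is made up, and 16MB is just the default wal_segment_size):

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # default wal_segment_size (16MB)

def next_segment_start(lsn):
    # The cascade requests streaming from the start of the segment that
    # follows the one it has fully restored from the archive.
    return (lsn // WAL_SEGMENT_SIZE + 1) * WAL_SEGMENT_SIZE

# Hypothetical position: the upstream's replayPtr sits at the last record
# inside the segment both nodes just restored.
replay_ptr = 3 * WAL_SEGMENT_SIZE + 0xABCD
startpoint = next_segment_start(replay_ptr)

gap = startpoint - replay_ptr
assert 0 < gap <= WAL_SEGMENT_SIZE  # the gap never exceeds one segment
```

As long as both nodes have restored the same archived segment, startpoint can
only be one segment boundary ahead of replayPtr, which is where the
one-segment bound comes from.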
The one-segment bound holds for the case where both nodes have replayed
exactly the same WAL: the gap comes from RequestXLogStreaming truncating
recptr to the segment boundary, so startpoint is always at the start of the
next segment while GetStandbyFlushRecPtr returns replayPtr within the current
one. I think that part of the analysis is correct.

But the gap can legitimately be multiple segments. Consider: the upstream
standby goes down (or is restarted for maintenance) while the primary keeps
generating and archiving WAL. The cascade's walreceiver loses its connection,
startup falls back to archive recovery, and restores as many segments as the
archive can supply. Meanwhile, the upstream's replayPtr is frozen at wherever
it was when the node went down. When the upstream comes back and the cascade
tries to reconnect, the gap can be many segments, bounded only by how much
WAL the primary archived while the upstream was down.

This is a normal operational scenario (the
reproducer_restart_upstream_portable.sh script exercises it), not "something
fundamentally wrong." The question is whether we should fail in this case. As
you mentioned, if this behavior is intentional, that's fine; otherwise, it
could lead to problems.

If there's consensus on this behavior and on a fix that tolerates only a
one-segment gap, the current TAP test would become non-deterministic. The
test controls the amount of WAL generated (1000 rows of integers, a few
hundred KB at most), which with the default 16MB wal_segment_size almost
certainly fits within one segment, and pg_switch_wal() ensures one segment
boundary is created and archived. So in practice, one new segment ends up in
the archive that standby_a doesn't have. But there's no explicit mechanism
guaranteeing this. Several things could cause more WAL than expected:
checkpoints running between step 1 (catchup) and step 4 (INSERT) could push
the WAL position across a segment boundary, and background WAL activity
(stats, etc.) adds volume.
Also, if the initial data (1000 rows from CREATE TABLE ... generate_series)
happened to leave the WAL position near the end of a segment, the new INSERT
could spill into the next segment.

The test also doesn't verify the gap size: it never checks what startpoint or
primaryFlushPtr actually are. It only checks the outcome, namely that the
wait event is hit (WalReceiverUpstreamCatchup) and that no "ahead of flush
position" errors appear in the log. These assertions would pass regardless of
whether the gap is one segment or five. So the test works because the
workload is small enough to make a one-segment gap the most likely outcome,
but that is not guaranteed.

If a threshold at wal_segment_size were added to the patch, there would be no
test coverage for the multi-segment case, and no guarantee that the existing
test wouldn't accidentally exercise it on systems with smaller
wal_segment_size or heavier background WAL activity.

> > Even with only a one-segment gap, if the upstream server's flush LSN
> > does not advance, we would remain stuck polling indefinitely.
>
> True, but this is consistent with how the walreceiver already behaves
> when it is connected and streaming: if the upstream stops producing
> WAL, the walreceiver just sits there waiting on the connection
> indefinitely. So the polling behavior here is no worse than what
> already happens in normal operation.

I think the difference is that during normal streaming, wal_receiver_timeout
will eventually fire and kill the connection, whereas the catch-up polling
loop has no such timeout.

--
Best,
Xuneng
