On Fri, Mar 20, 2026 at 8:52 AM Xuneng Zhou <[email protected]> wrote:
>
> On Thu, Mar 19, 2026 at 5:59 PM Marco Nenciarini
> <[email protected]> wrote:
> >
> > Hi Xuneng,
> >
> > On Wed, Mar 19, 2026 at 10:33 AM Xuneng Zhou <[email protected]> wrote:
> > >
> > > I am not sure about this bound here. It seems to me that the gap could
> > > be several segments due to the upstream lag.
> >
> > The gap is bounded to one segment, but let me explain why more
> > clearly, because I don't think I did a good job of it before.
> >
> > The mismatch exists even when both nodes have replayed exactly the
> > same WAL. The upstream standby never produces WAL itself - it only
> > advances via archive (or streaming, but that's exactly what we're
> > trying to establish here). After both have replayed the same
> > archived file, the cascade's "next position to read" ends up just
> > past the upstream's "last position replayed" (which is what
> > GetStandbyFlushRecPtr reports). That gap is inherently within one
> > segment.
> >
> > This is the core of the bug: the gap never closes on its own. When
> > the next WAL file arrives, both nodes restore it and advance by one
> > full segment, but the same mismatch reappears. They keep advancing
> > and never successfully start streaming.
> >
> > When the upstream is genuinely far behind (a gap larger than one
> > segment), the threshold correctly lets START_REPLICATION fail so the
> > startup process can fall back to archive.
>
> The one-segment bound holds for the case where both nodes have
> replayed exactly the same WAL — the gap comes from
> RequestXLogStreaming truncating recptr to the segment boundary, so
> startpoint is always at the start of the next segment while
> GetStandbyFlushRecPtr returns replayPtr within the current one. I
> think that part of the analysis is correct.
> > After both have replayed the same
> > archived file, the cascade's "next position to read" ends up just
> > past the upstream's "last position replayed" (which is what
> > GetStandbyFlushRecPtr reports).

After taking a closer look, I'm less certain about this, and I'll
investigate further. Could you also explain why you think this is the
case?

> But the gap can legitimately be multiple segments. Consider: the
> upstream standby goes down (or is restarted for maintenance) while the
> primary keeps generating and archiving WAL. The cascade's walreceiver
> loses its connection, startup falls back to archive recovery, and
> restores as many segments as the archive can supply. Meanwhile, the
> upstream's replayPtr is frozen at wherever it was when it went down.
>
> When the upstream comes back and the cascade tries to reconnect, the
> gap can be many segments — bounded only by how much WAL the primary
> archived while the upstream was down. This is a normal operational
> scenario (the reproducer_restart_upstream_portable.sh script exercises
> this), not "something fundamentally wrong."
>
> The question is whether we should fail in this case. As you mentioned,
> if this behavior is intentional, that's fine. Otherwise, it could lead
> to problems.
>
> If there's consensus on this and on the one-segment-gap fix, the
> current TAP test would become non-deterministic.
>
> The test controls the amount of WAL generated (1000 rows of integers —
> a few hundred KB at most), which with the default 16MB
> wal_segment_size almost certainly fits within one segment. And
> pg_switch_wal() ensures one segment boundary is created and archived.
> So in practice, one new segment ends up in the archive that standby_a
> doesn't have.
>
> But there's no explicit mechanism to guarantee this. Several things
> could cause more WAL than expected:
>
> Checkpoints running between step 1 (catchup) and step 4 (INSERT) could
> push the WAL position across a segment boundary.
> Background WAL activity (stats, etc.) adds volume.
>
> If the initial data (1000 rows from CREATE TABLE ... generate_series)
> happened to leave the WAL position near the end of a segment, the new
> INSERT could spill into the next segment.
>
> The test also doesn't verify the gap size — it never checks what
> startpoint or primaryFlushPtr actually are. It only checks the
> outcome: the wait event is hit (WalReceiverUpstreamCatchup) and no
> "ahead of flush position" errors appear in the log. These assertions
> would pass regardless of whether the gap is one segment or five.
>
> So the test works because the workload is small enough to make a
> one-segment gap the most likely outcome, but it's not guaranteed. If a
> threshold at wal_segment_size were added to the patch, there would be
> no test coverage for the multi-segment case — and no guarantee that the
> existing test wouldn't accidentally exercise it on systems with a
> smaller wal_segment_size or heavier background WAL activity.
>
> > > Even with only a one-segment gap, if the upstream server's flush LSN
> > > does not advance, we would remain stuck polling indefinitely.
> >
> > True, but this is consistent with how the walreceiver already behaves
> > when it is connected and streaming: if the upstream stops producing
> > WAL, the walreceiver just sits there waiting on the connection
> > indefinitely. So the polling behavior here is no worse than what
> > already happens in normal operation.
>
> I think the difference is that during normal streaming,
> wal_receiver_timeout will eventually fire and kill the connection,
> whereas the catch-up polling loop has no such timeout.
>
> --
> Best,
> Xuneng

--
Best,
Xuneng
