On Tue, Apr 7, 2026 at 3:56 PM Ashutosh Sharma <[email protected]> wrote:
>
> Hi,
>
> On Tue, Apr 7, 2026 at 11:20 AM Ashutosh Sharma <[email protected]> wrote:
> >
> > Hi,
> >
> > On Tue, Apr 7, 2026 at 9:04 AM shveta malik <[email protected]> wrote:
> > >
> > >
> > > I see your point. I agree that using wal_receiver_status_interval for
> > > this test may not be a reliable way. Can we attempt using
> > > pg_wal_replay_pause() on standby and then checking
> > > wait_event=WaitForStandbyConfirmation with backend_type=walsender on
> > > primary? Or do you see any issues in this approach that I might be
> > > overlooking?
> > >
> >
> > Yes, I think we can make use of the WAL replay pause/resume mechanism.
> > This seems like the right approach, as it gives us a more controlled
> > and deterministic way to validate the lagging behavior.
> >
>
> Looking at 049_wait_for_lsn.pl (the test case you referenced), it
> explicitly stops the WAL receiver by setting primary_conninfo to an
> empty string, rather than just pausing WAL replay.

Oh, I missed it in that testcase. Setting primary_conninfo to an empty
string essentially means not starting the walreceiver and thus making
the standby slot inactive, for which we already have a testcase.

> Using
> pg_wal_replay_pause() alone only halts replay; the WAL receiver
> continues running, keeps receiving WAL, and sends feedback/status to
> the primary. That feedback is sufficient to advance restart_lsn on the
> standby’s slot, which would violate the restart_lsn < wait_for_lsn
> condition inside StandbySlotsHaveCaughtup(), which is not what we
> want.

Yes, I see. IIUC, the same problem will be there if we use
recovery_min_apply_delay, i.e., WAL will be received and flushed, and
feedback will be sent to the primary; only replay will be delayed. We
can use 'synchronous_commit = remote_apply' along with
'recovery_min_apply_delay', but then logical replication would be
delayed because the transaction commit is blocked, not because the
standby is actually lagging. That would not be a suitable test for
'synchronized_standby_slots'.

>
> This leads to the question: can we construct a realistic test case
> where a failover standby remains active (WAL receiver running) while
> its restart_lsn is still genuinely lagging and consistently so? This
> likely needs further exploration.
>

I have no more ideas here. We can get rid of the lagging testcase. But
let's wait for a day to see if Hou-San has any further ideas on how to
write a deterministic testcase here.

thanks
Shveta
