Re: synchronized_standby_slots behavior inconsistent with quorum-based synchronous replication

Amit Kapila Tue, 17 Mar 2026 02:26:23 -0700

On Mon, Mar 16, 2026 at 2:33 PM Ashutosh Sharma <[email protected]> wrote:
>
> On Mon, Mar 16, 2026 at 12:00 PM shveta malik <[email protected]> wrote:
> >
> > On Fri, Mar 13, 2026 at 3:39 PM Ashutosh Sharma <[email protected]> 
> > wrote:
> > >
> > > Hi,
> > >
> > > Thanks for the review, please find comments inline:
> > >
> > > On Fri, Mar 13, 2026 at 9:53 AM shveta malik <[email protected]> 
> > > wrote:
> > > >
> > > >
> > > > 1)
> > > > StandbySlotsHaveCaughtup():
> > > >
> > > > + /* If no specific slot state issues but requirement not met (slots
> > > > still lagging) */
> > > > + if (num_slot_states == 0)
> > > > + {
> > > > + ereport(elevel,
> > > > + errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> > > > + errmsg("waiting for physical standbys corresponding to synchronized
> > > > standby slots, some slots have not caught up"));
> > > > + }
> > > >
> > > > Above means, when slots are lagging (not invalid), we will emit
> > > > WARNING or even error out in some cases (based on elevel). IIUC, this
> > > > behaviour is not the same as base code. Instead of ERROR, it used to
> > > > wait for lagging slots. Is this change intentional?
> > > >
> > >
> > > Yes it was added intentionally, but the amount of information it was
> > > emitting wasn't very much useful. It was just saying "some slots have
> > > not caught up" which is already known by the fact that we are waiting.
> > > It doesn't say which slots are lagging or by how much, so it was not
> > > outputting any actionable information.
> > >
> > > To make it more useful, this is how it has been changed now:
> > >
> > > 1) Add a SS_SLOT_LAGGING state to the enum to track slots that were
> > > healthy but behind
> > > 2) Add a restart_lsn field to SyncStandbySlotsStateInfo to record how
> > > far behind each lagging slot is
> > > 3) Record lagging slots in the scanning loop (currently they're just
> > > skipped/broken out of without being saved)
> > > 4) Pass wait_for_lsn to ReportUnavailableSyncStandbySlots so it can
> > > report the gap
> > >
> > > With this change, we would see something like:
> > >
> > > "replication slot "slot1" has not caught up, the slot's restart_lsn
> > > 0/1234 is behind the required 0/5678"
> > >
> > > This looks more useful and actionable to me.
> > >
> > > AFA elevel is concerned, it's mostly WARNING except when the server is
> > > being stopped or when the cascaded standby is being promoted. So in
> > > normal circumstances all these messages will always be reported as
> > > WARNING, but still if you see any issues or you want the lagging slot
> > > to be reported as WARNING always (even when the elevel is WARNING) do
> > > let me know.
> >
> > I modified the patch to emit WARNING instead of ERROR on lagging case.
> > I tested it for the case where standby is lagging, shutdown of primary
> > waits for standby to catch up:
> >
> > --------------------
> > WARNING:  replication slot "standby_1" specified in parameter
> > "synchronized_standby_slots" has not caught up
> > DETAIL:  The slot's restart_lsn 0/26D1D250 is behind the required 
> > 0/26D1D2C0.
> > LOG:  received fast shutdown request
> > LOG:  aborting any active transactions
> > FATAL:  terminating connection due to administrator command
> > LOG:  background worker "logical replication launcher" (PID 144863)
> > exited with exit code 1
> > LOG:  shutting down
> > WARNING:  replication slot "standby_1" specified in parameter
> > "synchronized_standby_slots" has not caught up
> > DETAIL:  The slot's restart_lsn 0/26D1D250 is behind the required 
> > 0/26D1D2C0.
> > WARNING:  replication slot "standby_1" specified in parameter
> > "synchronized_standby_slots" has not caught up
> > DETAIL:  The slot's restart_lsn 0/26D1D250 is behind the required 
> > 0/26D1D2C0.
> > LOG:  released logical replication slot "sub1"
> > ----------------------
> >
> > This is how it is supposed to be as we want standby to catch-up before
> > the walsender exits on shutdown. Please see below code in
> > WalSndWaitForWal():
> >
> > /*
> >  * If postmaster asked us to stop and the standby slots have caught up
> >  * to the flushed position, don't wait anymore.
> >  *
> >  * It's important to do this check after the recomputation of
> >  * RecentFlushPtr, so we can send all remaining data before shutting
> >  * down.
> >  */
> > if (got_STOPPING)
> > {
> >   if (NeedToWaitForStandbys(RecentFlushPtr, &wait_event))
> >           wait_for_standby_at_stop = true;
> >   else
> >           break;
> > }
> >
> >
> > But with the new change, walsender will ERROR out if standby is
> > lagging during shutdown. So we should avoid that. We can even skip
> > that complete section (case SS_SLOT_LAGGING in
> > ReportUnavailableSyncStandbySlots). But if we really want to have some
> > information for the user, we can make it DEBUG rather than Warning.
> > IMO, warning sounds alarming for the case where the user can not do
> > much. But let's see what others have to say on this.
> >
>
> Yes, we should avoid using ERROR in this case. DEBUG looks like a
> better choice to me, as WARNING can be somewhat noisy at times and
> this appears to be more like debugging information. However, as you
> suggested, we can wait and see what others think.
>


+1 for elevel as DEBUG1 in this case. BTW, IIRC, we need to wait here
because otherwise standby's may miss some information that is sent to
logical subscribers. This can happen because the walsender wait for
all WAL records to be sent to logical subscribers before shutdown (as
mentioned in the top of file in para: "If the server is shut down ..."
and the code in WalSndDone() suggests the same).

-- 
With Regards,
Amit Kapila.

Re: synchronized_standby_slots behavior inconsistent with quorum-based synchronous replication

Reply via email to