Re: synchronized_standby_slots behavior inconsistent with quorum-based synchronous replication

Ashutosh Sharma Mon, 16 Mar 2026 02:03:51 -0700

Hi,

On Mon, Mar 16, 2026 at 12:00 PM shveta malik <[email protected]> wrote:
>
> On Fri, Mar 13, 2026 at 3:39 PM Ashutosh Sharma <[email protected]> wrote:
> >
> > Hi,
> >
> > Thanks for the review, please find comments inline:
> >
> > On Fri, Mar 13, 2026 at 9:53 AM shveta malik <[email protected]> wrote:
> > >
> > >
> > > 1)
> > > StandbySlotsHaveCaughtup():
> > >
> > > + /* If no specific slot state issues but requirement not met (slots
> > > still lagging) */
> > > + if (num_slot_states == 0)
> > > + {
> > > + ereport(elevel,
> > > + errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> > > + errmsg("waiting for physical standbys corresponding to synchronized
> > > standby slots, some slots have not caught up"));
> > > + }
> > >
> > > Above means, when slots are lagging (not invalid), we will emit
> > > WARNING or even error out in some cases (based on elevel). IIUC, this
> > > behaviour is not the same as base code. Instead of ERROR, it used to
> > > wait for lagging slots. Is this change intentional?
> > >
> >
> > Yes it was added intentionally, but the amount of information it was
> > emitting wasn't very much useful. It was just saying "some slots have
> > not caught up" which is already known by the fact that we are waiting.
> > It doesn't say which slots are lagging or by how much, so it was not
> > outputting any actionable information.
> >
> > To make it more useful, this is how it has been changed now:
> >
> > 1) Add a SS_SLOT_LAGGING state to the enum to track slots that were
> > healthy but behind
> > 2) Add a restart_lsn field to SyncStandbySlotsStateInfo to record how
> > far behind each lagging slot is
> > 3) Record lagging slots in the scanning loop (currently they're just
> > skipped/broken out of without being saved)
> > 4) Pass wait_for_lsn to ReportUnavailableSyncStandbySlots so it can
> > report the gap
> >
> > With this change, we would see something like:
> >
> > "replication slot "slot1" has not caught up, the slot's restart_lsn
> > 0/1234 is behind the required 0/5678"
> >
> > This looks more useful and actionable to me.
> >
> > AFA elevel is concerned, it's mostly WARNING except when the server is
> > being stopped or when the cascaded standby is being promoted. So in
> > normal circumstances all these messages will always be reported as
> > WARNING, but still if you see any issues or you want the lagging slot
> > to be reported as WARNING always (even when the elevel is WARNING) do
> > let me know.
>
> I modified the patch to emit WARNING instead of ERROR on lagging case.
> I tested it for the case where standby is lagging, shutdown of primary
> waits for standby to catch up:
>
> --------------------
> WARNING:  replication slot "standby_1" specified in parameter
> "synchronized_standby_slots" has not caught up
> DETAIL:  The slot's restart_lsn 0/26D1D250 is behind the required 0/26D1D2C0.
> LOG:  received fast shutdown request
> LOG:  aborting any active transactions
> FATAL:  terminating connection due to administrator command
> LOG:  background worker "logical replication launcher" (PID 144863)
> exited with exit code 1
> LOG:  shutting down
> WARNING:  replication slot "standby_1" specified in parameter
> "synchronized_standby_slots" has not caught up
> DETAIL:  The slot's restart_lsn 0/26D1D250 is behind the required 0/26D1D2C0.
> WARNING:  replication slot "standby_1" specified in parameter
> "synchronized_standby_slots" has not caught up
> DETAIL:  The slot's restart_lsn 0/26D1D250 is behind the required 0/26D1D2C0.
> LOG:  released logical replication slot "sub1"
> ----------------------
>
> This is how it is supposed to be as we want standby to catch-up before
> the walsender exits on shutdown. Please see below code in
> WalSndWaitForWal():
>
> /*
>  * If postmaster asked us to stop and the standby slots have caught up
>  * to the flushed position, don't wait anymore.
>  *
>  * It's important to do this check after the recomputation of
>  * RecentFlushPtr, so we can send all remaining data before shutting
>  * down.
>  */
> if (got_STOPPING)
> {
>   if (NeedToWaitForStandbys(RecentFlushPtr, &wait_event))
>           wait_for_standby_at_stop = true;
>   else
>           break;
> }
>
>
> But with the new change, walsender will ERROR out if standby is
> lagging during shutdown. So we should avoid that. We can even skip
> that complete section (case SS_SLOT_LAGGING in
> ReportUnavailableSyncStandbySlots). But if we really want to have some
> information for the user, we can make it DEBUG rather than Warning.
> IMO, warning sounds alarming for the case where the user can not do
> much. But let's see what others have to say on this.
>


Yes, we should avoid using ERROR in this case. DEBUG looks like a
better choice to me, as WARNING can be somewhat noisy at times and
this appears to be more like debugging information. However, as you
suggested, we can wait and see what others think.

> > > 2)
> > > Do you think we can move the reporting part (the below for-loop) to a
> > > new function (may be report_unavailable_sync_standby_slots or anything
> > > better) to keep the caught-up logic clean?
> > > + for (int i = 0; i < num_slot_states; i++)
> > >
> >
> > Okay, moved the entire reporting logic to
> > ReportUnavailableSyncStandbySlots(). This function name pattern was
> > chosen considering the name of other functions in the file.
>
> Thanks, it has better readability now.
>
> >
> > > 3)
> > >
> > > + /*Record Slot State */
> > > Space needed before Record.
> > >
> >
> > Added missing space.
> >
> > > 4)
> > > + SS_SLOT_OK, /* slot is valid and can participate */
> > > Do we need this value, we are not using it anywhere.
> > >
> >
> > This was added just to make the enum look a little self documenting,
> > but it's true that it's not being used anywhere hence I removed it.
> >
> > > 5)
> > > check_synchronized_standby_slots():
> > > + /* Reject num_sync > nmembers — it can never be satisfied */
> > > + if (syncrep_parse_result->num_sync > syncrep_parse_result->nmembers)
> > > + {
> > > + GUC_check_errmsg("number of synchronized standby slots (%d) must not
> > > exceed the number of listed slots (%d)",
> > > + syncrep_parse_result->num_sync,
> > > + syncrep_parse_result->nmembers);
> > > + return false;
> > > + }
> > >
> > > Do we need this? Is it ever a possibility? Even if it happens, IIUC,
> > > it will be our internal parser logic issue and not the user's GUC
> > > configuration fault. So do we mean it to be user facing GUC error?
> > >
> >
> > We need this to catch invalid settings like 'FIRST 5 (s1, s2, s3);'.
> > AFAICS this is actually a missing part in the check hook for
> > "synchronous_standby_names". It should have been present here as well.
>
> Okay, I see this is for a wrong configuration.
>
> > > 6)
> > > If we plan to retain above, we shall move it before the previous 'if' 
> > > block:
> > >
> > > + if (syncrep_parse_result->num_sync == 1 &&
> > > + syncrep_parse_result->syncrep_method == SYNC_REP_PRIORITY &&
> > > + syncrep_parse_result->nmembers > 1)
> > >
> >
> > I don't see any benefit in moving it upward, but it doesn't have any
> > functional harm as well, anyways I have just moved it up to the place
> > you suggested.
>
> Please do as you feel fit. This comment was when I did not not have a
> valid scenario in mind for the concerned case.
>
> > PFA patch with all the changes mentioned above and let me know if you
> > have any further comments.
> >
>
> Will review the rest of the changes.
>

Sure, thanks.

--
With Regards,
Ashutosh Sharma.

Re: synchronized_standby_slots behavior inconsistent with quorum-based synchronous replication

Reply via email to