On Mon, Mar 23, 2026 at 9:51 AM surya poondla <[email protected]> wrote: > > Hi All, > > Thank you for reporting a real gap and building this feature to address it. > Very nice points were discussed in this thread. > > I reviewed the v20260318 patch and some comments. > > Documentation comments: > 1. FIRST mode does not specify what happens when valid slots < N > "If a slot is missing, logical, invalidated, or inactive, it will be skipped. > However, if a slot exists and is valid and active but has not yet caught up, > the system will wait for it rather than skipping to lower-priority slots." > This paragraph explains the skip/wait distinction clearly, but doesn't > clearly address what happens when, after skipping all > missing/invalid/inactive/logical slots, the number of remaining valid slots > is less than num_sync? > > For example, with FIRST 2 (sb1_slot, sb2_slot, sb3_slot): if sb1_slot and > sb2_slot are both invalidated and only sb3_slot is valid but lagging FIRST 2 > requires two slots, but only one candidate remains. > > Looking at the code in StandbySlotsHaveCaughtup(), when syncrep_method == > SYNC_REP_PRIORITY and a slot is lagging, the code does: > if (wait_for_all || synchronized_standby_slots_config->syncrep_method == > SYNC_REP_PRIORITY) > break; > > So the function breaks out of the loop and returns false. This is the correct > behavior, but it is not stated anywhere in the documentation. A user > encountering this scenario will not know whether to expect a wait or an > error. The documentation should state explicitly that in FIRST mode, if fewer > valid slots than num_sync are available, logical decoding waits indefinitely. > > 2. "Missing, logical, invalidated, or inactive slots are skipped when > determining candidates, and lagging slots simply do not count toward the > required number until they catch up" > This is correct for the case where some slots are skipped and others have > caught up. But it does not address the case where all listed slots are > lagging and every slot is healthy and connected, but none have reached > wait_for_lsn yet. In that situation, the code records each slot as > SS_SLOT_LAGGING, does goto next_slot for each (because syncrep_method == > SYNC_REP_QUORUM), and returns false because caught_up_slot_num < required. > Logical decoding waits. > > You can append the following sentence to the above documentation paragraph > "If fewer than num_sync slots have caught up at a given moment, logical > decoding waits until that threshold is reached." > > > Test comments: > 1. "PART D: Verify FIRST N priority semantics. # 3. Wait for valid but > lagging slots (not skip to lower priority)" > The test implements this by calling $standby1->stop. A stopped standby has no > active_pid, so the slot is classified as SS_SLOT_INACTIVE, not > SS_SLOT_LAGGING. > SS_SLOT_LAGGING means it is connected and streaming but restart_lsn < > wait_for_lsn. > > As Hou previously mentioned, recovery_min_apply_delay on standby1 would be > one way to keep it connected while forcing its WAL application to lag, > exercising the SS_SLOT_LAGGING code path directly. It is worth adding a test > that covers this, both for FIRST (to confirm it blocks) and for ANY (to > confirm it does not). > > Overall a great patch. >
Thank you - Hou, Surya and Dilip for your feedback. I will address all your comments in the next patch. -- With Regards, Ashutosh Sharma.
