Re: How can end users know the cause of LR slot sync delays?

Amit Kapila Sat, 06 Sep 2025 06:41:47 -0700

On Fri, Sep 5, 2025 at 12:50 PM Ashutosh Sharma <[email protected]> wrote:
>
> Good to hear that you’re also interested in working on this task.
>
> On Thu, Sep 4, 2025 at 8:26 PM Shlok Kyal <[email protected]> wrote:
>>
>> Hi Ashutosh,
>>
>> I am also interested in this thread and have been working on a patch for it.
>>
>> On Wed, 3 Sept 2025 at 17:52, Ashutosh Sharma <[email protected]> wrote:
>> >
>> > Hi Amit,
>> >
>> > On Thu, Aug 28, 2025 at 3:26 PM Amit Kapila <[email protected]> wrote:
>> >>
>> >> On Thu, Aug 28, 2025 at 11:07 AM Ashutosh Sharma <[email protected]> wrote:
>> >> >
>> >> > We have seen cases where slot synchronization gets delayed, for example 
>> >> > when the slot is behind the failover standby or vice versa, and the 
>> >> > slot sync worker has to wait for one to catch up with the other. During 
>> >> > this waiting period, users querying pg_replication_slots can only see 
>> >> > whether the slot has been synchronized or not. If it has already 
>> >> > synchronized, that’s fine, but if synchronization is taking longer, 
>> >> > users would naturally want to understand the reason for the delay.
>> >> >
>> >> > Is there a way for end users to know the cause of slot synchronization 
>> >> > delays, so they can take appropriate actions to speed it up?
>> >> >
>> >> > I understand that server logs are emitted in such cases, but logs are 
>> >> > not something end users would want to check regularly. Moreover, since 
>> >> > logging is configuration-based, relevant messages may sometimes be 
>> >> > skipped or suppressed.
>> >> >
>> >>
>> >> Currently, the way to see the reason for a sync skip is the server
>> >> log, but I think it would be better to add a new column, such as
>> >> sync_skip_reason, to pg_replication_slots. It could show reasons like
>> >> standby_LSN_ahead_remote_LSN. Ideally, users could then compare the
>> >> standby's slot LSN/XMIN with that of the remote slot being synced. Do
>> >> you have any better ideas?
>> >>
>> >
>> > I have similar thoughts, but for clarity, I’d like to outline some of the 
>> > key steps I plan to take:
>> >
>> > Step 1: Define an enum for all the possible reasons slot persistence
>> > can be skipped.
>> >
>> > /*
>> >  * Reasons why a replication slot sync was skipped.
>> >  */
>> > typedef enum ReplicationSlotSyncSkipReason
>> > {
>> >     RS_SYNC_SKIP_NONE = 0,                 /* No skip */
>> >
>> >     /* Remote slot is behind local reserved LSN */
>> >     RS_SYNC_SKIP_REMOTE_BEHIND = (1 << 0),
>> >
>> >     /* Local slot ahead of remote, risk of data loss */
>> >     RS_SYNC_SKIP_DATA_LOSS = (1 << 1),
>> >
>> >     /* Standby could not build a consistent snapshot */
>> >     RS_SYNC_SKIP_NO_SNAPSHOT = (1 << 2)
>> > } ReplicationSlotSyncSkipReason;
>> >
>> > --
>> >
>> I think we should also add the case when "remote_slot->confirmed_lsn >
>> latestFlushPtr" (WAL corresponding to the confirmed lsn on remote slot
>> is still not flushed on the Standby). In this case as well we are
>> skipping the slot sync.
>
>
> Yes, we can include this case as well.
>
>>
>>
>> > Step 2: Introduce a new column in pg_replication_slots to store the
>> > skip reason.
>> >
>> > /* Inside the pg_replication_slots view */
>> > ReplicationSlotSyncSkipReason slot_sync_skip_reason;
>> >
>> > --
>> >
>> As per the discussion [1], I think this is more statistics-related data,
>> so we should add it to the pg_stat_replication_slots view. We could also
>> add columns for 'slot sync skip count' and 'last slot sync skip'.
>> Thoughts?
>
>
> It’s not a bad choice, but what makes it a bit confusing for me is that some 
> of the slot sync information is stored in pg_replication_slots, while some is 
> in pg_stat_replication_slots.
>


How about keeping sync_skip_reason in pg_replication_slots and
sync_skip_count in pg_stat_replication_slots?
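
To make the split concrete, here is a minimal, standalone sketch in C. It
is only an illustration under stated assumptions, not a patch: every type,
function, and string below (SlotSyncSkipReason, SlotSyncState,
SlotSyncStats, record_sync_attempt, and the reason names) is hypothetical
and does not exist in PostgreSQL. It uses plain enum values rather than the
bit flags proposed upthread, on the assumption that only one reason is
reported at a time, and it includes the "WAL not yet flushed on the
standby" case raised upthread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical skip reasons; illustrative names only. */
typedef enum SlotSyncSkipReason
{
    SYNC_SKIP_NONE = 0,         /* last sync attempt was not skipped */
    SYNC_SKIP_REMOTE_BEHIND,    /* remote slot is behind the local slot */
    SYNC_SKIP_LOCAL_AHEAD,      /* local slot is ahead of the remote slot */
    SYNC_SKIP_NO_SNAPSHOT,      /* standby could not build a consistent snapshot */
    SYNC_SKIP_WAL_NOT_FLUSHED   /* remote confirmed LSN not yet flushed locally */
} SlotSyncSkipReason;

/* Current state: what a sync_skip_reason column would expose. */
typedef struct SlotSyncState
{
    SlotSyncSkipReason skip_reason;
} SlotSyncState;

/* Cumulative statistics: what a sync_skip_count column would expose. */
typedef struct SlotSyncStats
{
    uint64_t skip_count;
} SlotSyncStats;

/* Map a reason to the (hypothetical) text a view column could show. */
const char *
sync_skip_reason_name(SlotSyncSkipReason reason)
{
    switch (reason)
    {
        case SYNC_SKIP_NONE:            return "none";
        case SYNC_SKIP_REMOTE_BEHIND:   return "remote_behind";
        case SYNC_SKIP_LOCAL_AHEAD:     return "local_ahead";
        case SYNC_SKIP_NO_SNAPSHOT:     return "no_consistent_snapshot";
        case SYNC_SKIP_WAL_NOT_FLUSHED: return "wal_not_flushed";
    }
    return "unknown";
}

/*
 * Record the outcome of one sync cycle. The reason is overwritten on every
 * cycle, so it always describes the latest attempt; the counter only grows.
 */
void
record_sync_attempt(SlotSyncState *state, SlotSyncStats *stats,
                    SlotSyncSkipReason reason)
{
    assert(state != NULL && stats != NULL);

    state->skip_reason = reason;
    if (reason != SYNC_SKIP_NONE)
        stats->skip_count++;
}
```

With this shape, a successful cycle resets the reason to "none" while the
count keeps the history, which is why the two values seem to fit different
views.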

-- 
With Regards,
Amit Kapila.
