Hi Amit,

On Sat, Sep 6, 2025 at 10:46 AM Amit Kapila <[email protected]> wrote:
>
> On Fri, Sep 5, 2025 at 12:50 PM Ashutosh Sharma <[email protected]> wrote:
> >
> > Good to hear that you’re also interested in working on this task.
> >
> > On Thu, Sep 4, 2025 at 8:26 PM Shlok Kyal <[email protected]> wrote:
> >>
> >> Hi Ashutosh,
> >>
> >> I am also interested in this thread. And was working on a patch for it.
> >>
> >> On Wed, 3 Sept 2025 at 17:52, Ashutosh Sharma <[email protected]> wrote:
> >> >
> >> > Hi Amit,
> >> >
> >> > On Thu, Aug 28, 2025 at 3:26 PM Amit Kapila <[email protected]> wrote:
> >> >>
> >> >> On Thu, Aug 28, 2025 at 11:07 AM Ashutosh Sharma <[email protected]> wrote:
> >> >> >
> >> >> > We have seen cases where slot synchronization gets delayed, for 
> >> >> > example when the slot is behind the failover standby or vice versa, 
> >> >> > and the slot sync worker has to wait for one to catch up with the 
> >> >> > other. During this waiting period, users querying 
> >> >> > pg_replication_slots can only see whether the slot has been 
> >> >> > synchronized or not. If it has already synchronized, that’s fine, but 
> >> >> > if synchronization is taking longer, users would naturally want to 
> >> >> > understand the reason for the delay.
> >> >> >
> >> >> > Is there a way for end users to know the cause of slot 
> >> >> > synchronization delays, so they can take appropriate actions to speed 
> >> >> > it up?
> >> >> >
> >> >> > I understand that server logs are emitted in such cases, but logs are 
> >> >> > not something end users would want to check regularly. Moreover, 
> >> >> > since logging is configuration-based, relevant messages may sometimes 
> >> >> > be skipped or suppressed.
> >> >> >
> >> >>
> >> >> Currently, the way to see the reason for sync skip is LOGs but I think
> >> >> it is better to add a new column like sync_skip_reason in
> >> >> pg_replication_slots. This can show the reasons like
> >> >> standby_LSN_ahead_remote_LSN. I think ideally users can compare
> >> >> standby's slot LSN/XMIN with remote_slot being synced. Do you have any
> >> >> better ideas?
> >> >>
> >> >
> >> > I have similar thoughts, but for clarity, I’d like to outline some of 
> >> > the key steps I plan to take:
> >> >
> >> > Step 1: Define an enum for all possible reasons a slot persistence was 
> >> > skipped.
> >> >
> >> > /*
> >> >  * Reasons why a replication slot sync was skipped.
> >> >  */
> >> > typedef enum ReplicationSlotSyncSkipReason
> >> > {
> >> >     /* No skip */
> >> >     RS_SYNC_SKIP_NONE = 0,
> >> >
> >> >     /* Remote slot is behind local reserved LSN */
> >> >     RS_SYNC_SKIP_REMOTE_BEHIND = (1 << 0),
> >> >
> >> >     /* Local slot ahead of remote, risk of data loss */
> >> >     RS_SYNC_SKIP_DATA_LOSS = (1 << 1),
> >> >
> >> >     /* Standby could not build a consistent snapshot */
> >> >     RS_SYNC_SKIP_NO_SNAPSHOT = (1 << 2)
> >> > } ReplicationSlotSyncSkipReason;
> >> >
> >> > --
> >> >
> >> I think we should also add the case when "remote_slot->confirmed_lsn >
> >> latestFlushPtr" (WAL corresponding to the confirmed lsn on remote slot
> >> is still not flushed on the Standby). In this case as well we are
> >> skipping the slot sync.
> >
> >
> > Yes, we can include this case as well.
> >
> >>
> >>
> >> > Step 2: Introduce new column to pg_replication_slots to store the skip 
> >> > reason
> >> >
> >> > /* Inside pg_replication_slots table */
> >> > ReplicationSlotSyncSkipReason slot_sync_skip_reason;
> >> >
> >> > --
> >> >
> >> As per the discussion [1], I think it is more of stat related data and
> >> we should add it in the pg_stat_replication_slots view. Also we can
> >> add columns for 'slot sync skip count' and 'last slot sync skip'.
> >> Thoughts?
> >
> >
> > It’s not a bad choice, but what makes it a bit confusing for me is that 
> > some of the slot sync information is stored in pg_replication_slots, while 
> > some is in pg_stat_replication_slots.
> >
>
> How about keeping sync_skip_reason in pg_replication_slots and
> sync_skip_count in pg_stat_replication_slots?
>

I think we can do that, since sync_skip_reason is descriptive
metadata rather than statistical data like slot_sync_skip_count and
last_slot_sync_skip. However, it's also true that all three pieces of
data are transient by nature: they exist only at runtime.
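
To make the split concrete, here is a rough standalone sketch, not
actual patch code. It assumes the enum proposed earlier in this thread,
plus a hypothetical RS_SYNC_SKIP_WAL_NOT_FLUSHED value for the case
Shlok mentioned. The SlotSyncSkipInfo struct, both helper functions,
and the text labels are made up purely for illustration.

```c
#include <stdint.h>

/* Reasons why a replication slot sync was skipped (from this thread). */
typedef enum ReplicationSlotSyncSkipReason
{
    RS_SYNC_SKIP_NONE = 0,
    RS_SYNC_SKIP_REMOTE_BEHIND = (1 << 0),
    RS_SYNC_SKIP_DATA_LOSS = (1 << 1),
    RS_SYNC_SKIP_NO_SNAPSHOT = (1 << 2),
    /* Hypothetical: remote confirmed LSN not yet flushed on the standby */
    RS_SYNC_SKIP_WAL_NOT_FLUSHED = (1 << 3)
} ReplicationSlotSyncSkipReason;

/* Hypothetical container, only to show which view each field would feed. */
typedef struct SlotSyncSkipInfo
{
    ReplicationSlotSyncSkipReason sync_skip_reason; /* pg_replication_slots */
    uint64_t    sync_skip_count;    /* pg_stat_replication_slots */
} SlotSyncSkipInfo;

/* Map a skip reason to the text a view column could display. */
static const char *
sync_skip_reason_name(ReplicationSlotSyncSkipReason reason)
{
    switch (reason)
    {
        case RS_SYNC_SKIP_NONE:
            return "none";
        case RS_SYNC_SKIP_REMOTE_BEHIND:
            return "remote_behind";
        case RS_SYNC_SKIP_DATA_LOSS:
            return "data_loss";
        case RS_SYNC_SKIP_NO_SNAPSHOT:
            return "no_snapshot";
        case RS_SYNC_SKIP_WAL_NOT_FLUSHED:
            return "wal_not_flushed";
    }
    return "unknown";
}

/*
 * Record the outcome of one sync attempt: always remember the latest
 * reason, but bump the counter only when the sync was actually skipped.
 */
static void
record_sync_skip(SlotSyncSkipInfo *info, ReplicationSlotSyncSkipReason reason)
{
    info->sync_skip_reason = reason;
    if (reason != RS_SYNC_SKIP_NONE)
        info->sync_skip_count++;
}
```

In this sketch a successful sync resets the reason to "none" while the
count keeps accumulating, which is roughly the state-versus-statistics
distinction described above.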

--
With Regards,
Ashutosh Sharma.

