Hi Amit,

On Sat, Sep 6, 2025 at 10:46 AM Amit Kapila <[email protected]> wrote:
>
> On Fri, Sep 5, 2025 at 12:50 PM Ashutosh Sharma <[email protected]> wrote:
> >
> > Good to hear that you’re also interested in working on this task.
> >
> > On Thu, Sep 4, 2025 at 8:26 PM Shlok Kyal <[email protected]> wrote:
> >>
> >> Hi Ashutosh,
> >>
> >> I am also interested in this thread. And was working on a patch for it.
> >>
> >> On Wed, 3 Sept 2025 at 17:52, Ashutosh Sharma <[email protected]>
> >> wrote:
> >> >
> >> > Hi Amit,
> >> >
> >> > On Thu, Aug 28, 2025 at 3:26 PM Amit Kapila <[email protected]>
> >> > wrote:
> >> >>
> >> >> On Thu, Aug 28, 2025 at 11:07 AM Ashutosh Sharma
> >> >> <[email protected]> wrote:
> >> >> >
> >> >> > We have seen cases where slot synchronization gets delayed, for
> >> >> > example when the slot is behind the failover standby or vice versa,
> >> >> > and the slot sync worker has to wait for one to catch up with the
> >> >> > other. During this waiting period, users querying
> >> >> > pg_replication_slots can only see whether the slot has been
> >> >> > synchronized or not. If it has already synchronized, that’s fine, but
> >> >> > if synchronization is taking longer, users would naturally want to
> >> >> > understand the reason for the delay.
> >> >> >
> >> >> > Is there a way for end users to know the cause of slot
> >> >> > synchronization delays, so they can take appropriate actions to speed
> >> >> > it up?
> >> >> >
> >> >> > I understand that server logs are emitted in such cases, but logs are
> >> >> > not something end users would want to check regularly. Moreover,
> >> >> > since logging is configuration-based, relevant messages may sometimes
> >> >> > be skipped or suppressed.
> >> >> >
> >> >>
> >> >> Currently, the way to see the reason for sync skip is LOGs but I think
> >> >> it is better to add a new column like sync_skip_reason in
> >> >> pg_replication_slots. This can show the reasons like
> >> >> standby_LSN_ahead_remote_LSN. I think ideally users can compare
> >> >> standby's slot LSN/XMIN with remote_slot being synced. Do you have any
> >> >> better ideas?
> >> >>
> >> >
> >> > I have similar thoughts, but for clarity, I’d like to outline some of
> >> > the key steps I plan to take:
> >> >
> >> > Step 1: Define an enum for all possible reasons a slot persistence was
> >> > skipped.
> >> >
> >> > /*
> >> >  * Reasons why a replication slot sync was skipped.
> >> >  */
> >> > typedef enum ReplicationSlotSyncSkipReason
> >> > {
> >> >     RS_SYNC_SKIP_NONE = 0, /* No skip */
> >> >
> >> >     RS_SYNC_SKIP_REMOTE_BEHIND = (1 << 0), /* Remote slot is behind
> >> > local reserved LSN */
> >> >
> >> >     RS_SYNC_SKIP_DATA_LOSS = (1 << 1), /* Local slot ahead of
> >> > remote, risk of data loss */
> >> >
> >> >     RS_SYNC_SKIP_NO_SNAPSHOT = (1 << 2) /* Standby could not build a
> >> > consistent snapshot */
> >> > } ReplicationSlotSyncSkipReason;
> >> >
> >> > --
> >> >
> >> I think we should also add the case when "remote_slot->confirmed_lsn >
> >> latestFlushPtr" (WAL corresponding to the confirmed lsn on remote slot
> >> is still not flushed on the Standby). In this case as well we are
> >> skipping the slot sync.
> >
> >
> > Yes, we can include this case as well.
> >
> >>
> >>
> >> > Step 2: Introduce new column to pg_replication_slots to store the skip
> >> > reason
> >> >
> >> > /* Inside pg_replication_slots table */
> >> > ReplicationSlotSyncSkipReason slot_sync_skip_reason;
> >> >
> >> > --
> >> >
> >> As per the discussion [1], I think it is more of stat related data and
> >> we should add it in the pg_stat_replication_slots view. Also we can
> >> add columns for 'slot sync skip count' and 'last slot sync skip'.
> >> Thoughts?
> >
> >
> > It’s not a bad choice, but what makes it a bit confusing for me is that
> > some of the slot sync information is stored in pg_replication_slots, while
> > some is in pg_stat_replication_slots.
> >
>
> How about keeping sync_skip_reason in pg_replication_slots and
> sync_skip_count in pg_stat_replication_slots?
>

I think we can do that, since sync_skip_reason is descriptive metadata
rather than statistical data like slot_sync_skip_count and
last_slot_sync_skip. That said, all three pieces of data are transient by
nature: they are only present at runtime.

--
With Regards,
Ashutosh Sharma.
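
P.S. To make the proposed split a bit more concrete, here is a rough, standalone sketch (not a patch, and not tied to the actual slot structures). It reuses the enum from Step 1. The names RS_SYNC_SKIP_WAL_NOT_FLUSHED, SlotSyncSkipInfo, sync_skip_reason_name() and record_sync_attempt(), as well as the text labels, are placeholders I made up for illustration: the reason would be what pg_replication_slots shows, and the counter would be what pg_stat_replication_slots shows.

```c
#include <stddef.h>
#include <stdint.h>

/* Reasons why a replication slot sync was skipped (from Step 1). */
typedef enum ReplicationSlotSyncSkipReason
{
    RS_SYNC_SKIP_NONE = 0,                  /* No skip */
    RS_SYNC_SKIP_REMOTE_BEHIND = (1 << 0),  /* Remote slot is behind local reserved LSN */
    RS_SYNC_SKIP_DATA_LOSS = (1 << 1),      /* Local slot ahead of remote */
    RS_SYNC_SKIP_NO_SNAPSHOT = (1 << 2),    /* No consistent snapshot on standby */
    /* Placeholder name for the case Shlok raised:
     * remote_slot->confirmed_lsn > latestFlushPtr */
    RS_SYNC_SKIP_WAL_NOT_FLUSHED = (1 << 3)
} ReplicationSlotSyncSkipReason;

/* Hypothetical per-slot runtime state; nothing here is persisted. */
typedef struct SlotSyncSkipInfo
{
    ReplicationSlotSyncSkipReason reason;   /* would feed pg_replication_slots */
    uint64_t    skip_count;                 /* would feed pg_stat_replication_slots */
} SlotSyncSkipInfo;

/*
 * Map a single reason to the text a view column could display.
 * Returns NULL when nothing was skipped (i.e. a SQL NULL in the view).
 * The label strings are illustrative only.
 */
static const char *
sync_skip_reason_name(ReplicationSlotSyncSkipReason reason)
{
    switch (reason)
    {
        case RS_SYNC_SKIP_NONE:
            return NULL;
        case RS_SYNC_SKIP_REMOTE_BEHIND:
            return "remote_behind";
        case RS_SYNC_SKIP_DATA_LOSS:
            return "data_loss";
        case RS_SYNC_SKIP_NO_SNAPSHOT:
            return "no_snapshot";
        case RS_SYNC_SKIP_WAL_NOT_FLUSHED:
            return "wal_not_flushed";
    }
    return "unknown";
}

/*
 * Record the outcome of one sync attempt: always overwrite the current
 * reason, but bump the counter only when the sync was actually skipped.
 */
static void
record_sync_attempt(SlotSyncSkipInfo *info, ReplicationSlotSyncSkipReason reason)
{
    info->reason = reason;
    if (reason != RS_SYNC_SKIP_NONE)
        info->skip_count++;
}
```

With this shape, a successful sync clears the reason while the skip count stays as a cumulative number, which I think matches the "metadata vs. statistics" distinction above.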
