On Fri, Sep 5, 2025 at 12:50 PM Ashutosh Sharma <ashu.coe...@gmail.com> wrote:
>
> Good to hear that you’re also interested in working on this task.
>
> On Thu, Sep 4, 2025 at 8:26 PM Shlok Kyal <shlok.kyal....@gmail.com> wrote:
>>
>> Hi Ashutosh,
>>
>> I am also interested in this thread. And was working on a patch for it.
>>
>> On Wed, 3 Sept 2025 at 17:52, Ashutosh Sharma <ashu.coe...@gmail.com> wrote:
>> >
>> > Hi Amit,
>> >
>> > On Thu, Aug 28, 2025 at 3:26 PM Amit Kapila <amit.kapil...@gmail.com>
>> > wrote:
>> >>
>> >> On Thu, Aug 28, 2025 at 11:07 AM Ashutosh Sharma <ashu.coe...@gmail.com>
>> >> wrote:
>> >> >
>> >> > We have seen cases where slot synchronization gets delayed, for example
>> >> > when the slot is behind the failover standby or vice versa, and the
>> >> > slot sync worker has to wait for one to catch up with the other. During
>> >> > this waiting period, users querying pg_replication_slots can only see
>> >> > whether the slot has been synchronized or not. If it has already
>> >> > synchronized, that’s fine, but if synchronization is taking longer,
>> >> > users would naturally want to understand the reason for the delay.
>> >> >
>> >> > Is there a way for end users to know the cause of slot synchronization
>> >> > delays, so they can take appropriate actions to speed it up?
>> >> >
>> >> > I understand that server logs are emitted in such cases, but logs are
>> >> > not something end users would want to check regularly. Moreover, since
>> >> > logging is configuration-based, relevant messages may sometimes be
>> >> > skipped or suppressed.
>> >> >
>> >>
>> >> Currently, the way to see the reason for sync skip is LOGs but I think
>> >> it is better to add a new column like sync_skip_reason in
>> >> pg_replication_slots. This can show the reasons like
>> >> standby_LSN_ahead_remote_LSN. I think ideally users can compare
>> >> standby's slot LSN/XMIN with remote_slot being synced. Do you have any
>> >> better ideas?
>> >>
>> >
>> > I have similar thoughts, but for clarity, I’d like to outline some of the
>> > key steps I plan to take:
>> >
>> > Step 1: Define an enum for all possible reasons a slot persistence was
>> > skipped.
>> >
>> > /*
>> >  * Reasons why a replication slot sync was skipped.
>> >  */
>> > typedef enum ReplicationSlotSyncSkipReason
>> > {
>> >     RS_SYNC_SKIP_NONE = 0, /* No skip */
>> >
>> >     RS_SYNC_SKIP_REMOTE_BEHIND = (1 << 0), /* Remote slot is behind local
>> >                                             * reserved LSN */
>> >
>> >     RS_SYNC_SKIP_DATA_LOSS = (1 << 1), /* Local slot ahead of remote,
>> >                                         * risk of data loss */
>> >
>> >     RS_SYNC_SKIP_NO_SNAPSHOT = (1 << 2) /* Standby could not build a
>> >                                          * consistent snapshot */
>> > } ReplicationSlotSyncSkipReason;
>> >
>> > --
>> >
>> I think we should also add the case when "remote_slot->confirmed_lsn >
>> latestFlushPtr" (WAL corresponding to the confirmed lsn on remote slot
>> is still not flushed on the Standby). In this case as well we are
>> skipping the slot sync.
>
>
> Yes, we can include this case as well.
>
>>
>>
>> > Step 2: Introduce new column to pg_replication_slots to store the skip
>> > reason
>> >
>> > /* Inside pg_replication_slots table */
>> > ReplicationSlotSyncSkipReason slot_sync_skip_reason;
>> >
>> > --
>> >
>> As per the discussion [1], I think it is more of stat related data and
>> we should add it in the pg_stat_replication_slots view. Also we can
>> add columns for 'slot sync skip count' and 'last slot sync skip'.
>> Thoughts?
>
>
> It’s not a bad choice, but what makes it a bit confusing for me is that some
> of the slot sync information is stored in pg_replication_slots, while some is
> in pg_stat_replication_slots.
>

How about keeping sync_skip_reason in pg_replication_slots and
sync_skip_count in pg_stat_replication_slots?

--
With Regards,
Amit Kapila.
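
A minimal sketch of the reason/count split suggested above, written as standalone C rather than actual PostgreSQL code. It assumes the bit-flag enum from the quoted proposal plus one extra value for the "remote_slot->confirmed_lsn > latestFlushPtr" case. RS_SYNC_SKIP_WAL_NOT_FLUSHED, SlotSyncState, SlotSyncStats, record_sync_attempt() and sync_skip_reason_name() are all hypothetical names used only for illustration, and the strings returned for the column are placeholders.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Skip reasons: the first four values follow the quoted proposal. */
typedef enum ReplicationSlotSyncSkipReason
{
    RS_SYNC_SKIP_NONE = 0,                   /* No skip */
    RS_SYNC_SKIP_REMOTE_BEHIND = (1 << 0),   /* Remote slot behind local reserved LSN */
    RS_SYNC_SKIP_DATA_LOSS = (1 << 1),       /* Local slot ahead of remote */
    RS_SYNC_SKIP_NO_SNAPSHOT = (1 << 2),     /* No consistent snapshot on standby */
    RS_SYNC_SKIP_WAL_NOT_FLUSHED = (1 << 3)  /* Hypothetical: remote confirmed_lsn
                                              * is beyond latestFlushPtr */
} ReplicationSlotSyncSkipReason;

/* Hypothetical current state, as pg_replication_slots might expose it. */
typedef struct SlotSyncState
{
    ReplicationSlotSyncSkipReason sync_skip_reason;
} SlotSyncState;

/* Hypothetical cumulative counter, as pg_stat_replication_slots might expose it. */
typedef struct SlotSyncStats
{
    uint64_t sync_skip_count;
} SlotSyncStats;

/*
 * Record the outcome of one sync attempt: the state always holds the
 * latest reason, while the counter only grows when the sync was skipped.
 */
static void
record_sync_attempt(SlotSyncState *state, SlotSyncStats *stats,
                    ReplicationSlotSyncSkipReason reason)
{
    assert(state != NULL && stats != NULL);

    state->sync_skip_reason = reason;
    if (reason != RS_SYNC_SKIP_NONE)
        stats->sync_skip_count++;
}

/* Placeholder text for a sync_skip_reason column. */
static const char *
sync_skip_reason_name(ReplicationSlotSyncSkipReason reason)
{
    switch (reason)
    {
        case RS_SYNC_SKIP_NONE:            return "none";
        case RS_SYNC_SKIP_REMOTE_BEHIND:   return "remote_behind";
        case RS_SYNC_SKIP_DATA_LOSS:       return "data_loss";
        case RS_SYNC_SKIP_NO_SNAPSHOT:     return "no_snapshot";
        case RS_SYNC_SKIP_WAL_NOT_FLUSHED: return "wal_not_flushed";
    }
    return "unknown";
}
```

With this layout the reason reflects only the most recent sync attempt and goes back to "none" once the slot syncs, while the count keeps accumulating across attempts.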