Hi Ashutosh,

I am also interested in this thread and have been working on a patch for it.

On Wed, 3 Sept 2025 at 17:52, Ashutosh Sharma <ashu.coe...@gmail.com> wrote:
>
> Hi Amit,
>
> On Thu, Aug 28, 2025 at 3:26 PM Amit Kapila <amit.kapil...@gmail.com> wrote:
>>
>> On Thu, Aug 28, 2025 at 11:07 AM Ashutosh Sharma <ashu.coe...@gmail.com> 
>> wrote:
>> >
>> > We have seen cases where slot synchronization gets delayed, for example 
>> > when the slot is behind the failover standby or vice versa, and the slot 
>> > sync worker has to wait for one to catch up with the other. During this 
>> > waiting period, users querying pg_replication_slots can only see whether 
>> > the slot has been synchronized or not. If it has already synchronized, 
>> > that’s fine, but if synchronization is taking longer, users would 
>> > naturally want to understand the reason for the delay.
>> >
>> > Is there a way for end users to know the cause of slot synchronization 
>> > delays, so they can take appropriate actions to speed it up?
>> >
>> > I understand that server logs are emitted in such cases, but logs are not 
>> > something end users would want to check regularly. Moreover, since logging 
>> > is configuration-based, relevant messages may sometimes be skipped or 
>> > suppressed.
>> >
>>
>> Currently, the way to see the reason for sync skip is LOGs but I think
>> it is better to add a new column like sync_skip_reason in
>> pg_replication_slots. This can show the reasons like
>> standby_LSN_ahead_remote_LSN. I think ideally users can compare
>> standby's slot LSN/XMIN with remote_slot being synced. Do you have any
>> better ideas?
>>
>
> I have similar thoughts, but for clarity, I’d like to outline some of the key 
> steps I plan to take:
>
> Step 1: Define an enum for all possible reasons a slot persistence was 
> skipped.
>
> /*
>  * Reasons why a replication slot sync was skipped.
>  */
> typedef enum ReplicationSlotSyncSkipReason
> {
>     RS_SYNC_SKIP_NONE = 0,                 /* No skip */
>
>     /* Remote slot is behind local reserved LSN */
>     RS_SYNC_SKIP_REMOTE_BEHIND = (1 << 0),
>
>     /* Local slot ahead of remote, risk of data loss */
>     RS_SYNC_SKIP_DATA_LOSS = (1 << 1),
>
>     /* Standby could not build a consistent snapshot */
>     RS_SYNC_SKIP_NO_SNAPSHOT = (1 << 2)
> } ReplicationSlotSyncSkipReason;
>
> --
>
I think we should also cover the case where "remote_slot->confirmed_lsn >
latestFlushPtr" (the WAL corresponding to the confirmed LSN of the remote
slot has not yet been flushed on the standby). We skip the slot sync in
this case as well.
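
For illustration, here is a minimal standalone sketch of that extra check.
It is not taken from the attached patch: the enum, its value names, and the
helper function are hypothetical, and XLogRecPtr is re-declared as a
stand-in so that the snippet compiles outside the PostgreSQL tree.

```c
#include <stdint.h>

/* Stand-in for PostgreSQL's XLogRecPtr so the sketch is self-contained. */
typedef uint64_t XLogRecPtr;

/* Hypothetical reason codes; only the case discussed here is shown. */
typedef enum SlotSyncSkipReason
{
    SS_SKIP_NONE = 0,           /* sync can proceed */
    SS_SKIP_WAL_NOT_FLUSHED     /* remote confirmed LSN not yet flushed locally */
} SlotSyncSkipReason;

/*
 * Hypothetical helper: report a skip when the remote slot's confirmed LSN
 * is ahead of the latest WAL position flushed on the standby.
 */
static SlotSyncSkipReason
check_remote_confirmed_lsn(XLogRecPtr remote_confirmed_lsn,
                           XLogRecPtr latest_flush_ptr)
{
    if (remote_confirmed_lsn > latest_flush_ptr)
        return SS_SKIP_WAL_NOT_FLUSHED;

    return SS_SKIP_NONE;
}
```

The comparison is strict, so a remote confirmed LSN equal to the local flush
position does not count as a skip in this sketch.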

> Step 2: Introduce new column to pg_replication_slots to store the skip reason
>
> /* Inside pg_replication_slots table */
> ReplicationSlotSyncSkipReason slot_sync_skip_reason;
>
> --
>
As per the discussion [1], I think this is more statistics-related data,
so we should add it to the pg_stat_replication_slots view. We could also
add columns for 'slot sync skip count' and 'last slot sync skip'.
Thoughts?
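
As a rough sketch of what such counters could look like, the snippet below
keeps a skip count and the time of the last skip. The struct, field, and
function names are hypothetical and do not reflect the actual pgstat
structures; TimestampTz is re-declared as a stand-in so that the snippet
compiles on its own.

```c
#include <stdint.h>

/* Stand-in for PostgreSQL's TimestampTz so the sketch is self-contained. */
typedef int64_t TimestampTz;

/* Hypothetical per-slot counters behind the two proposed columns. */
typedef struct SlotSyncSkipStats
{
    uint64_t    slot_sync_skip_count;   /* number of skipped sync attempts */
    TimestampTz last_slot_sync_skip;    /* time of the most recent skip */
} SlotSyncSkipStats;

/* Hypothetical helper: record one skipped sync attempt at time 'now'. */
static void
record_slot_sync_skip(SlotSyncSkipStats *stats, TimestampTz now)
{
    stats->slot_sync_skip_count++;
    stats->last_slot_sync_skip = now;
}
```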

> Step 3: Function to convert enum to human-readable string that can be stored 
> in pg_replication_slots.
>
> /*
>  * Convert ReplicationSlotSyncSkipReason bitmask to human-readable string.
>  *
>  * Returns a palloc'd string; caller is responsible for freeing it.
>  */
> static char *
> replication_slot_sync_skip_reason_str(ReplicationSlotSyncSkipReason reason)
> {
>     StringInfoData buf;
>     initStringInfo(&buf);
>
>     if (reason == RS_SYNC_SKIP_NONE)
>     {
>         appendStringInfoString(&buf, "none");
>         return buf.data;
>     }
>
>     if (reason & RS_SYNC_SKIP_REMOTE_BEHIND)
>         appendStringInfoString(&buf, "remote_behind|");
>     if (reason & RS_SYNC_SKIP_DATA_LOSS)
>         appendStringInfoString(&buf, "data_loss|");
>     if (reason & RS_SYNC_SKIP_NO_SNAPSHOT)
>         appendStringInfoString(&buf, "no_snapshot|");
>
>     /* Remove trailing '|' */
>     if (buf.len > 0 && buf.data[buf.len - 1] == '|')
>         buf.data[buf.len - 1] = '\0';
>
>     return buf.data;
> }
>
> --
>
Why are we showing the cause of the slot sync delay as an aggregate of
all the causes that have occurred? I thought we should show only the
reason for the last slot sync delay.
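
To make the alternative concrete, here is a minimal sketch that tracks only
the most recent reason, using a plain enum instead of a bitmask. All type
and function names are hypothetical; the strings reuse the ones from the
quoted function.

```c
/* Hypothetical plain (non-bitmask) enum: one reason at a time. */
typedef enum LastSyncSkipReason
{
    LSSR_NONE = 0,
    LSSR_REMOTE_BEHIND,
    LSSR_DATA_LOSS,
    LSSR_NO_SNAPSHOT
} LastSyncSkipReason;

/* Hypothetical per-slot state holding only the latest outcome. */
typedef struct SyncedSlotState
{
    LastSyncSkipReason last_skip_reason;
} SyncedSlotState;

/* Each sync attempt overwrites the previous reason instead of OR-ing it in. */
static void
note_sync_attempt(SyncedSlotState *slot, LastSyncSkipReason reason)
{
    slot->last_skip_reason = reason;
}

/* Map a single reason to a constant string; nothing to allocate or free. */
static const char *
last_sync_skip_reason_name(LastSyncSkipReason reason)
{
    switch (reason)
    {
        case LSSR_NONE:
            return "none";
        case LSSR_REMOTE_BEHIND:
            return "remote_behind";
        case LSSR_DATA_LOSS:
            return "data_loss";
        case LSSR_NO_SNAPSHOT:
            return "no_snapshot";
    }
    return "unknown";           /* unreachable for valid enum values */
}
```

With a single value there is no need for a StringInfo buffer or for
trimming a trailing '|', and a successful sync resets the reason to "none".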

> Step 4: Capture slot_sync_skip_reason whenever the relevant LOG messages are 
> generated, primarily inside update_local_synced_slot or 
> update_and_persist_local_synced_slot. This value can later be persisted 
> in the pg_replication_slots catalog.
>
> --
>
> Please let me know if you have any objections. I’ll share the wip patch in a 
> few days.
>
> --
I have attached the patch I have been working on.

Thanks,
Shlok Kyal

Attachment: v1-0001-Add-stats-related-to-slot-sync-skip.patch