Hi Amit,

On Thu, Aug 28, 2025 at 3:26 PM Amit Kapila <[email protected]> wrote:

> On Thu, Aug 28, 2025 at 11:07 AM Ashutosh Sharma <[email protected]>
> wrote:
> >
> > We have seen cases where slot synchronization gets delayed, for example
> when the slot is behind the failover standby or vice versa, and the slot
> sync worker has to wait for one to catch up with the other. During this
> waiting period, users querying pg_replication_slots can only see whether
> the slot has been synchronized or not. If it has already synchronized,
> that’s fine, but if synchronization is taking longer, users would naturally
> want to understand the reason for the delay.
> >
> > Is there a way for end users to know the cause of slot synchronization
> delays, so they can take appropriate actions to speed it up?
> >
> > I understand that server logs are emitted in such cases, but logs are
> not something end users would want to check regularly. Moreover, since
> logging is configuration-based, relevant messages may sometimes be skipped
> or suppressed.
> >
>
> Currently, the way to see the reason for sync skip is LOGs but I think
> it is better to add a new column like sync_skip_reason in
> pg_replication_slots. This can show the reasons like
> standby_LSN_ahead_remote_LSN. I think ideally users can compare
> standby's slot LSN/XMIN with remote_slot being synced. Do you have any
> better ideas?
>
>
I have similar thoughts, but for clarity, I’d like to outline some of the
key steps I plan to take:

Step 1: Define an enum covering every reason a slot sync (and hence its
persistence) can be skipped.

/*
 * Reasons why a replication slot sync was skipped.
 */
typedef enum ReplicationSlotSyncSkipReason
{
    RS_SYNC_SKIP_NONE = 0,                 /* No skip */

    /* Remote slot is behind local reserved LSN */
    RS_SYNC_SKIP_REMOTE_BEHIND = (1 << 0),

    /* Local slot ahead of remote, risk of data loss */
    RS_SYNC_SKIP_DATA_LOSS = (1 << 1),

    /* Standby could not build a consistent snapshot */
    RS_SYNC_SKIP_NO_SNAPSHOT = (1 << 2)
} ReplicationSlotSyncSkipReason;

--

Step 2: Introduce a new column in pg_replication_slots to store the skip
reason.

/* Exposed through the pg_replication_slots view */
ReplicationSlotSyncSkipReason slot_sync_skip_reason;

--

Step 3: Add a function that converts the enum to a human-readable string
that can be shown in pg_replication_slots.

/*
 * Convert ReplicationSlotSyncSkipReason bitmask to human-readable string.
 *
 * Returns a palloc'd string; caller is responsible for freeing it.
 */
static char *
replication_slot_sync_skip_reason_str(ReplicationSlotSyncSkipReason reason)
{
    StringInfoData buf;
    initStringInfo(&buf);

    if (reason == RS_SYNC_SKIP_NONE)
    {
        appendStringInfoString(&buf, "none");
        return buf.data;
    }

    if (reason & RS_SYNC_SKIP_REMOTE_BEHIND)
        appendStringInfoString(&buf, "remote_behind|");
    if (reason & RS_SYNC_SKIP_DATA_LOSS)
        appendStringInfoString(&buf, "data_loss|");
    if (reason & RS_SYNC_SKIP_NO_SNAPSHOT)
        appendStringInfoString(&buf, "no_snapshot|");

    /* Remove trailing '|' and keep buf.len in sync with the data */
    if (buf.len > 0 && buf.data[buf.len - 1] == '|')
        buf.data[--buf.len] = '\0';

    return buf.data;
}
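
To illustrate the output format for combined flags, here is a rough
standalone sketch that compiles outside the PostgreSQL tree. It uses
snprintf instead of StringInfo; the name skip_reason_to_str, the int
parameter, and the caller-supplied buffer are assumptions made only for
this illustration and are not part of the planned patch.

```c
#include <stddef.h>
#include <stdio.h>

/* Same flag values as in Step 1. */
typedef enum ReplicationSlotSyncSkipReason
{
    RS_SYNC_SKIP_NONE = 0,
    RS_SYNC_SKIP_REMOTE_BEHIND = (1 << 0),
    RS_SYNC_SKIP_DATA_LOSS = (1 << 1),
    RS_SYNC_SKIP_NO_SNAPSHOT = (1 << 2)
} ReplicationSlotSyncSkipReason;

/*
 * Hypothetical helper: write the '|'-separated names of all set flags
 * into out (of size outlen) and return out.
 */
static char *
skip_reason_to_str(int reason, char *out, size_t outlen)
{
    static const struct
    {
        int         flag;
        const char *name;
    } names[] = {
        {RS_SYNC_SKIP_REMOTE_BEHIND, "remote_behind"},
        {RS_SYNC_SKIP_DATA_LOSS, "data_loss"},
        {RS_SYNC_SKIP_NO_SNAPSHOT, "no_snapshot"}
    };
    size_t      used = 0;
    size_t      i;

    if (outlen == 0)
        return out;
    out[0] = '\0';

    if (reason == RS_SYNC_SKIP_NONE)
    {
        snprintf(out, outlen, "none");
        return out;
    }

    for (i = 0; i < sizeof(names) / sizeof(names[0]); i++)
    {
        int         n;

        if (!(reason & names[i].flag))
            continue;

        /* Prefix a separator for every name after the first one. */
        n = snprintf(out + used, outlen - used, "%s%s",
                     used > 0 ? "|" : "", names[i].name);
        if (n < 0 || (size_t) n >= outlen - used)
            break;              /* buffer too small; stop appending */
        used += (size_t) n;
    }

    return out;
}
```

With a 64-byte buffer, RS_SYNC_SKIP_REMOTE_BEHIND | RS_SYNC_SKIP_NO_SNAPSHOT
renders as "remote_behind|no_snapshot", and RS_SYNC_SKIP_NONE as "none".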

--

Step 4: Capture slot_sync_skip_reason whenever the relevant LOG messages
are generated, primarily inside update_local_synced_slot or
update_and_persist_local_synced_slot. This value can later be
persisted in the pg_replication_slots catalog.
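
As a rough standalone sketch of the capture step: the struct DemoSyncedSlot,
the function demo_sync_attempt, and the LSN comparisons below are all
simplified assumptions for illustration. They are not the actual checks
in slotsync.c. The intent is only to show the reason being recorded at
the point where a skip is detected and cleared once a sync succeeds.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t DemoLSN;       /* stand-in for XLogRecPtr */

/* Same flag values as in Step 1. */
typedef enum ReplicationSlotSyncSkipReason
{
    RS_SYNC_SKIP_NONE = 0,
    RS_SYNC_SKIP_REMOTE_BEHIND = (1 << 0),
    RS_SYNC_SKIP_DATA_LOSS = (1 << 1),
    RS_SYNC_SKIP_NO_SNAPSHOT = (1 << 2)
} ReplicationSlotSyncSkipReason;

/* Hypothetical, heavily reduced view of a synced slot on the standby. */
typedef struct DemoSyncedSlot
{
    DemoLSN     restart_lsn;
    DemoLSN     confirmed_flush;
    int         slot_sync_skip_reason;  /* bitmask of the flags above */
} DemoSyncedSlot;

/*
 * Try to sync the local slot from the remote values. On a skip, record
 * why and leave the slot untouched; on success, clear the reason.
 * Returns true if the slot was synced.
 */
static bool
demo_sync_attempt(DemoSyncedSlot *local,
                  DemoLSN remote_restart_lsn,
                  DemoLSN remote_confirmed_flush,
                  bool snapshot_consistent)
{
    int         reason = RS_SYNC_SKIP_NONE;

    /* Assumed conditions; each would sit next to its LOG message. */
    if (remote_restart_lsn < local->restart_lsn)
        reason |= RS_SYNC_SKIP_REMOTE_BEHIND;
    if (remote_confirmed_flush < local->confirmed_flush)
        reason |= RS_SYNC_SKIP_DATA_LOSS;
    if (!snapshot_consistent)
        reason |= RS_SYNC_SKIP_NO_SNAPSHOT;

    local->slot_sync_skip_reason = reason;
    if (reason != RS_SYNC_SKIP_NONE)
        return false;

    local->restart_lsn = remote_restart_lsn;
    local->confirmed_flush = remote_confirmed_flush;
    return true;
}
```

Resetting the field on every attempt means a stale reason does not
linger once the slot has caught up.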

--

Please let me know if you have any objections. I’ll share the WIP patch in
a few days.

--
With Regards,
Ashutosh Sharma.
