Hi Amit,

On Thu, Aug 28, 2025 at 3:26 PM Amit Kapila <[email protected]> wrote:
> On Thu, Aug 28, 2025 at 11:07 AM Ashutosh Sharma <[email protected]> wrote:
> >
> > We have seen cases where slot synchronization gets delayed, for example
> > when the slot is behind the failover standby or vice versa, and the slot
> > sync worker has to wait for one to catch up with the other. During this
> > waiting period, users querying pg_replication_slots can only see whether
> > the slot has been synchronized or not. If it has already synchronized,
> > that’s fine, but if synchronization is taking longer, users would naturally
> > want to understand the reason for the delay.
> >
> > Is there a way for end users to know the cause of slot synchronization
> > delays, so they can take appropriate actions to speed it up?
> >
> > I understand that server logs are emitted in such cases, but logs are
> > not something end users would want to check regularly. Moreover, since
> > logging is configuration-based, relevant messages may sometimes be skipped
> > or suppressed.
> >
>
> Currently, the way to see the reason for sync skip is LOGs but I think
> it is better to add a new column like sync_skip_reason in
> pg_replication_slots. This can show the reasons like
> standby_LSN_ahead_remote_LSN. I think ideally users can compare
> standby's slot LSN/XMIN with remote_slot being synced. Do you have any
> better ideas?
>

I have similar thoughts, but for clarity, I’d like to outline some of the
key steps I plan to take:

Step 1: Define an enum for all possible reasons why slot persistence was
skipped.

/*
 * Reasons why a replication slot sync was skipped.
 */
typedef enum ReplicationSlotSyncSkipReason
{
    RS_SYNC_SKIP_NONE = 0,                  /* No skip */
    RS_SYNC_SKIP_REMOTE_BEHIND = (1 << 0),  /* Remote slot is behind local
                                             * reserved LSN */
    RS_SYNC_SKIP_DATA_LOSS = (1 << 1),      /* Local slot ahead of remote,
                                             * risk of data loss */
    RS_SYNC_SKIP_NO_SNAPSHOT = (1 << 2)     /* Standby could not build a
                                             * consistent snapshot */
} ReplicationSlotSyncSkipReason;

--

Step 2: Introduce a new column in pg_replication_slots to store the skip
reason.

/* Inside pg_replication_slots table */
ReplicationSlotSyncSkipReason slot_sync_skip_reason;

--

Step 3: Add a function that converts the enum to a human-readable string
that can be stored in pg_replication_slots.

/*
 * Convert ReplicationSlotSyncSkipReason bitmask to human-readable string.
 *
 * Returns a palloc'd string; caller is responsible for freeing it.
 */
static char *
replication_slot_sync_skip_reason_str(ReplicationSlotSyncSkipReason reason)
{
    StringInfoData buf;

    initStringInfo(&buf);

    if (reason == RS_SYNC_SKIP_NONE)
    {
        appendStringInfoString(&buf, "none");
        return buf.data;
    }

    if (reason & RS_SYNC_SKIP_REMOTE_BEHIND)
        appendStringInfoString(&buf, "remote_behind|");
    if (reason & RS_SYNC_SKIP_DATA_LOSS)
        appendStringInfoString(&buf, "data_loss|");
    if (reason & RS_SYNC_SKIP_NO_SNAPSHOT)
        appendStringInfoString(&buf, "no_snapshot|");

    /* Remove trailing '|', keeping buf.len in step with the string */
    if (buf.len > 0 && buf.data[buf.len - 1] == '|')
        buf.data[--buf.len] = '\0';

    return buf.data;
}

--

Step 4: Capture slot_sync_skip_reason whenever the relevant LOG messages
are generated, primarily inside update_local_synced_slot or
update_and_persist_local_synced_slot. This value can later be persisted in
the pg_replication_slots catalog.

--

Please let me know if you have any objections. I’ll share the WIP patch in
a few days.

--
With Regards,
Ashutosh Sharma.
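
P.S. To make the Step 3 conversion easy to try outside the server, here is a rough standalone sketch. It is not the patch: the name sync_skip_reason_to_str and the caller-supplied buffer are assumptions made only for this example, standing in for the StringInfo/palloc version outlined in Step 3. The flag values are the ones from the Step 1 enum.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Flag values copied from the enum proposed in Step 1. */
typedef enum ReplicationSlotSyncSkipReason
{
    RS_SYNC_SKIP_NONE = 0,
    RS_SYNC_SKIP_REMOTE_BEHIND = (1 << 0),
    RS_SYNC_SKIP_DATA_LOSS = (1 << 1),
    RS_SYNC_SKIP_NO_SNAPSHOT = (1 << 2)
} ReplicationSlotSyncSkipReason;

/*
 * Hypothetical standalone variant of the Step 3 function: writes the
 * '|'-separated reason names into a caller-supplied buffer instead of
 * returning a palloc'd string. Output is truncated if buf is too small.
 */
static const char *
sync_skip_reason_to_str(int reason, char *buf, size_t buflen)
{
    static const struct
    {
        int         flag;
        const char *name;
    } names[] = {
        {RS_SYNC_SKIP_REMOTE_BEHIND, "remote_behind"},
        {RS_SYNC_SKIP_DATA_LOSS, "data_loss"},
        {RS_SYNC_SKIP_NO_SNAPSHOT, "no_snapshot"}
    };
    size_t      used = 0;
    size_t      i;

    assert(buf != NULL && buflen > 0);
    buf[0] = '\0';

    if (reason == RS_SYNC_SKIP_NONE)
    {
        snprintf(buf, buflen, "none");
        return buf;
    }

    for (i = 0; i < sizeof(names) / sizeof(names[0]); i++)
    {
        int         n;

        if ((reason & names[i].flag) == 0)
            continue;

        /* Emit the separator before every name except the first one. */
        n = snprintf(buf + used, buflen - used, "%s%s",
                     used > 0 ? "|" : "", names[i].name);
        if (n < 0 || (size_t) n >= buflen - used)
            break;              /* error or truncated: stop appending */
        used += (size_t) n;
    }

    return buf;
}
```

Writing the separator before each name (rather than after) avoids the trailing-'|' cleanup altogether; the same idea could be applied to the StringInfo version if that seems cleaner.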
