On Thursday, November 20, 2025 4:26 PM Vitaly Davydov 
<[email protected]> wrote:
> 
> Hi Zhijie Hou
> 
> I'm not sure my previous email was sent successfully, as I do not see it in
> the pgsql-hackers archives. Re-sending it; sorry for the noise.
> 
> Thank you for preparing the patch!
> 
> The new locking scheme (Allocation, Control) seems to work without any
> deadlocks. Deadlock freedom is guaranteed by the lock ordering: either (1) the
> Allocation lock is acquired before the Control lock, or (2) the Control lock
> is taken alone, without the Allocation lock. I've attached a doc [1] in which
> I describe the locking scheme and analyze possible deadlocks. I haven't found
> any evident problems.

Thanks a lot for the test and analysis.

> 
> Concerning reserve_wal_for_local_slot: it seems to be used when synchronizing
> failover logical slots. Changing the restart_lsn of a synced logical slot to
> RedoRecPtr looks risky to me, because it may cause problems for logical
> replication using such a slot after replica promotion. But that seems to be
> an architectural issue, unrelated to the problems solved by the patch.

I think this is not an issue, because if we use the redo pointer instead of the
remote restart_lsn as the initial value, the synced slot won't be marked as
sync-ready, so the user cannot use it after promotion (this is also well
documented). It matches the existing behavior before the patch as well: e.g.,
if the required WALs were removed, the oldest available WAL was used as the
initial value, similarly resulting in the slot not being sync-ready.

> 
> The change of lock mode to EXCLUSIVE in
> ReplicationSlotsComputeRequiredLSN may hurt performance when many slots
> are advanced within a short period of time. In particular, it may affect
> walsender performance, since a walsender advances logical or physical slots
> whenever it receives a confirmation from the replica. I guess slot
> advancement can be a pretty frequent operation.

Yes, I had the same thought and considered a simple alternative (similar to your
suggestion below): use an exclusive lock only when updating a slot's restart_lsn
during WAL reservation, while continuing to use a shared lock in the computation
function, and additionally place XLogSetReplicationSlotMinimumLSN() under the
lock. This approach would also help serialize the process.

> 
> There is an idea to keep the SHARED lock but put
> XLogSetReplicationSlotMinimumLSN under this lock, and implement some
> guarantee that we do not move slots into the past, taking into account that a
> slot can be advanced into the past if the move doesn't cross a WAL segment
> boundary (it happens). In this case, if a concurrent process advances an
> existing slot, its old restart_lsn will still protect the WAL. For WAL
> reservation we may use an EXCLUSIVE lock in
> XLogSetReplicationSlotMinimumLSN.
> 
> Furthermore, I believe ReplicationSlotsComputeRequiredLSN is needed by the
> checkpointer to calculate the oldest required WAL segment, but it does not
> have to be called every time a slot is advanced. That may affect reporting
> (the GetWALAvailability function), but I think it is not a big problem to
> deal with.

I also think it's not a problem.

> 
> Some typos in the patch commit message:
> 
> 1) A typo: yet updated estart_lsn, while the
> 
> 2) If a backend advances a slot's restart_lsn reaches
> XLogSetReplicationSlotMinimumLSN... - maybe put 'and' before 'reaches'?

Fixed.

Best Regards,
Hou zj

Attachment: v5-0001-Fix-race-conditions-causing-invalidation-of-newly.patch