Prevent invalidation of newly created replication slots.

A race condition could cause a newly created replication slot to become
invalidated between WAL reservation and a checkpoint.

Previously, if the required WAL was removed, we retried the reservation
process. However, the slot could still be invalidated before the retry if
the WAL was not yet removed but the checkpoint advanced the redo pointer
beyond the slot's intended restart LSN and computed the minimum LSN that
needs to be preserved for the slots.

The fix is to acquire an exclusive lock on ReplicationSlotAllocationLock
during WAL reservation to serialize WAL reservation and checkpoint's
minimum restart_lsn computation. This ensures that, if WAL reservation
occurs first, the checkpoint waits until restart_lsn is updated before
removing WAL. If the checkpoint runs first, subsequent WAL reservations
pick a position at or after the latest checkpoint's redo pointer.

We can't use the same fix for branch 17 and prior because commit
2090edc6f3 changed to compute to the minimum restart_LSN among slot's at
the beginning of checkpoint (or restart point). The fix for 17 and prior
branches is under discussion and will be committed separately.

Reported-by: suyu.cmj <[email protected]>
Author: Hou Zhijie <[email protected]>
Reviewed-by: Vitaly Davydov <[email protected]>
Reviewed-by: Masahiko Sawada <[email protected]>
Reviewed-by: Amit Kapila <[email protected]>
Backpatch-through: 18
Discussion: 
https://postgr.es/m/5e045179-236f-4f8f-84f1-0f2566ba784c.mengjuan....@alibaba-inc.com

Branch
------
REL_18_STABLE

Details
-------
https://git.postgresql.org/pg/commitdiff/d3ceb20846e40ec6f39bce5659ddc15eadd29167

Modified Files
--------------
src/backend/replication/slot.c | 100 ++++++++++++++++++++++-------------------
1 file changed, 54 insertions(+), 46 deletions(-)

Reply via email to