On Mon, Oct 27, 2025 at 5:22 AM Pradeep Kumar <[email protected]> wrote:
>
> Hi All,
> In this thread they proposed fix_concurrent_slot_xmin_update.patch will solve 
> this assert failure. After applying this patch I execute 
> pg_sync_replication_slots() (which calls SyncReplicationSlots → 
> synchronize_slots() → synchronize_one_slot() → 
> ReplicationSlotsComputeRequiredXmin(true)) can hit an assertion failure in 
> ReplicationSlotsComputeRequiredXmin() because the ReplicationSlotControlLock 
> is not held in that code path. By default sync_replication_slots is off, so 
> the background slot-sync worker is not spawned; invoking the UDF directly 
> exercises the path without the lock. I have a small patch that acquires 
> ReplicationSlotControlLock in the manual sync path; that stops the assert.
>
> Call Stack :
> TRAP: failed Assert("!already_locked || 
> (LWLockHeldByMeInMode(ReplicationSlotControlLock, LW_EXCLUSIVE) && 
> LWLockHeldByMeInMode(ProcArrayLock, LW_EXCLUSIVE))"), File: "slot.
> c", Line: 1061, PID: 67056
> 0   postgres                            0x000000010104aad4 
> ExceptionalCondition + 216
> 1   postgres                            0x0000000100d8718c 
> ReplicationSlotsComputeRequiredXmin + 180
> 2   postgres                            0x0000000100d6fba8 
> synchronize_one_slot + 1488
> 3   postgres                            0x0000000100d6e8cc synchronize_slots 
> + 1480
> 4   postgres                            0x0000000100d6efe4 
> SyncReplicationSlots + 164
> 5   postgres                            0x0000000100d8da84 
> pg_sync_replication_slots + 476
> 6   postgres                            0x0000000100b34c58 ExecInterpExpr + 
> 2388
> 7   postgres                            0x0000000100b33ee8 
> ExecInterpExprStillValid + 76
> 8   postgres                            0x00000001008acd5c 
> ExecEvalExprSwitchContext + 64
> 9   postgres                            0x0000000100b54d48 ExecProject + 76
> 10  postgres                            0x0000000100b925d4 ExecResult + 312
> 11  postgres                            0x0000000100b5083c ExecProcNodeFirst 
> + 92
> 12  postgres                            0x0000000100b48b88 ExecProcNode + 60
> 13  postgres                            0x0000000100b44410 ExecutePlan + 184
> 14  postgres                            0x0000000100b442dc 
> standard_ExecutorRun + 644
> 15  postgres                            0x0000000100b44048 ExecutorRun + 104
> 16  postgres                            0x0000000100e3053c PortalRunSelect + 
> 308
> 17  postgres                            0x0000000100e2ff40 PortalRun + 736
> 18  postgres                            0x0000000100e2b21c exec_simple_query 
> + 1368
> 19  postgres                            0x0000000100e2a42c PostgresMain + 2508
> 20  postgres                            0x0000000100e22ce4 BackendInitialize 
> + 0
> 21  postgres                            0x0000000100d1fd4c 
> postmaster_child_launch + 304
> 22  postgres                            0x0000000100d26d9c BackendStartup + 
> 448
> 23  postgres                            0x0000000100d23f18 ServerLoop + 372
> 24  postgres                            0x0000000100d22f18 PostmasterMain + 
> 6396
> 25  postgres                            0x0000000100bcffd4 init_locale + 0
> 26  dyld                                0x0000000186d82b98 start + 6076
>
> The assert is raised inside ReplicationSlotsComputeRequiredXmin() because 
> that function expects either that already_locked is false (and it will 
> acquire what it needs), or that callers already hold both 
> ReplicationSlotControlLock (exclusive) and ProcArrayLock (exclusive). In the 
> manual-sync path called by the UDF, neither lock is held, so the assertion 
> trips.
>
> Why this happens:
> The background slot sync worker (spawned when sync_replication_slots = on) 
> acquires the necessary locks before calling the routines that update/compute 
> slot xmins, so the worker path is safe.The manual path through the 
> SQL-callable UDF does not take the same locks before calling 
> synchronize_slots()/synchronize_one_slot(). As a result the invariant assumed 
> by ReplicationSlotsComputeRequiredXmin() can be violated, leading to the 
> assert.
>
> Proposed fix:
> In synchronize_slots() (the code path used by 
> SyncReplicationSlots()/pg_sync_replication_slots()), acquire 
> ReplicationSlotControlLock before any call that can end up calling 
> ReplicationSlotsComputeRequiredXmin(true).

It would be great if we have a test case for this issue possibly using
injection points.

Also, I think it's worth considering the idea Robert shared before[1]:

---
But what about just surgically preventing that?
ProcArraySetReplicationSlotXmin() could refuse to retreat the values,
perhaps? If it computes an older value than what's there, it just does
nothing?
---

We did a similar fix for confirmed_flush LSN by commit ad5eaf390c582,
and it sounds reasonable to me that ProcArraySetReplicationSlotXmin()
refuses to retreat the values.

Regards,

[1] 
https://www.postgresql.org/message-id/CA%2BTgmoYLzJxCEa0aCan3KR7o_25G52cbqw-90Q0VGRmV3a8XGQ%40mail.gmail.com

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com


Reply via email to