On Mon, Oct 27, 2025 at 5:22 AM Pradeep Kumar <[email protected]> wrote: > > Hi All, > In this thread they proposed fix_concurrent_slot_xmin_update.patch will solve > this assert failure. After applying this patch I execute > pg_sync_replication_slots() (which calls SyncReplicationSlots → > synchronize_slots() → synchronize_one_slot() → > ReplicationSlotsComputeRequiredXmin(true)) can hit an assertion failure in > ReplicationSlotsComputeRequiredXmin() because the ReplicationSlotControlLock > is not held in that code path. By default sync_replication_slots is off, so > the background slot-sync worker is not spawned; invoking the UDF directly > exercises the path without the lock. I have a small patch that acquires > ReplicationSlotControlLock in the manual sync path; that stops the assert. > > Call Stack : > TRAP: failed Assert("!already_locked || > (LWLockHeldByMeInMode(ReplicationSlotControlLock, LW_EXCLUSIVE) && > LWLockHeldByMeInMode(ProcArrayLock, LW_EXCLUSIVE))"), File: "slot. > c", Line: 1061, PID: 67056 > 0 postgres 0x000000010104aad4 > ExceptionalCondition + 216 > 1 postgres 0x0000000100d8718c > ReplicationSlotsComputeRequiredXmin + 180 > 2 postgres 0x0000000100d6fba8 > synchronize_one_slot + 1488 > 3 postgres 0x0000000100d6e8cc synchronize_slots > + 1480 > 4 postgres 0x0000000100d6efe4 > SyncReplicationSlots + 164 > 5 postgres 0x0000000100d8da84 > pg_sync_replication_slots + 476 > 6 postgres 0x0000000100b34c58 ExecInterpExpr + > 2388 > 7 postgres 0x0000000100b33ee8 > ExecInterpExprStillValid + 76 > 8 postgres 0x00000001008acd5c > ExecEvalExprSwitchContext + 64 > 9 postgres 0x0000000100b54d48 ExecProject + 76 > 10 postgres 0x0000000100b925d4 ExecResult + 312 > 11 postgres 0x0000000100b5083c ExecProcNodeFirst > + 92 > 12 postgres 0x0000000100b48b88 ExecProcNode + 60 > 13 postgres 0x0000000100b44410 ExecutePlan + 184 > 14 postgres 0x0000000100b442dc > standard_ExecutorRun + 644 > 15 postgres 0x0000000100b44048 ExecutorRun + 104 > 16 postgres 0x0000000100e3053c PortalRunSelect + > 308 > 17 postgres 0x0000000100e2ff40 PortalRun + 736 > 18 postgres 0x0000000100e2b21c exec_simple_query > + 1368 > 19 postgres 0x0000000100e2a42c PostgresMain + 2508 > 20 postgres 0x0000000100e22ce4 BackendInitialize > + 0 > 21 postgres 0x0000000100d1fd4c > postmaster_child_launch + 304 > 22 postgres 0x0000000100d26d9c BackendStartup + > 448 > 23 postgres 0x0000000100d23f18 ServerLoop + 372 > 24 postgres 0x0000000100d22f18 PostmasterMain + > 6396 > 25 postgres 0x0000000100bcffd4 init_locale + 0 > 26 dyld 0x0000000186d82b98 start + 6076 > > The assert is raised inside ReplicationSlotsComputeRequiredXmin() because > that function expects either that already_locked is false (and it will > acquire what it needs), or that callers already hold both > ReplicationSlotControlLock (exclusive) and ProcArrayLock (exclusive). In the > manual-sync path called by the UDF, neither lock is held, so the assertion > trips. > > Why this happens: > The background slot sync worker (spawned when sync_replication_slots = on) > acquires the necessary locks before calling the routines that update/compute > slot xmins, so the worker path is safe.The manual path through the > SQL-callable UDF does not take the same locks before calling > synchronize_slots()/synchronize_one_slot(). As a result the invariant assumed > by ReplicationSlotsComputeRequiredXmin() can be violated, leading to the > assert. > > Proposed fix: > In synchronize_slots() (the code path used by > SyncReplicationSlots()/pg_sync_replication_slots()), acquire > ReplicationSlotControlLock before any call that can end up calling > ReplicationSlotsComputeRequiredXmin(true).
It would be great if we have a test case for this issue possibly using injection points. Also, I think it's worth considering the idea Robert shared before[1]: --- But what about just surgically preventing that? ProcArraySetReplicationSlotXmin() could refuse to retreat the values, perhaps? If it computes an older value than what's there, it just does nothing? --- We did a similar fix for confirmed_flush LSN by commit ad5eaf390c582, and it sounds reasonable to me that ProcArraySetReplicationSlotXmin() refuses to retreat the values. Regards, [1] https://www.postgresql.org/message-id/CA%2BTgmoYLzJxCEa0aCan3KR7o_25G52cbqw-90Q0VGRmV3a8XGQ%40mail.gmail.com -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
