Hi,

While testing “Toggle logical decoding dynamically based on logical slot 
presence”, I hit an assertion failure with concurrent logical slot creation.

Here is a repro:

1. In session 1, attach the injection point locally and start creating a 
logical slot. The session blocks at logical-decoding-activation:
```
evantest=# set application_name = 'slot_a';
SET
evantest=# select injection_points_set_local();
 injection_points_set_local
----------------------------

(1 row)
evantest=# select injection_points_attach('logical-decoding-activation', 'wait');
 injection_points_attach
-------------------------

(1 row)
evantest=# select pg_create_logical_replication_slot('slot_a', 'pgoutput');
``` 

2. In session 2, create another logical slot. This succeeds, and 
effective_wal_level becomes logical:
```
evantest=# select pg_create_logical_replication_slot('slot_b', 'pgoutput');
 pg_create_logical_replication_slot
------------------------------------
 (slot_b,0/0902E418)
(1 row)

evantest=# show effective_wal_level;
 effective_wal_level
---------------------
 logical
(1 row)
```

3. In session 2, cancel session 1 instead of waking it up:
```
evantest=# select pg_cancel_backend(pid) from pg_stat_activity where application_name = 'slot_a';
 pg_cancel_backend
-------------------
 t
(1 row)
```

Then the server hits this assertion:
```
TRAP: failed Assert("!LogicalDecodingCtl->logical_decoding_enabled"), File: "logicalctl.c", Line: 266, PID: 13768
0   postgres                            0x00000001032b35d8 ExceptionalCondition + 216
1   postgres                            0x0000000102f64600 abort_logical_decoding_activation + 120
2   postgres                            0x0000000102f6451c EnsureLogicalDecodingEnabled + 412
3   postgres                            0x0000000102f9f314 create_logical_replication_slot + 164
4   postgres                            0x0000000102f9f1c4 pg_create_logical_replication_slot + 312
5   postgres                            0x0000000102ce5f48 ExecInterpExpr + 3888
6   postgres                            0x0000000102ce48b4 ExecInterpExprStillValid + 76
7   postgres                            0x0000000102d57e94 ExecEvalExprNoReturn + 44
8   postgres                            0x0000000102d57e54 ExecEvalExprNoReturnSwitchContext + 48
9   postgres                            0x0000000102d57d18 ExecProject + 72
10  postgres                            0x0000000102d57a9c ExecResult + 312
11  postgres                            0x0000000102d06f1c ExecProcNodeFirst + 92
12  postgres                            0x0000000102cfd8cc ExecProcNode + 60
13  postgres                            0x0000000102cf83fc ExecutePlan + 244
14  postgres                            0x0000000102cf8298 standard_ExecutorRun + 456
15  postgres                            0x0000000102cf80c0 ExecutorRun + 84
16  postgres                            0x000000010306fc64 PortalRunSelect + 296
17  postgres                            0x000000010306f674 PortalRun + 656
18  postgres                            0x000000010306a220 exec_simple_query + 1372
19  postgres                            0x0000000103069348 PostgresMain + 3224
20  postgres                            0x0000000103060a3c BackendInitialize + 0
21  postgres                            0x0000000102f27db8 postmaster_child_launch + 464
22  postgres                            0x0000000102f2f2ec BackendStartup + 304
23  postgres                            0x0000000102f2d260 ServerLoop + 372
24  postgres                            0x0000000102f2bd8c PostmasterMain + 6256
25  postgres                            0x0000000102d99e84 main + 924
26  dyld                                0x000000018cef7e00 start + 6992
2026-05-28 13:28:32.526 CST [13753] LOG:  client backend (PID 13768) was terminated by signal 6: Abort trap: 6
2026-05-28 13:28:32.526 CST [13753] DETAIL:  Failed process was running: select pg_create_logical_replication_slot('slot_a', 'pgoutput');
```

From my tracing, when session 1 is cancelled, it enters 
abort_logical_decoding_activation(), which contains this assertion:
```
Assert(!LogicalDecodingCtl->logical_decoding_enabled);
```

But by then session 2 has already created its slot and set 
LogicalDecodingCtl->logical_decoding_enabled to true, so the assertion fails. 
This is a race between two concurrent activations.
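
To make the interleaving concrete, here is a toy replay in plain C. It is not 
PostgreSQL code: ToyDecodingCtl and the toy_* functions are hypothetical 
stand-ins that model only the shared flag and the check made on the abort path.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the shared LogicalDecodingCtl state. */
typedef struct
{
    bool        logical_decoding_enabled;
} ToyDecodingCtl;

/* A session that completes its activation sets the shared flag. */
static void
toy_finish_activation(ToyDecodingCtl *ctl)
{
    ctl->logical_decoding_enabled = true;
}

/*
 * Abort path of a cancelled session.  Reports whether the condition that
 * the real code asserts (flag still false) holds at this point.
 */
static bool
toy_abort_invariant_holds(const ToyDecodingCtl *ctl)
{
    return !ctl->logical_decoding_enabled;
}

/* Replay the steps of the repro in order; returns the invariant's value. */
static bool
toy_replay_race(void)
{
    ToyDecodingCtl ctl = {false};

    /* Step 1: session 1 starts activating and blocks (nothing changes). */
    /* Step 2: session 2 runs its whole activation and enables decoding. */
    toy_finish_activation(&ctl);
    /* Step 3: session 1 is cancelled and reaches the abort path. */
    return toy_abort_invariant_holds(&ctl);
}
```

In this model toy_replay_race() returns false, which corresponds to the 
assertion firing in session 1.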

I might be overthinking this, but I think the safest fix is to serialize 
EnableLogicalDecoding(). I tried serializing with LogicalDecodingControlLock 
and with a separate lock, but both approaches deadlocked around the barrier 
wait. I ended up adding an activation_in_progress flag in shared memory, 
protected by LogicalDecodingControlLock, with a condition variable that other 
backends wait on until the active activation finishes.
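
The shape of that approach can be sketched outside PostgreSQL with pthreads. 
This is only an illustration under stated assumptions, not the attached patch: 
a pthread mutex stands in for LogicalDecodingControlLock, a pthread condition 
variable stands in for the PostgreSQL condition variable, and ToyCtl and the 
toy_* functions are made-up names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical model of the shared state; not the real struct. */
typedef struct
{
    pthread_mutex_t lock;       /* stands in for LogicalDecodingControlLock */
    pthread_cond_t  cv;         /* stands in for the condition variable */
    bool        activation_in_progress;
    bool        logical_decoding_enabled;
} ToyCtl;

static ToyCtl toy_ctl = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, false, false
};

/* Claim the right to activate, waiting while another session holds it. */
static void
toy_begin_activation(ToyCtl *ctl)
{
    pthread_mutex_lock(&ctl->lock);
    while (ctl->activation_in_progress)
        pthread_cond_wait(&ctl->cv, &ctl->lock);
    ctl->activation_in_progress = true;
    pthread_mutex_unlock(&ctl->lock);
}

/*
 * Finish an activation attempt.  On success publish the enabled state; in
 * either case clear the flag and wake every waiter so one can proceed.
 */
static void
toy_end_activation(ToyCtl *ctl, bool succeeded)
{
    pthread_mutex_lock(&ctl->lock);
    assert(ctl->activation_in_progress);
    if (succeeded)
        ctl->logical_decoding_enabled = true;
    ctl->activation_in_progress = false;
    pthread_cond_broadcast(&ctl->cv);
    pthread_mutex_unlock(&ctl->lock);
}
```

In this sketch the mutex is released between toy_begin_activation() and 
toy_end_activation(), so the slow step in the middle runs without holding the 
lock. Only the flag keeps other sessions out, and a cancelled session calls 
toy_end_activation(ctl, false) so that a waiter can take over.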

With this fix, rerunning the repro makes session 2 wait while session 1 is 
blocked at the injection point. After canceling session 1 from session 3, 
session 2 continues, creates the slot successfully, and effective_wal_level 
becomes logical.

I didn’t include a test in this patch, as I wasn’t sure such a test would be 
desirable. If others think it is worth adding, I can convert the repro into a 
TAP test.

See the attached patch for details.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/




Attachment: v1-0001-Fix-race-during-concurrent-logical-decoding-activ.patch