> On Jun 2, 2026, at 00:44, Masahiko Sawada <[email protected]> wrote:
> 
> On Fri, May 29, 2026 at 7:01 PM Chao Li <[email protected]> wrote:
>> 
>> 
>> 
>>> On May 30, 2026, at 03:24, Masahiko Sawada <[email protected]> wrote:
>>> 
>>> On Thu, May 28, 2026 at 8:39 PM Chao Li <[email protected]> wrote:
>>>> 
>>>> 
>>>> 
>>>>> On May 29, 2026, at 05:53, Masahiko Sawada <[email protected]> wrote:
>>>>> 
>>>>> On Thu, May 28, 2026 at 2:09 AM Chao Li <[email protected]> wrote:
>>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> While testing “Toggle logical decoding dynamically based on logical slot 
>>>>>> presence”, I hit an assertion failure with concurrent logical slot 
>>>>>> creation.
>>>>>> 
>>>>>> This is a repo:
>>>>>> 
>>>>>> 1. In session 1, attach the injection point locally and start creating a 
>>>>>> logical slot. The session blocks at logical-decoding-activation:
>>>>>> ```
>>>>>> evantest=# set application_name = 'slot_a';
>>>>>> SET
>>>>>> evantest=# select injection_points_set_local();
>>>>>> injection_points_set_local
>>>>>> ----------------------------
>>>>>> 
>>>>>> (1 row)
>>>>>> evantest=# select injection_points_attach('logical-decoding-activation', 
>>>>>> 'wait');
>>>>>> injection_points_attach
>>>>>> -------------------------
>>>>>> 
>>>>>> (1 row)
>>>>>> evantest=# select pg_create_logical_replication_slot('slot_a', 
>>>>>> 'pgoutput');
>>>>>> ```
>>>>>> 
>>>>>> 2. In session 2, create another logical slot. This succeeds, and 
>>>>>> effective_wal_level becomes logical:
>>>>>> ```
>>>>>> evantest=# select pg_create_logical_replication_slot('slot_b', 
>>>>>> 'pgoutput');
>>>>>> pg_create_logical_replication_slot
>>>>>> ------------------------------------
>>>>>> (slot_b,0/0902E418)
>>>>>> (1 row)
>>>>>> 
>>>>>> evantest=# show effective_wal_level;
>>>>>> effective_wal_level
>>>>>> ---------------------
>>>>>> logical
>>>>>> (1 row)
>>>>>> ```
>>>>>> 
>>>>>> 3. In session 2, cancel session 1 instead of waking it up:
>>>>>> ```
>>>>>> evantest=# select pg_cancel_backend(pid) from pg_stat_activity where 
>>>>>> application_name = 'slot_a';
>>>>>> pg_cancel_backend
>>>>>> -------------------
>>>>>> t
>>>>>> (1 row)
>>>>>> ```
>>>>>> 
>>>>>> Then the server hits this assertion:
>>>>> 
>>>>> Thank you for the report! I've confirmed the problem.
>>>>> 
>>>>>> 
>>>>>> I might be over thinking, but I just feel the safest fix is to make 
>>>>>> EnableLogicalDecoding() serialize. I tried serializing with 
>>>>>> LogicalDecodingControlLock and with a separate lock, but both approaches 
>>>>>> got deadlock around the barrier wait. I ended up with adding an 
>>>>>> activation_in_progress flag in shared memory, protected by 
>>>>>> LogicalDecodingControlLock, with a condition variable to wait for the 
>>>>>> active activation to finish.
>>>>>> 
>>>>>> With this fix, rerunning the repro makes session 2 wait while session 1 
>>>>>> is blocked at the injection point. After canceling session 1 from 
>>>>>> session 3, session 2 continues, creates the slot successfully, and 
>>>>>> effective_wal_level becomes logical.
>>>>> 
>>>>> This serialization idea is very similar to what we tried in older
>>>>> version patches. While I've confirmed that the patch fixes the
>>>>> reported issue, I'm somewhat concerned that it could introduce another
>>>>> race condition as the patch adds a new mechanism.
>>>>> 
>>>>> I think that the complication stems from the fact that
>>>>> abort_logical_decoding_activation() and DisableLogicalDecoding() clear
>>>>> the xlog_logical_info flag in different ways. I guess it would be
>>>>> simpler if we could delegate all the deactivation process including
>>>>> clearing xlog_logical_info flag to DisableLogicalDecoding(). It would
>>>>> allow the system to be in the state of xlog_logical_info == true and
>>>>> logical_decoding_enabled = false, but if we change
>>>>> DisableLogicalDecoding() to handle this case, the deactivation process
>>>>> would become simpler.
>>>>> 
>>>>>> I didn’t include a test in this patch, as I wasn’t sure such a test 
>>>>>> would be desirable. If others think it is worth adding, I can convert 
>>>>>> the repro into a TAP test.
>>>>> 
>>>>> I think we should have the tests.
>>>>> 
>>>>> I've attached the patch for the above idea including the regression
>>>>> tests. Please review it.
>>>>> 
>>>> 
>>>> Yeah, your solution of deferring the actual abort to 
>>>> DisableLogicalDecoding() is better. I have a few comments on the new patch:
>>>> 
>>> 
>>> Thank you for reviewing the patch!
>>> 
>>>> 1
>>>> ```
>>>> +       else
>>>> +       {
>>>> +               /*
>>>> +                * Logical decoding is disabled while xlog_logical_info is 
>>>> true.
>>>> +                * This could happen if an activation is interrupted. 
>>>> Reset only
>>>> +                * xlog_logical_info.
>>>> +                */
>>>> +               LogicalDecodingCtl->xlog_logical_info = false;
>>>> +               LogicalDecodingCtl->logical_decoding_enabled = false;
>>>> +       }
>>>> ```
>>>> 
>>>> Here, we should also clear LogicalDecodingCtl->pending_disable = false;
>>> 
>>> Right.
>>> 
>>>> 
>>>> 2. AKAIK, critical section only works with elog, thus simple shared memory 
>>>> assignments don't have to be inside the critical section. So we can move 
>>>> out LogicalDecodingCtl assignments to outside the critical section, which 
>>>> avoids some duplicate code.
>>> 
>>> While it works, keeping these codes in a critical section is more
>>> consistent with what EnableLogicalDecoding() does. Please review the
>>> patch as I simplified the logic.
>> 
>> The new code looks good.
>> 
>>> 
>>>> 
>>>> 3
>>>> ```
>>>> +       if (!in_recovery && was_enabled)
>>>>               ereport(LOG,
>>>>                               errmsg("logical decoding is disabled because 
>>>> there are no valid logical replication slots"));
>>>> ```
>>>> 
>>>> I think this log message can be emitted after releasing the lock, which 
>>>> makes the locking period a bit shorter. That might delay the log output 
>>>> slightly, but since slot creation is not a frequent operation, I think 
>>>> that should be fine.
>>> 
>>> I'm rather concerned that "disabled" log could be written after
>>> "enabled" log even if the logical decoding gets enabled. I recall
>>> there was the same proposal before and we concluded to write the
>>> deactivation log under the lock. I added the comments so as not to
>>> confuse it.
>>> 
>> 
>> Thanks for adding the comments which helps code reads better understand the 
>> logic.
>> 
>> I’m still not convinced that emitting the log before releasing the lock is 
>> necessary.
>> 
>> Even if there is a race, the disable path would release the lock and then 
>> emit its log message immediately. The racing enable path would still need to 
>> acquire the lock, set the state, write and flush the logical-decoding status 
>> WAL record, release the lock, and only then emit the “enabled” log message. 
>> So the misleading log order seems possible in theory, but unlikely in 
>> practice.
>> 
>> So, this is a trade-off. I think avoiding ereport(LOG) under the LWLock 
>> gains more than loss, but I don’t think this point is critical enough to 
>> argue further.
> 
> Yeah, I think this is a trade-off between the accuracy of writing logs
> in the correct order and the better concurrency on toggling logical
> decoding status. Given that LogicalDecodingControlLock is unlikely to
> be contended in practice, I guess doing ereport(LOG) under the LWLock
> is acceptable in this case.
> 
> I think this part is not related to the reported bug and we can change
> it in a separate patch later if we find out that the contention of
> LogicalDecodingControlLock becomes contended.
> 
>> V3 overall looks good to me. A couple of small comments:
>> 
>> 1
>> ```
>> +        * Logging under the lock guarantees our "is disabled" message 
>> appear in
>> ```
>> 
>> Here, “appear” maybe better to be “appears”.
> 
> Fixed.
> 
>> 
>> 2
>> ```
>> +       # Canceling the backend should not affect the concurrent slot 
>> creation.
>> +       $primary->wait_for_log("canceling statement due to user request”);
>> ```
>> 
>> While the final review, I just noticed the previous test cases also have 
>> pg_cancel_backend(pid), so we may better to get the log offset otherwise 
>> wait_for_log may get the log from a previous cancellation, like this:
>> ```
>> diff --git a/src/test/recovery/t/051_effective_wal_level.pl 
>> b/src/test/recovery/t/051_effective_wal_level.pl
>> index e3d8b97c173..0e75b1536a4 100644
>> --- a/src/test/recovery/t/051_effective_wal_level.pl
>> +++ b/src/test/recovery/t/051_effective_wal_level.pl
>> @@ -440,13 +440,14 @@ select 
>> pg_create_logical_replication_slot('slot_canceled2', 'pgoutput');
>> 
>>        # Cancel the backend initiated by $psql_create_slot, aborting its 
>> activation
>>        # process.
>> +       my $log_offset = -s $primary->logfile;
>>        $primary->safe_psql(
>>                'postgres',
>>                qq[
>> select pg_cancel_backend(pid) from pg_stat_activity where query ~ 
>> 'slot_canceled2' and pid <> pg_backend_pid()
>> ]);
>>        # Canceling the backend should not affect the concurrent slot 
>> creation.
>> -       $primary->wait_for_log("canceling statement due to user request");
>> +       $primary->wait_for_log("canceling statement due to user request", 
>> $log_offset);
>>        test_wal_level($primary, "replica|logical",
>>                                   "effective_wal_level remains 'logical' 
>> even after the concurrent activation is interrupted");
>> }
>> ```
> 
> Good catch. Fixed.
> 
> I've attached the updated patch.
> 
> FYI I've added this issue to the open item for the record.
> 
> Regards,
> 
> --
> Masahiko Sawada
> Amazon Web Services: https://aws.amazon.com
> <v4-0001-Fix-race-when-logical-decoding-activation-is-conc.patch>

Thanks for updating the patch. I’m good with v4.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/






Reply via email to