On Fri, May 29, 2026 at 7:01 PM Chao Li <[email protected]> wrote:
>
>
>
> > On May 30, 2026, at 03:24, Masahiko Sawada <[email protected]> wrote:
> >
> > On Thu, May 28, 2026 at 8:39 PM Chao Li <[email protected]> wrote:
> >>
> >>
> >>
> >>> On May 29, 2026, at 05:53, Masahiko Sawada <[email protected]> wrote:
> >>>
> >>> On Thu, May 28, 2026 at 2:09 AM Chao Li <[email protected]> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> While testing “Toggle logical decoding dynamically based on logical slot 
> >>>> presence”, I hit an assertion failure with concurrent logical slot 
> >>>> creation.
> >>>>
> >>>> This is a repo:
> >>>>
> >>>> 1. In session 1, attach the injection point locally and start creating a 
> >>>> logical slot. The session blocks at logical-decoding-activation:
> >>>> ```
> >>>> evantest=# set application_name = 'slot_a';
> >>>> SET
> >>>> evantest=# select injection_points_set_local();
> >>>> injection_points_set_local
> >>>> ----------------------------
> >>>>
> >>>> (1 row)
> >>>> evantest=# select injection_points_attach('logical-decoding-activation', 
> >>>> 'wait');
> >>>> injection_points_attach
> >>>> -------------------------
> >>>>
> >>>> (1 row)
> >>>> evantest=# select pg_create_logical_replication_slot('slot_a', 
> >>>> 'pgoutput');
> >>>> ```
> >>>>
> >>>> 2. In session 2, create another logical slot. This succeeds, and 
> >>>> effective_wal_level becomes logical:
> >>>> ```
> >>>> evantest=# select pg_create_logical_replication_slot('slot_b', 
> >>>> 'pgoutput');
> >>>> pg_create_logical_replication_slot
> >>>> ------------------------------------
> >>>> (slot_b,0/0902E418)
> >>>> (1 row)
> >>>>
> >>>> evantest=# show effective_wal_level;
> >>>> effective_wal_level
> >>>> ---------------------
> >>>> logical
> >>>> (1 row)
> >>>> ```
> >>>>
> >>>> 3. In session 2, cancel session 1 instead of waking it up:
> >>>> ```
> >>>> evantest=# select pg_cancel_backend(pid) from pg_stat_activity where 
> >>>> application_name = 'slot_a';
> >>>> pg_cancel_backend
> >>>> -------------------
> >>>> t
> >>>> (1 row)
> >>>> ```
> >>>>
> >>>> Then the server hits this assertion:
> >>>
> >>> Thank you for the report! I've confirmed the problem.
> >>>
> >>>>
> >>>> I might be over thinking, but I just feel the safest fix is to make 
> >>>> EnableLogicalDecoding() serialize. I tried serializing with 
> >>>> LogicalDecodingControlLock and with a separate lock, but both approaches 
> >>>> got deadlock around the barrier wait. I ended up with adding an 
> >>>> activation_in_progress flag in shared memory, protected by 
> >>>> LogicalDecodingControlLock, with a condition variable to wait for the 
> >>>> active activation to finish.
> >>>>
> >>>> With this fix, rerunning the repro makes session 2 wait while session 1 
> >>>> is blocked at the injection point. After canceling session 1 from 
> >>>> session 3, session 2 continues, creates the slot successfully, and 
> >>>> effective_wal_level becomes logical.
> >>>
> >>> This serialization idea is very similar to what we tried in older
> >>> version patches. While I've confirmed that the patch fixes the
> >>> reported issue, I'm somewhat concerned that it could introduce another
> >>> race condition as the patch adds a new mechanism.
> >>>
> >>> I think that the complication stems from the fact that
> >>> abort_logical_decoding_activation() and DisableLogicalDecoding() clear
> >>> the xlog_logical_info flag in different ways. I guess it would be
> >>> simpler if we could delegate all the deactivation process including
> >>> clearing xlog_logical_info flag to DisableLogicalDecoding(). It would
> >>> allow the system to be in the state of xlog_logical_info == true and
> >>> logical_decoding_enabled = false, but if we change
> >>> DisableLogicalDecoding() to handle this case, the deactivation process
> >>> would become simpler.
> >>>
> >>>> I didn’t include a test in this patch, as I wasn’t sure such a test 
> >>>> would be desirable. If others think it is worth adding, I can convert 
> >>>> the repro into a TAP test.
> >>>
> >>> I think we should have the tests.
> >>>
> >>> I've attached the patch for the above idea including the regression
> >>> tests. Please review it.
> >>>
> >>
> >> Yeah, your solution of deferring the actual abort to 
> >> DisableLogicalDecoding() is better. I have a few comments on the new patch:
> >>
> >
> > Thank you for reviewing the patch!
> >
> >> 1
> >> ```
> >> +       else
> >> +       {
> >> +               /*
> >> +                * Logical decoding is disabled while xlog_logical_info is 
> >> true.
> >> +                * This could happen if an activation is interrupted. 
> >> Reset only
> >> +                * xlog_logical_info.
> >> +                */
> >> +               LogicalDecodingCtl->xlog_logical_info = false;
> >> +               LogicalDecodingCtl->logical_decoding_enabled = false;
> >> +       }
> >> ```
> >>
> >> Here, we should also clear LogicalDecodingCtl->pending_disable = false;
> >
> > Right.
> >
> >>
> >> 2. AKAIK, critical section only works with elog, thus simple shared memory 
> >> assignments don't have to be inside the critical section. So we can move 
> >> out LogicalDecodingCtl assignments to outside the critical section, which 
> >> avoids some duplicate code.
> >
> > While it works, keeping these codes in a critical section is more
> > consistent with what EnableLogicalDecoding() does. Please review the
> > patch as I simplified the logic.
>
> The new code looks good.
>
> >
> >>
> >> 3
> >> ```
> >> +       if (!in_recovery && was_enabled)
> >>                ereport(LOG,
> >>                                errmsg("logical decoding is disabled 
> >> because there are no valid logical replication slots"));
> >> ```
> >>
> >> I think this log message can be emitted after releasing the lock, which 
> >> makes the locking period a bit shorter. That might delay the log output 
> >> slightly, but since slot creation is not a frequent operation, I think 
> >> that should be fine.
> >
> > I'm rather concerned that "disabled" log could be written after
> > "enabled" log even if the logical decoding gets enabled. I recall
> > there was the same proposal before and we concluded to write the
> > deactivation log under the lock. I added the comments so as not to
> > confuse it.
> >
>
> Thanks for adding the comments which helps code reads better understand the 
> logic.
>
> I’m still not convinced that emitting the log before releasing the lock is 
> necessary.
>
> Even if there is a race, the disable path would release the lock and then 
> emit its log message immediately. The racing enable path would still need to 
> acquire the lock, set the state, write and flush the logical-decoding status 
> WAL record, release the lock, and only then emit the “enabled” log message. 
> So the misleading log order seems possible in theory, but unlikely in 
> practice.
>
> So, this is a trade-off. I think avoiding ereport(LOG) under the LWLock gains 
> more than loss, but I don’t think this point is critical enough to argue 
> further.

Yeah, I think this is a trade-off between the accuracy of writing logs
in the correct order and the better concurrency on toggling logical
decoding status. Given that LogicalDecodingControlLock is unlikely to
be contended in practice, I guess doing ereport(LOG) under the LWLock
is acceptable in this case.

I think this part is not related to the reported bug and we can change
it in a separate patch later if we find out that the contention of
LogicalDecodingControlLock becomes contended.

> V3 overall looks good to me. A couple of small comments:
>
> 1
> ```
> +        * Logging under the lock guarantees our "is disabled" message appear 
> in
> ```
>
> Here, “appear” maybe better to be “appears”.

Fixed.

>
> 2
> ```
> +       # Canceling the backend should not affect the concurrent slot 
> creation.
> +       $primary->wait_for_log("canceling statement due to user request”);
> ```
>
> While the final review, I just noticed the previous test cases also have 
> pg_cancel_backend(pid), so we may better to get the log offset otherwise 
> wait_for_log may get the log from a previous cancellation, like this:
> ```
> diff --git a/src/test/recovery/t/051_effective_wal_level.pl 
> b/src/test/recovery/t/051_effective_wal_level.pl
> index e3d8b97c173..0e75b1536a4 100644
> --- a/src/test/recovery/t/051_effective_wal_level.pl
> +++ b/src/test/recovery/t/051_effective_wal_level.pl
> @@ -440,13 +440,14 @@ select 
> pg_create_logical_replication_slot('slot_canceled2', 'pgoutput');
>
>         # Cancel the backend initiated by $psql_create_slot, aborting its 
> activation
>         # process.
> +       my $log_offset = -s $primary->logfile;
>         $primary->safe_psql(
>                 'postgres',
>                 qq[
>  select pg_cancel_backend(pid) from pg_stat_activity where query ~ 
> 'slot_canceled2' and pid <> pg_backend_pid()
>  ]);
>         # Canceling the backend should not affect the concurrent slot 
> creation.
> -       $primary->wait_for_log("canceling statement due to user request");
> +       $primary->wait_for_log("canceling statement due to user request", 
> $log_offset);
>         test_wal_level($primary, "replica|logical",
>                                    "effective_wal_level remains 'logical' 
> even after the concurrent activation is interrupted");
>  }
> ```

Good catch. Fixed.

I've attached the updated patch.

FYI I've added this issue to the open item for the record.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
From f6bb89d2fb0596c43cca0ed658d77c7a944603fa Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <[email protected]>
Date: Fri, 29 May 2026 11:22:13 -0700
Subject: [PATCH v4] Fix race when logical decoding activation is concurrently
 interrupted.

EnableLogicalDecoding() sets xlog_logical_info to true, emits a
procsignal barrier, sets logical_decoding_enabled to true, and then
writes a WAL record. If the activating backend is interrupted between
these steps, a PG_ENSURE_ERROR_CLEANUP() callback runs to undo the
partial activation.

The previous callback asserted that logical_decoding_enabled was still
false and then cleared xlog_logical_info. Both actions were unsafe
when a second backend was concurrently activating: the peer backend
might have already observed xlog_logical_info as true, set
logical_decoding_enabled to true, and written the activation WAL
record before our callback fired, causing the first backend to hit the
assertion failure.

Fix this by having the abort callback call
RequestDisableLogicalDecoding(), allowing the checkpointer to undo the
partial activation in the same manner as a normal deactivation. This
simplifies the logic by unifying the activation abort and deactivation
paths. While this approach now wakes up the checkpointer when an
activation is interrupted, this should not be a serious issue in
practice since such interruptions are rare.

Add a test case to 051_effective_wal_level.pl.

Reported-by: Chao Li <[email protected]>
Reviewed-by: Chao Li <[email protected]>
Discussion: https://postgr.es/m/[email protected]
---
 src/backend/replication/logical/logicalctl.c  | 59 +++++++++++--------
 .../recovery/t/051_effective_wal_level.pl     | 45 +++++++++++++-
 2 files changed, 76 insertions(+), 28 deletions(-)

diff --git a/src/backend/replication/logical/logicalctl.c b/src/backend/replication/logical/logicalctl.c
index 72f68ec58ef..c487db8a9e7 100644
--- a/src/backend/replication/logical/logicalctl.c
+++ b/src/backend/replication/logical/logicalctl.c
@@ -256,33 +256,20 @@ write_logical_decoding_status_update_record(bool status)
 }
 
 /*
- * A PG_ENSURE_ERROR_CLEANUP callback for activating logical decoding, resetting
- * the shared flags to revert the logical decoding activation process.
+ * A PG_ENSURE_ERROR_CLEANUP callback for activating logical decoding.
+ *
+ * Rather than directly reverting xlog_logical_info here, we request
+ * that the checkpointer handle it via the normal disable path. This
+ * avoids race conditions when multiple backends attempt concurrent
+ * activation: the checkpointer will only reset xlog_logical_info when
+ * no valid logical slots exist, which naturally protects any other
+ * in-progress activation.
  */
 static void
 abort_logical_decoding_activation(int code, Datum arg)
 {
-	Assert(MyReplicationSlot);
-	Assert(!LogicalDecodingCtl->logical_decoding_enabled);
-
 	elog(DEBUG1, "aborting logical decoding activation process");
-
-	/*
-	 * Abort the change to xlog_logical_info. We don't need to check
-	 * CheckLogicalSlotExists() as we're still holding a logical slot.
-	 */
-	LWLockAcquire(LogicalDecodingControlLock, LW_EXCLUSIVE);
-	LogicalDecodingCtl->xlog_logical_info = false;
-	LWLockRelease(LogicalDecodingControlLock);
-
-	/*
-	 * Some processes might have already started logical info WAL logging, so
-	 * tell all running processes to update their caches. We don't need to
-	 * wait for all processes to disable xlog_logical_info locally as it's
-	 * always safe to write logical information to WAL records, even when not
-	 * strictly required.
-	 */
-	EmitProcSignalBarrier(PROCSIGNAL_BARRIER_UPDATE_XLOG_LOGICAL_INFO);
+	RequestDisableLogicalDecoding();
 }
 
 /*
@@ -416,6 +403,12 @@ EnableLogicalDecoding(void)
 
 	END_CRIT_SECTION();
 
+	/*
+	 * We log the activation message after releasing the slot lock. This is
+	 * safe because the activation is performed while holding a logical slot,
+	 * meaning, a concurrent deactivation cannot interleave its log message
+	 * ahead of ours.
+	 */
 	if (!in_recovery)
 		ereport(LOG,
 				errmsg("logical decoding is enabled upon creating a new logical replication slot"));
@@ -489,16 +482,19 @@ void
 DisableLogicalDecoding(void)
 {
 	bool		in_recovery = RecoveryInProgress();
+	bool		was_enabled;
 
 	LWLockAcquire(LogicalDecodingControlLock, LW_EXCLUSIVE);
 
 	/*
 	 * Check if we can disable logical decoding.
 	 *
-	 * Skip CheckLogicalSlotExists() check during recovery because the
-	 * existing slots will be invalidated after disabling logical decoding.
+	 * Nothing to do if both flags are already off, or if valid slots exist
+	 * (skip the slot check during recovery because the existing slots will be
+	 * invalidated after disabling logical decoding.)
 	 */
-	if (!LogicalDecodingCtl->logical_decoding_enabled ||
+	if ((!LogicalDecodingCtl->logical_decoding_enabled &&
+		 !LogicalDecodingCtl->xlog_logical_info) ||
 		(!in_recovery && CheckLogicalSlotExists()))
 	{
 		LogicalDecodingCtl->pending_disable = false;
@@ -506,6 +502,13 @@ DisableLogicalDecoding(void)
 		return;
 	}
 
+	/*
+	 * Remember if logical decoding was enabled for logging later. If an
+	 * activation is interrupted, logical decoding is not enabled while
+	 * xlog_logical_info is true.
+	 */
+	was_enabled = LogicalDecodingCtl->logical_decoding_enabled;
+
 	START_CRIT_SECTION();
 
 	/*
@@ -525,7 +528,11 @@ DisableLogicalDecoding(void)
 
 	END_CRIT_SECTION();
 
-	if (!in_recovery)
+	/*
+	 * Logging under the lock guarantees our "is disabled" message appears in
+	 * the server log before its eventual "is enabled".
+	 */
+	if (!in_recovery && was_enabled)
 		ereport(LOG,
 				errmsg("logical decoding is disabled because there are no valid logical replication slots"));
 
diff --git a/src/test/recovery/t/051_effective_wal_level.pl b/src/test/recovery/t/051_effective_wal_level.pl
index c862073c34e..c45eddc7383 100644
--- a/src/test/recovery/t/051_effective_wal_level.pl
+++ b/src/test/recovery/t/051_effective_wal_level.pl
@@ -410,8 +410,49 @@ select pg_cancel_backend(pid) from pg_stat_activity where query ~ 'slot_canceled
 
 	# Verify that the backend aborted the activation process.
 	$primary->wait_for_log("aborting logical decoding activation process");
-	test_wal_level($primary, "replica|replica",
-		"the activation process aborted");
+	wait_for_logical_decoding_disabled($primary);
+	pass("the activation process aborted");
+
+	# Test concurrent activation processes run and one is interrupted.
+	$psql_create_slot = $primary->background_psql('postgres');
+
+	# Start a psql session and stops in the middle of the activation
+	# process.
+	$psql_create_slot->query_until(
+		qr/create_slot_canceled/,
+		q(\echo create_slot_canceled
+select injection_points_set_local();
+select injection_points_attach('logical-decoding-activation', 'wait');
+select pg_create_logical_replication_slot('slot_canceled2', 'pgoutput');
+\q
+));
+	$primary->wait_for_event('client backend', 'logical-decoding-activation');
+	note("injection_point 'logical-decoding-activation' is reached");
+
+	# Another backend concurrently enables logical decoding.
+	$primary->safe_psql('postgres',
+		qq[select pg_create_logical_replication_slot('test_slot2', 'pgoutput')]
+	);
+
+	# Concurrent slot creation should not be blocked. So wait until
+	# test_slot2 is created and logical decoding is enabled.
+	test_wal_level($primary, "replica|logical",
+		"the concurrent activation has done properly");
+
+	# Cancel the backend initiated by $psql_create_slot, aborting its activation
+	# process.
+	my $log_offset = -s $primary->logfile;
+	$primary->safe_psql(
+		'postgres',
+		qq[
+select pg_cancel_backend(pid) from pg_stat_activity where query ~ 'slot_canceled2' and pid <> pg_backend_pid()
+]);
+	# Canceling the backend should not affect the concurrent slot creation.
+	$primary->wait_for_log("canceling statement due to user request",
+		$log_offset);
+	test_wal_level($primary, "replica|logical",
+		"effective_wal_level remains 'logical' even after the concurrent activation is interrupted"
+	);
 }
 
 $primary->stop;
-- 
2.54.0

Reply via email to