Re: Race conditions in logical decoding

Antonin Houska Tue, 20 Jan 2026 00:31:00 -0800

Antonin Houska <[email protected]> wrote:

> I'm not sure yet how to fix the problem. I tried to call XactLockTableWait()
> from SnapBuildAddCommittedTxn() (like it happens in SnapBuildWaitSnapshot()),
> but it made at least one regression test (subscription/t/010_truncate.pl)
> stuck - probably a deadlock. I can spend more time on it, but maybe someone
> can come up with a good idea sooner than me.


Attached here is what I consider a possible fix - simply wait for the CLOG
update before building a new snapshot.

Unfortunately I have no idea right now how to test it using the isolation
tester. With the fix, the additional waiting makes the current test
block. (And if a step is added that unblock the session, it will not reliably
catch failure to wait.)

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com

>From 5a6002215fb8ebeaf1dde120e5f6706bca7b62ae Mon Sep 17 00:00:00 2001
From: Antonin Houska <[email protected]>
Date: Mon, 19 Jan 2026 16:07:45 +0100
Subject: [PATCH] Fix race conditions during the setup of logical decoding.

Although it's rather unlikely, it can happen that the snapshot builder
considers transaction committed (according to WAL) before the commit could be
recorded in CLOG. In an extreme case, snapshot can even be created and used in
between. Since both snapshot and CLOG are needed for visibility checks, this
inconsistency can make them work incorrectly.

The typical symptom is that a transaction that the snapshot considers not
running anymore is (per CLOG) considered aborted instead of committed. Thus a
new tuple version can be evaluated as invisible (if xmin is incorrectly
considered aborted) or a deleted tuple version can be evaluated as visible (if
xmax is incorrectly considered aborted).

This patch fixes the problem by checking if all the XIDs that the new snapshot
considers committed are really committed per CLOG. If at least one is not, the
check is repeated after a short delay. However, a single check is sufficient
in almost all cases, so the performance impact should be minimal.
---
 src/backend/replication/logical/snapbuild.c   | 41 +++++++++++++++++++
 .../utils/activity/wait_event_names.txt       |  1 +
 2 files changed, 42 insertions(+)

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 9b09dc8eac1..c02d08f3417 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -400,6 +400,47 @@ SnapBuildBuildSnapshot(SnapBuild *builder)
 	snapshot->xmin = builder->xmin;
 	snapshot->xmax = builder->xmax;
 
+	/*
+	 * Although it's very unlikely, it's possible that a commit WAL record was
+	 * decoded but CLOG is not aware of the commit yet. Should the CLOG update
+	 * be delayed even more, visibility checks that use this snapshot could
+	 * work incorrectly. Therefore we check the CLOG status here.
+	 */
+	while (true)
+	{
+		bool	found = false;
+
+		for (int i = 0; i < builder->committed.xcnt; i++)
+		{
+			/*
+			 * XXX Is it worth remembering the XIDs that appear to be
+			 * committed per CLOG and skipping them in the next iteration of
+			 * the outer loop? Not sure it's worth the effort - a single
+			 * iteration is enough in most cases.
+			 */
+			if (unlikely(!TransactionIdDidCommit(builder->committed.xip[i])))
+			{
+				found = true;
+
+				/*
+				 * Wait a bit before going to the next iteration of the outer
+				 * loop. The race conditions we address here is pretty rare,
+				 * so we shouldn't need to wait too long.
+				 */
+				(void) WaitLatch(MyLatch,
+								 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+								 10L,
+								 WAIT_EVENT_SNAPBUILD_CLOG);
+				ResetLatch(MyLatch);
+
+				break;
+			}
+		}
+
+		if (!found)
+			break;
+	}
+
 	/* store all transactions to be treated as committed by this snapshot */
 	snapshot->xip =
 		(TransactionId *) ((char *) snapshot + sizeof(SnapshotData));
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 5537a2d2530..b6318b0cf37 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -181,6 +181,7 @@ PG_SLEEP	"Waiting due to a call to <function>pg_sleep</function> or a sibling fu
 RECOVERY_APPLY_DELAY	"Waiting to apply WAL during recovery because of a delay setting."
 RECOVERY_RETRIEVE_RETRY_INTERVAL	"Waiting during recovery when WAL data is not available from any source (<filename>pg_wal</filename>, archive or stream)."
 REGISTER_SYNC_REQUEST	"Waiting while sending synchronization requests to the checkpointer, because the request queue is full."
+SNAPBUILD_CLOG	"Waiting for CLOG update before building snapshot."
 SPIN_DELAY	"Waiting while acquiring a contended spinlock."
 VACUUM_DELAY	"Waiting in a cost-based vacuum delay point."
 VACUUM_TRUNCATE	"Waiting to acquire an exclusive lock to truncate off any empty pages at the end of a table vacuumed."
-- 
2.47.3

Re: Race conditions in logical decoding

Reply via email to