Hi,

I realized that I hadn't posted the patch version I described somewhere
downthread.  Here it is, as 0001.

While thinking about it for posting just now, I wondered if it would
work to consider that any transaction whose commit record has been
decoded but which nevertheless gets a true return from
TransactionIdIsInProgress(), just is not committed yet and so should be
omitted from the xip array of the snapshot.  That is, if it's
still-running, then we just don't copy it into the output snapshot.
That's implemented as 0002 here.  This seems somehow less controversial,
as we don't have to test TransactionIdDidCommit() for a transaction that
we haven't seen as not-running per PGPROC; and it should also be faster,
because we don't have to wait for anybody to commit.  However it gives
me pause that perhaps the snapshot would not be fully correct.  (Indeed
there are a few failing tests in the subscription suite).

Failing other ideas, I think we should just go with 0001.  We'd need more
commentary on why is TransactionIdDidCommit() OK, when we haven't
scanned PGPROC for that xid, though.

Thanks,

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"El que vive para el futuro es un iluso, y el que vive para el pasado,
un imbécil" (Luis Adler, "Los tripulantes de la noche")
>From 6bcc6c35e480ffa02117c1e6591f0bccdc70ad12 Mon Sep 17 00:00:00 2001
From: Antonin Houska <[email protected]>
Date: Mon, 19 Jan 2026 16:07:45 +0100
Subject: [PATCH 1/2] Fix race conditions during the setup of logical decoding.

Although it's rather unlikely, it can happen that the snapshot builder
considers transaction committed (according to WAL) before the commit could be
recorded in CLOG. In an extreme case, snapshot can even be created and used in
between. Since both snapshot and CLOG are needed for visibility checks, this
inconsistency can make them work incorrectly.

The typical symptom is that a transaction that the snapshot considers not
running anymore is (per CLOG) considered aborted instead of committed. Thus a
new tuple version can be evaluated as invisible (if xmin is incorrectly
considered aborted) or a deleted tuple version can be evaluated as visible (if
xmax is incorrectly considered aborted).

This patch fixes the problem by checking if all the XIDs that the new snapshot
considers committed are really committed per CLOG. If at least one is not, the
check is repeated after a short delay. However, a single check is sufficient
in almost all cases, so the performance impact should be minimal.
---
 src/backend/replication/logical/snapbuild.c   | 27 ++++++++++++++++++-
 .../utils/activity/wait_event_names.txt       |  1 +
 2 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 37f0c6028bd..d7ea098cb37 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -377,7 +377,7 @@ SnapBuildBuildSnapshot(SnapBuild *builder)
 
 	/*
 	 * We misuse the original meaning of SnapshotData's xip and subxip fields
-	 * to make the more fitting for our needs.
+	 * to make them more fitting for our needs.
 	 *
 	 * In the 'xip' array we store transactions that have to be treated as
 	 * committed. Since we will only ever look at tuples from transactions
@@ -402,6 +402,31 @@ SnapBuildBuildSnapshot(SnapBuild *builder)
 	snapshot->xmin = builder->xmin;
 	snapshot->xmax = builder->xmax;
 
+	/*
+	 * Although it's very unlikely, it's possible that a commit WAL record was
+	 * decoded but CLOG is not aware of the commit yet. Should the CLOG update
+	 * be delayed even more, visibility checks that use this snapshot could
+	 * work incorrectly. Therefore we check the CLOG status here.
+	 */
+	for (int i = 0; i < builder->committed.xcnt; i++)
+	{
+		for (;;)
+		{
+			if (TransactionIdDidCommit(builder->committed.xip[i]))
+				break;
+			else
+			{
+				(void) WaitLatch(MyLatch,
+								 WL_LATCH_SET | WL_TIMEOUT |
+								 WL_EXIT_ON_PM_DEATH,
+								 10L,
+								 WAIT_EVENT_SNAPBUILD_CLOG);
+				ResetLatch(MyLatch);
+			}
+			CHECK_FOR_INTERRUPTS();
+		}
+	}
+
 	/* store all transactions to be treated as committed by this snapshot */
 	snapshot->xip =
 		(TransactionId *) ((char *) snapshot + sizeof(SnapshotData));
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 4aa864fe3c3..987df777e47 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -181,6 +181,7 @@ PG_SLEEP	"Waiting due to a call to <function>pg_sleep</function> or a sibling fu
 RECOVERY_APPLY_DELAY	"Waiting to apply WAL during recovery because of a delay setting."
 RECOVERY_RETRIEVE_RETRY_INTERVAL	"Waiting during recovery when WAL data is not available from any source (<filename>pg_wal</filename>, archive or stream)."
 REGISTER_SYNC_REQUEST	"Waiting while sending synchronization requests to the checkpointer, because the request queue is full."
+SNAPBUILD_CLOG	"Waiting for CLOG update before building snapshot."
 SPIN_DELAY	"Waiting while acquiring a contended spinlock."
 VACUUM_DELAY	"Waiting in a cost-based vacuum delay point."
 VACUUM_TRUNCATE	"Waiting to acquire an exclusive lock to truncate off any empty pages at the end of a table vacuumed."
-- 
2.47.3

>From 54d22ff9cd46b00b3bdd06f3bdd4b22738645eb2 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=C3=81lvaro=20Herrera?= <[email protected]>
Date: Fri, 20 Mar 2026 15:58:05 +0100
Subject: [PATCH 2/2] What about just ignoring the xacts if they're still
 running?

---
 src/backend/replication/logical/snapbuild.c | 37 +++++----------------
 1 file changed, 8 insertions(+), 29 deletions(-)

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index d7ea098cb37..2a4581b75a6 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -402,38 +402,17 @@ SnapBuildBuildSnapshot(SnapBuild *builder)
 	snapshot->xmin = builder->xmin;
 	snapshot->xmax = builder->xmax;
 
-	/*
-	 * Although it's very unlikely, it's possible that a commit WAL record was
-	 * decoded but CLOG is not aware of the commit yet. Should the CLOG update
-	 * be delayed even more, visibility checks that use this snapshot could
-	 * work incorrectly. Therefore we check the CLOG status here.
-	 */
-	for (int i = 0; i < builder->committed.xcnt; i++)
-	{
-		for (;;)
-		{
-			if (TransactionIdDidCommit(builder->committed.xip[i]))
-				break;
-			else
-			{
-				(void) WaitLatch(MyLatch,
-								 WL_LATCH_SET | WL_TIMEOUT |
-								 WL_EXIT_ON_PM_DEATH,
-								 10L,
-								 WAIT_EVENT_SNAPBUILD_CLOG);
-				ResetLatch(MyLatch);
-			}
-			CHECK_FOR_INTERRUPTS();
-		}
-	}
-
 	/* store all transactions to be treated as committed by this snapshot */
 	snapshot->xip =
 		(TransactionId *) ((char *) snapshot + sizeof(SnapshotData));
-	snapshot->xcnt = builder->committed.xcnt;
-	memcpy(snapshot->xip,
-		   builder->committed.xip,
-		   builder->committed.xcnt * sizeof(TransactionId));
+
+	for (int i = 0; i < builder->committed.xcnt; i++)
+	{
+		if (!TransactionIdIsInProgress(builder->committed.xip[i]))
+		{
+			snapshot->xip[snapshot->xcnt++] = builder->committed.xip[i];
+		}
+	}
 
 	/* sort so we can bsearch() */
 	qsort(snapshot->xip, snapshot->xcnt, sizeof(TransactionId), xidComparator);
-- 
2.47.3

Reply via email to