On 12/12/2025 15:19, Jasper Smit wrote:
> Thanks for looking into this and for creating the patch.

>> +1 for that approach. heap_lock_updated_tuple() actually does that check
>> too, but not for the first tuple.

> I think this will for sure fix the problem; however, we are still
> accessing completely unrelated tuples. It feels unsafe to access
> tuples that might have been vacuumed/pruned away. Is it possible, for
> example, that the page we are accessing has been truncated away in the
> meantime?

It should be OK; we do it in other places too. For example, heapam_tuple_lock() follows the ctid after the call to heap_lock_tuple() when called with TUPLE_LOCK_FLAG_FIND_LAST_VERSION.

heap_lock_updated_tuple_rec() handles the case where the tuple is missing altogether, and the xmin == priorXmax check ensures that it's the correct tuple.
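
To illustrate, the validation looks roughly like this (condensed and paraphrased from memory of heap_lock_updated_tuple_rec(), using its 'mytup' and 'priorXmax' variables; see the actual code for the full logic):

        /*
         * After fetching the tuple that the followed ctid points to,
         * check that it really is the next version in the chain.  If the
         * line pointer was recycled and now holds an unrelated tuple, its
         * xmin won't match the xmax (priorXmax) that we followed.
         */
        if (!TransactionIdEquals(HeapTupleHeaderGetXmin(mytup.t_data),
                                 priorXmax))
        {
                result = TM_Ok;
                goto out_locked;
        }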

Vacuum acquires an AccessExclusiveLock when truncating the relation, which ensures that no backend can be in the middle of following the update chain. That's not great - it would be nice to not need the AccessExclusiveLock - but we do rely on it in many places currently.

>> I guess that works too, but the downside is that we can't vacuum aborted
>> tuples as fast.

> I don't know why we need aborted tuples to be vacuumed so much faster
> than dead tuples caused by inserts/updates. I would think that you
> would generally have far more of those. In our code base we have
> chosen this approach, since this fix also safeguards us against
> other potentially problematic cases where the ctid pointer is followed.

Yeah, for most workloads it's probably not important that aborted tuples are vacuumed quickly. But I could imagine some workloads with lots of rollbacks and long-running transactions where it would matter. Making a behavioral change like that in back-branches is a bad idea when we have a much more localized patch available.

>>   * The initial tuple is assumed to be already locked.
>>
>> But AFAICS that's not true. The caller holds the heavy-weight tuple
>> lock.

> This does not look true to me either.

> I simplified your patch a little bit by extracting common code to the
> heap_lock_updated_tuple() function.

Thanks! That's not quite correct though. This is very subtle, but the 'tuple' argument to heap_lock_updated_tuple() points to the buffer holding the original tuple. So doing HeapTupleHeaderGetUpdateXid(tuple->t_data) reads the current xmax of the tuple, which can now be different from what it was earlier, i.e. different from the xwait that we waited for. It's important that the 'ctid' and 'priorXmax' that we use to follow the chain are read together.
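
To spell out the invariant with a sketch (an illustration of the rule, not a verbatim quote of heapam.c): all the values must be captured together, while the buffer content lock is held, and only those local copies may be used afterwards:

        LockBuffer(buffer, BUFFER_LOCK_SHARE);
        /*
         * Capture xmax, infomask and ctid in one critical section.
         * Pairing a ctid read now with an xmax read on an earlier visit
         * to the page is exactly the race: another backend may have
         * locked or updated the tuple in between.
         */
        xwait = HeapTupleHeaderGetRawXmax(tuple->t_data);
        infomask = tuple->t_data->t_infomask;
        ItemPointerCopy(&tuple->t_data->t_ctid, &t_ctid);
        LockBuffer(buffer, BUFFER_LOCK_UNLOCK);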

I expanded the test to demonstrate what can go wrong with that. If the original tuple was locked or updated concurrently, we would incorrectly try to follow the update chain again and get the same assertion failure.


Hmm, now that I look at it, this existing check looks a little suspicious too:

        /*
         * If the tuple has not been updated, or has moved into another
         * partition (effectively a delete) stop here.
         */
        if (!HeapTupleHeaderIndicatesMovedPartitions(tuple->t_data) &&
                !ItemPointerEquals(&tuple->t_self, ctid))
        {

It's possible that tuple->t_data->t_ctid is modified concurrently by another backend while we're doing the HeapTupleHeaderIndicatesMovedPartitions() check. I suppose it's harmless: if the tuple is updated concurrently, then depending on the timing we'll either exit here quickly with TM_Ok, or we'll proceed to follow the ctid, fail the priorXmax check, and return TM_Ok from there.

However, it seems sketchy to read t_ctid without holding a lock on the buffer; the read is not guaranteed to be atomic.
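
For reference, ItemPointerData (from src/include/storage/itemptr.h, reproduced here from memory) is a plain 6-byte struct, so a concurrent writer could in principle produce a torn read that pairs the new block number with the old offset:

        typedef struct ItemPointerData
        {
                BlockIdData ip_blkid;   /* block number (4 bytes) */
                OffsetNumber ip_posid;  /* item slot within the block (2 bytes) */
        } ItemPointerData;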

Attached is a new patch version:

* I kept the code to get the updating xid from a multixid in heap_lock_updated_tuple() like you did, but it now correctly uses the original xmax of the tuple, instead of reading it from the buffer.

* Modified that HeapTupleHeaderIndicatesMovedPartitions() call to use the passed-in ctid instead of reading it from the buffer.

* I removed the 'tuple' pointer argument from heap_lock_updated_tuple() altogether. Seems safer that way. The function really shouldn't be accessing tuple->t_data because we're not holding a buffer lock. The "ItemPointerEquals(&tuple->t_self, ctid)" check was safe, because tuple->t_self is just local memory (see the HeapTupleData definition below), but I moved it to the callers so that I could remove the 'tuple' argument.

* Fixed the comment to not claim that the initial tuple has already been locked.
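
For context, the local-memory vs. shared-buffer distinction above comes from HeapTupleData (src/include/access/htup.h, reproduced here from memory): t_self and t_len are ordinary backend-local fields, while t_data usually points into a shared buffer:

        typedef struct HeapTupleData
        {
                uint32          t_len;          /* length of *t_data */
                ItemPointerData t_self;         /* SelfItemPointer (local) */
                Oid             t_tableOid;     /* table the tuple came from */
                HeapTupleHeader t_data;         /* -> tuple header, typically
                                                 * in a shared buffer */
        } HeapTupleData;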

- Heikki
From e522298608362461c8348ed883633668ad9ff187 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <[email protected]>
Date: Mon, 15 Dec 2025 18:44:23 +0200
Subject: [PATCH v3 1/1] Fix bug in following update chain when locking a heap
 tuple

Author: Jasper Smit <[email protected]>
Discussion: https://www.postgresql.org/message-id/CAOG%2BRQ74x0q%3DkgBBQ%3DmezuvOeZBfSxM1qu_o0V28bwDz3dHxLw%40mail.gmail.com
---
 src/backend/access/heap/heapam.c              | 47 ++++++++----
 src/test/modules/injection_points/Makefile    |  3 +-
 .../expected/heap_lock_update.out             | 73 +++++++++++++++++++
 src/test/modules/injection_points/meson.build |  1 +
 .../specs/heap_lock_update.spec               | 71 ++++++++++++++++++
 5 files changed, 179 insertions(+), 16 deletions(-)
 create mode 100644 src/test/modules/injection_points/expected/heap_lock_update.out
 create mode 100644 src/test/modules/injection_points/specs/heap_lock_update.spec

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 225f9829f22..53b217ea50b 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -85,8 +85,11 @@ static void compute_new_xmax_infomask(TransactionId xmax, uint16 old_infomask,
 									  LockTupleMode mode, bool is_update,
 									  TransactionId *result_xmax, uint16 *result_infomask,
 									  uint16 *result_infomask2);
-static TM_Result heap_lock_updated_tuple(Relation rel, HeapTuple tuple,
-										 const ItemPointerData *ctid, TransactionId xid,
+static TM_Result heap_lock_updated_tuple(Relation rel,
+										 uint16 prior_infomask,
+										 TransactionId prior_rawxmax,
+										 const ItemPointerData *prior_ctid,
+										 TransactionId xid,
 										 LockTupleMode mode);
 static void GetMultiXactIdHintBits(MultiXactId multi, uint16 *new_infomask,
 								   uint16 *new_infomask2);
@@ -4816,11 +4819,12 @@ l3:
 				 * If there are updates, follow the update chain; bail out if
 				 * that cannot be done.
 				 */
-				if (follow_updates && updated)
+				if (follow_updates && updated && !ItemPointerEquals(&tuple->t_self, &t_ctid))
 				{
 					TM_Result	res;
 
-					res = heap_lock_updated_tuple(relation, tuple, &t_ctid,
+					res = heap_lock_updated_tuple(relation,
+												  infomask, xwait, &t_ctid,
 												  GetCurrentTransactionId(),
 												  mode);
 					if (res != TM_Ok)
@@ -5063,11 +5067,13 @@ l3:
 			}
 
 			/* if there are updates, follow the update chain */
-			if (follow_updates && !HEAP_XMAX_IS_LOCKED_ONLY(infomask))
+			if (follow_updates && !HEAP_XMAX_IS_LOCKED_ONLY(infomask) &&
+				!ItemPointerEquals(&tuple->t_self, &t_ctid))
 			{
 				TM_Result	res;
 
-				res = heap_lock_updated_tuple(relation, tuple, &t_ctid,
+				res = heap_lock_updated_tuple(relation,
+											  infomask, xwait, &t_ctid,
 											  GetCurrentTransactionId(),
 											  mode);
 				if (res != TM_Ok)
@@ -5721,7 +5727,8 @@ test_lockmode_for_conflict(MultiXactStatus status, TransactionId xid,
  * version as well.
  */
 static TM_Result
-heap_lock_updated_tuple_rec(Relation rel, const ItemPointerData *tid, TransactionId xid,
+heap_lock_updated_tuple_rec(Relation rel, TransactionId priorXmax,
+							const ItemPointerData *tid, TransactionId xid,
 							LockTupleMode mode)
 {
 	TM_Result	result;
@@ -5734,7 +5741,6 @@ heap_lock_updated_tuple_rec(Relation rel, const ItemPointerData *tid, Transactio
 				old_infomask2;
 	TransactionId xmax,
 				new_xmax;
-	TransactionId priorXmax = InvalidTransactionId;
 	bool		cleared_all_frozen = false;
 	bool		pinned_desired_page;
 	Buffer		vmbuffer = InvalidBuffer;
@@ -6048,7 +6054,10 @@ out_unlocked:
  *		Follow update chain when locking an updated tuple, acquiring locks (row
  *		marks) on the updated versions.
  *
- * The initial tuple is assumed to be already locked.
+ * 'prior_infomask', 'prior_raw_xmax' and 'prior_ctid' are the corresponding
+ * fields from the initial tuple.  We will lock the tuples starting from the
+ * one that 'prior_ctid' points to.  Locking the initial tuple where those
+ * values came from is the caller's responsibility.
  *
  * This function doesn't check visibility, it just unconditionally marks the
  * tuple(s) as locked.  If any tuple in the updated chain is being deleted
@@ -6066,16 +6075,22 @@ out_unlocked:
  * levels, because that would lead to a serializability failure.
  */
 static TM_Result
-heap_lock_updated_tuple(Relation rel, HeapTuple tuple, const ItemPointerData *ctid,
+heap_lock_updated_tuple(Relation rel,
+						uint16 prior_infomask,
+						TransactionId prior_raw_xmax,
+						const ItemPointerData *prior_ctid,
 						TransactionId xid, LockTupleMode mode)
 {
+	INJECTION_POINT("heap_lock_updated_tuple", NULL);
+
 	/*
-	 * If the tuple has not been updated, or has moved into another partition
-	 * (effectively a delete) stop here.
+	 * If the tuple has moved into another partition (effectively a delete)
+	 * stop here.
 	 */
-	if (!HeapTupleHeaderIndicatesMovedPartitions(tuple->t_data) &&
-		!ItemPointerEquals(&tuple->t_self, ctid))
+	if (!ItemPointerIndicatesMovedPartitions(prior_ctid))
 	{
+		TransactionId prior_xmax;
+
 		/*
 		 * If this is the first possibly-multixact-able operation in the
 		 * current transaction, set my per-backend OldestMemberMXactId
@@ -6087,7 +6102,9 @@ heap_lock_updated_tuple(Relation rel, HeapTuple tuple, const ItemPointerData *ct
 		 */
 		MultiXactIdSetOldestMember();
 
-		return heap_lock_updated_tuple_rec(rel, ctid, xid, mode);
+		prior_xmax = (prior_infomask & HEAP_XMAX_IS_MULTI) ?
+			MultiXactIdGetUpdateXid(prior_raw_xmax, prior_infomask) : prior_raw_xmax;
+		return heap_lock_updated_tuple_rec(rel, prior_xmax, prior_ctid, xid, mode);
 	}
 
 	/* nothing to lock */
diff --git a/src/test/modules/injection_points/Makefile b/src/test/modules/injection_points/Makefile
index bfdb3f53377..cc9be6dcdd2 100644
--- a/src/test/modules/injection_points/Makefile
+++ b/src/test/modules/injection_points/Makefile
@@ -14,7 +14,8 @@ REGRESS_OPTS = --dlpath=$(top_builddir)/src/test/regress
 
 ISOLATION = basic \
 	    inplace \
-	    syscache-update-pruned
+	    syscache-update-pruned \
+	    heap_lock_update
 
 # Temporarily disabled because of flakiness
 #ISOLATION =+
diff --git a/src/test/modules/injection_points/expected/heap_lock_update.out b/src/test/modules/injection_points/expected/heap_lock_update.out
new file mode 100644
index 00000000000..9828e2b151b
--- /dev/null
+++ b/src/test/modules/injection_points/expected/heap_lock_update.out
@@ -0,0 +1,73 @@
+Parsed test spec with 2 sessions
+
+starting permutation: s1begin s1update s2lock s1abort vacuum reinsert wake
+step s1begin: begin;
+step s1update: update t set id = 1000 where id = 1;
+step s2lock: select * from t where id = 1 for update; <waiting ...>
+step s1abort: abort;
+step vacuum: VACUUM (TRUNCATE off);
+step reinsert: 
+	insert into t values (453) returning ctid; -- Should be (2,1)
+	update t set id = 454 where id = 453 returning ctid;
+
+ctid 
+-----
+(2,1)
+(1 row)
+
+ctid 
+-----
+(2,2)
+(1 row)
+
+step wake: 
+    SELECT FROM injection_points_detach('heap_lock_updated_tuple');
+    SELECT FROM injection_points_wakeup('heap_lock_updated_tuple');
+ <waiting ...>
+step s2lock: <... completed>
+id
+--
+ 1
+(1 row)
+
+step wake: <... completed>
+
+starting permutation: s1begin s1update s2lock s1abort vacuum reinsert_and_lock wake
+step s1begin: begin;
+step s1update: update t set id = 1000 where id = 1;
+step s2lock: select * from t where id = 1 for update; <waiting ...>
+step s1abort: abort;
+step vacuum: VACUUM (TRUNCATE off);
+step reinsert_and_lock: 
+        begin;
+	insert into t values (453) returning ctid; -- Should be (2,1)
+	select ctid, * from t where id = 1 for update;
+	commit;
+	update t set id = 454 where id = 453 returning ctid;
+
+ctid 
+-----
+(2,1)
+(1 row)
+
+ctid |id
+-----+--
+(0,1)| 1
+(1 row)
+
+ctid 
+-----
+(2,2)
+(1 row)
+
+step wake: 
+    SELECT FROM injection_points_detach('heap_lock_updated_tuple');
+    SELECT FROM injection_points_wakeup('heap_lock_updated_tuple');
+ <waiting ...>
+step s2lock: <... completed>
+id
+--
+ 1
+(1 row)
+
+step wake: <... completed>
diff --git a/src/test/modules/injection_points/meson.build b/src/test/modules/injection_points/meson.build
index 493e11053dc..c9186ae04d1 100644
--- a/src/test/modules/injection_points/meson.build
+++ b/src/test/modules/injection_points/meson.build
@@ -46,6 +46,7 @@ tests += {
       'basic',
       'inplace',
       'syscache-update-pruned',
+      'heap_lock_update',
       # temporarily disabled because of flakiness
       # 'index-concurrently-upsert',
       # 'index-concurrently-upsert-predicate',
diff --git a/src/test/modules/injection_points/specs/heap_lock_update.spec b/src/test/modules/injection_points/specs/heap_lock_update.spec
new file mode 100644
index 00000000000..93887312815
--- /dev/null
+++ b/src/test/modules/injection_points/specs/heap_lock_update.spec
@@ -0,0 +1,71 @@
+# Test race condition in tuple locking
+#
+# This is a reproducer for the bug reported at:
+# https://www.postgresql.org/message-id/CAOG%2BRQ74x0q%3DkgBBQ%3DmezuvOeZBfSxM1qu_o0V28bwDz3dHxLw%40mail.gmail.com
+
+setup
+{
+    CREATE EXTENSION injection_points;
+
+    create table t (id int);
+    insert into t (select generate_series(1, 452));
+}
+teardown
+{
+    drop table t;
+    DROP EXTENSION injection_points;
+}
+
+session s1
+step s1begin	{ begin; }
+step s1update   { update t set id = 1000 where id = 1; }
+step s1abort	{ abort; }
+step vacuum	{ VACUUM (TRUNCATE off); }
+step reinsert   {
+	insert into t values (453) returning ctid; -- Should be (2,1)
+	update t set id = 454 where id = 453 returning ctid;
+}
+
+# Same as 'reinsert', but we also stamp the original tuple with the
+# same 'xmax' as the re-inserted one.
+step reinsert_and_lock {
+        begin;
+	insert into t values (453) returning ctid; -- Should be (2,1)
+	select ctid, * from t where id = 1 for update;
+	commit;
+	update t set id = 454 where id = 453 returning ctid;
+}
+
+step wake	{
+    SELECT FROM injection_points_detach('heap_lock_updated_tuple');
+    SELECT FROM injection_points_wakeup('heap_lock_updated_tuple');
+}
+
+session s2
+setup	{
+	SELECT FROM injection_points_set_local();
+	SELECT FROM injection_points_attach('heap_lock_updated_tuple', 'wait');
+}
+step s2lock	{ select * from t where id = 1 for update; }
+
+# Basic case
+permutation
+	s1begin
+	s1update
+	s2lock
+	s1abort
+	vacuum
+	reinsert
+	wake(s2lock)
+
+# Variant where the re-inserted tuple is also locked by the inserting
+# transaction. This failed an earlier version of the fix during
+# development.
+permutation
+	s1begin
+	s1update
+	s2lock
+	s1abort
+	vacuum
+	reinsert_and_lock
+	wake(s2lock)
-- 
2.47.3
