Re: IPC/MultixactCreation on the Standby server

Heikki Linnakangas Mon, 01 Dec 2025 04:40:39 -0800

On 30/11/2025 14:15, Andrey Borodin wrote:

On 29 Nov 2025, at 00:51, Heikki Linnakangas <[email protected]> wrote:

I didn't understand why the 'kill9' and 'poll_start' stuff is
needed. We have plenty of tests that kill the server with regular
"$node->stop('immediate')", and restart the server normally. The
checkpoint in the middle of the tests seems unnecessary too. I
removed all that, and the test still seems to work. Was there a
particular reason for them?


In current shutdown sequence test seems to be reproducing corruption
without checkpointing. I recollect that in July standby deadlock was
reachable without checkpoint, but corruption was not. But now it
seems test is working.

Ok.

I moved the wraparound test to a separate test file and commit.
More test coverage is good, but it's quite separate from the
bugfix and the wraparound related test shares very little with the
other test. The wraparound test needs a little more cleanup: use
plain perl instead of 'dd' and 'rm' for the file operations, for
example. (I did that with the tests in the 64-bit mxoff patches,
so we could copy from there.)


PFA test version without dd and rm.

Thanks! I will focus on the main patch and TAP test now, but will committhe wraparound test separately afterwards. At quick glance, it looksgood now.

Did I get your right, that we do not backport wraparound test,
backport fixes for 001_multixact.pl test down to 17 where it
appeared?

Yes, that's my plan. Except that 001_multixact.pl appeared in v18, not v17.

First two patches are v13 intact, second pair is my suggestions.

Thanks, here's a new set of patches, now with backpatched versions forall the branches. As you said, there were a number of differencesbetween branches:

- On master, don't include the compatibility hacks for reading WALgenerated with older minor versions. Because WAL is not compatibleacross major versions anyway.

- REL_18_STABLE didn't have the SimpleLruZeroAndWritePage() function(introduced in commit c616785516).

- REL_17_STABLE didn't have the 001_multixact.pl TAP test. So I didn'tbackport the new TAP test to v17 and below either.

- REL_16_STABLE used 32-bit SLRU page numbers, didn't have bank locks,and used a simple sleep-loop instead of the condition variable.


- REL_15_STABLE and REL_14_STABLE: no conflicts from REL_16_STABLE

All of those conflicts were pretty straightforward to handle, but it'senough code churn for silly mistakes to slip in, especially when the TAPtest didn't apply. So if you have a chance, please help to review andtest each of these backpatched versions too.

In addition to the backpatching, I did some more cosmetic cleanups tothe TAP test.


- Heikki

From b67d9d4f87395a1eda7d6544c223f9f930026e7c Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <[email protected]>
Date: Mon, 1 Dec 2025 14:11:10 +0200
Subject: [PATCH v15-master 1/1] Set next multixid's offset when creating a new
 multixid
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

With this commit, the next multixid's offset will always be set on the
offsets page, by the time that a backend might try to read it, so we
no longer need the waiting mechanism with the condition variable. In
other words, this eliminates "corner case 2" mentioned in the
comments.

The waiting mechanism was broken in a few scenarios:

- When nextMulti was advanced without WAL-logging the next
  multixid. For example, if a later multixid was already assigned and
  WAL-logged before the previous one was WAL-logged, and then the
  server crashed. In that case the next offset would never be set in
  the offsets SLRU, and a query trying to read it would get stuck
  waiting for it. Same thing could happen if pg_resetwal was used to
  forcibly advance nextMulti.

- In hot standby mode, a deadlock could happen where one backend waits
  for the next multixid assignment record, but WAL replay is not
  advancing because of a recovery conflict with the waiting backend.

The old TAP test used carefully placed injection points to exercise
the old waiting code, but now that the waiting code is gone, much of
the old test is no longer relevant. Rewrite the test to reproduce the
IPC/MultixactCreation hang after crash recovery instead, and to verify
that previously recorded multixids stay readable.

Backpatch to all supported versions. In back-branches, we still need
to be able to read WAL that was generated before this fix, so in the
back-branches this includes a hack to initialize the next offsets page
when replaying XLOG_MULTIXACT_CREATE_ID for the last multixid on a
page. On 'master', bump XLOG_PAGE_MAGIC instead to indicate that the
WAL is not compatible.

Author: Andrey Borodin <[email protected]>
Reviewed-by: Dmitry Yurichev <[email protected]>
Reviewed-by: Álvaro Herrera <[email protected]>
Reviewed-by: Kirill Reshke <[email protected]>
Reviewed-by: Ivan Bykov <[email protected]>
Reviewed-by: Chao Li <[email protected]>
Discussion: https://www.postgresql.org/message-id/[email protected]
Backpatch-through: 14
---
 src/backend/access/transam/multixact.c        | 177 +++++++++---------
 src/include/access/xlog_internal.h            |   2 +-
 src/test/modules/test_slru/t/001_multixact.pl | 116 +++---------
 src/test/modules/test_slru/test_multixact.c   |   5 +-
 4 files changed, 124 insertions(+), 176 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 9d5f130af7e..0f01390c153 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -79,7 +79,6 @@
 #include "pg_trace.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
-#include "storage/condition_variable.h"
 #include "storage/pmsignal.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
@@ -271,12 +270,6 @@ typedef struct MultiXactStateData
 	/* support for members anti-wraparound measures */
 	MultiXactOffset offsetStopLimit;	/* known if oldestOffsetKnown */
 
-	/*
-	 * This is used to sleep until a multixact offset is written when we want
-	 * to create the next one.
-	 */
-	ConditionVariable nextoff_cv;
-
 	/*
 	 * Per-backend data starts here.  We have two arrays stored in the area
 	 * immediately following the MultiXactStateData struct. Each is indexed by
@@ -912,13 +905,33 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
 	int			entryno;
 	int			slotno;
 	MultiXactOffset *offptr;
-	int			i;
+	MultiXactId next;
+	int64		next_pageno;
+	int			next_entryno;
+	MultiXactOffset *next_offptr;
 	LWLock	   *lock;
 	LWLock	   *prevlock = NULL;
 
+	/* position of this multixid in the offsets SLRU area  */
 	pageno = MultiXactIdToOffsetPage(multi);
 	entryno = MultiXactIdToOffsetEntry(multi);
 
+	/* position of the next multixid */
+	next = multi + 1;
+	if (next < FirstMultiXactId)
+		next = FirstMultiXactId;
+	next_pageno = MultiXactIdToOffsetPage(next);
+	next_entryno = MultiXactIdToOffsetEntry(next);
+
+	/*
+	 * Set the starting offset of this multixid's members.
+	 *
+	 * In the common case, it was already be set by the previous
+	 * RecordNewMultiXact call, as this was the next multixid of the previous
+	 * multixid.  But if multiple backends are generating multixids
+	 * concurrently, we might race ahead and get called before the previous
+	 * multixid.
+	 */
 	lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
 	LWLockAcquire(lock, LW_EXCLUSIVE);
 
@@ -933,22 +946,50 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
 	offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
 	offptr += entryno;
 
-	*offptr = offset;
+	if (*offptr != offset)
+	{
+		/* should already be set to the correct value, or not at all */
+		Assert(*offptr == 0);
+		*offptr = offset;
+		MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+	}
 
-	MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+	/*
+	 * Set the next multixid's offset to the end of this multixid's members.
+	 */
+	if (next_pageno == pageno)
+	{
+		next_offptr = offptr + 1;
+	}
+	else
+	{
+		/* must be the first entry on the page */
+		Assert(next_entryno == 0 || next == FirstMultiXactId);
+
+		/* Swap the lock for a lock on the next page */
+		LWLockRelease(lock);
+		lock = SimpleLruGetBankLock(MultiXactOffsetCtl, next_pageno);
+		LWLockAcquire(lock, LW_EXCLUSIVE);
+
+		slotno = SimpleLruReadPage(MultiXactOffsetCtl, next_pageno, true, next);
+		next_offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
+		next_offptr += next_entryno;
+	}
+
+	if (*next_offptr != offset + nmembers)
+	{
+		/* should already be set to the correct value, or not at all */
+		Assert(*next_offptr == 0);
+		*next_offptr = offset + nmembers;
+		MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+	}
 
 	/* Release MultiXactOffset SLRU lock. */
 	LWLockRelease(lock);
 
-	/*
-	 * If anybody was waiting to know the offset of this multixact ID we just
-	 * wrote, they can read it now, so wake them up.
-	 */
-	ConditionVariableBroadcast(&MultiXactState->nextoff_cv);
-
 	prev_pageno = -1;
 
-	for (i = 0; i < nmembers; i++, offset++)
+	for (int i = 0; i < nmembers; i++, offset++)
 	{
 		TransactionId *memberptr;
 		uint32	   *flagsptr;
@@ -1138,8 +1179,11 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 			result = FirstMultiXactId;
 	}
 
-	/* Make sure there is room for the MXID in the file.  */
-	ExtendMultiXactOffset(result);
+	/*
+	 * Make sure there is room for the next MXID in the file.  Assigning this
+	 * MXID sets the next MXID's offset already.
+	 */
+	ExtendMultiXactOffset(result + 1);
 
 	/*
 	 * Reserve the members space, similarly to above.  Also, be careful not to
@@ -1300,11 +1344,8 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	int			truelength;
 	MultiXactId oldestMXact;
 	MultiXactId nextMXact;
-	MultiXactId tmpMXact;
-	MultiXactOffset nextOffset;
 	MultiXactMember *ptr;
 	LWLock	   *lock;
-	bool		slept = false;
 
 	debug_elog3(DEBUG2, "GetMembers: asked for %u", multi);
 
@@ -1351,14 +1392,12 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	 * error.
 	 *
 	 * Shared lock is enough here since we aren't modifying any global state.
-	 * Acquire it just long enough to grab the current counter values.  We may
-	 * need both nextMXact and nextOffset; see below.
+	 * Acquire it just long enough to grab the current counter values.
 	 */
 	LWLockAcquire(MultiXactGenLock, LW_SHARED);
 
 	oldestMXact = MultiXactState->oldestMultiXactId;
 	nextMXact = MultiXactState->nextMXact;
-	nextOffset = MultiXactState->nextOffset;
 
 	LWLockRelease(MultiXactGenLock);
 
@@ -1378,38 +1417,17 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	 * Find out the offset at which we need to start reading MultiXactMembers
 	 * and the number of members in the multixact.  We determine the latter as
 	 * the difference between this multixact's starting offset and the next
-	 * one's.  However, there are some corner cases to worry about:
-	 *
-	 * 1. This multixact may be the latest one created, in which case there is
-	 * no next one to look at.  In this case the nextOffset value we just
-	 * saved is the correct endpoint.
+	 * one's.  However, there is one corner case to worry about:
 	 *
-	 * 2. The next multixact may still be in process of being filled in: that
-	 * is, another process may have done GetNewMultiXactId but not yet written
-	 * the offset entry for that ID.  In that scenario, it is guaranteed that
-	 * the offset entry for that multixact exists (because GetNewMultiXactId
-	 * won't release MultiXactGenLock until it does) but contains zero
-	 * (because we are careful to pre-zero offset pages). Because
-	 * GetNewMultiXactId will never return zero as the starting offset for a
-	 * multixact, when we read zero as the next multixact's offset, we know we
-	 * have this case.  We handle this by sleeping on the condition variable
-	 * we have just for this; the process in charge will signal the CV as soon
-	 * as it has finished writing the multixact offset.
-	 *
-	 * 3. Because GetNewMultiXactId increments offset zero to offset one to
-	 * handle case #2, there is an ambiguity near the point of offset
+	 * Because GetNewMultiXactId skips over offset zero, to reserve zero for
+	 * to mean "unset", there is an ambiguity near the point of offset
 	 * wraparound.  If we see next multixact's offset is one, is that our
 	 * multixact's actual endpoint, or did it end at zero with a subsequent
 	 * increment?  We handle this using the knowledge that if the zero'th
 	 * member slot wasn't filled, it'll contain zero, and zero isn't a valid
 	 * transaction ID so it can't be a multixact member.  Therefore, if we
 	 * read a zero from the members array, just ignore it.
-	 *
-	 * This is all pretty messy, but the mess occurs only in infrequent corner
-	 * cases, so it seems better than holding the MultiXactGenLock for a long
-	 * time on every multixact creation.
 	 */
-retry:
 	pageno = MultiXactIdToOffsetPage(multi);
 	entryno = MultiXactIdToOffsetEntry(multi);
 
@@ -1417,6 +1435,7 @@ retry:
 	lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
 	LWLockAcquire(lock, LW_EXCLUSIVE);
 
+	/* read this multi's offset */
 	slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, multi);
 	offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
 	offptr += entryno;
@@ -1424,22 +1443,13 @@ retry:
 
 	Assert(offset != 0);
 
-	/*
-	 * Use the same increment rule as GetNewMultiXactId(), that is, don't
-	 * handle wraparound explicitly until needed.
-	 */
-	tmpMXact = multi + 1;
-
-	if (nextMXact == tmpMXact)
-	{
-		/* Corner case 1: there is no next multixact */
-		length = nextOffset - offset;
-	}
-	else
+	/* read next multi's offset */
 	{
+		MultiXactId tmpMXact;
 		MultiXactOffset nextMXOffset;
 
 		/* handle wraparound if needed */
+		tmpMXact = multi + 1;
 		if (tmpMXact < FirstMultiXactId)
 			tmpMXact = FirstMultiXactId;
 
@@ -1472,18 +1482,10 @@ retry:
 		nextMXOffset = *offptr;
 
 		if (nextMXOffset == 0)
-		{
-			/* Corner case 2: next multixact is still being filled in */
-			LWLockRelease(lock);
-			CHECK_FOR_INTERRUPTS();
-
-			INJECTION_POINT("multixact-get-members-cv-sleep", NULL);
-
-			ConditionVariableSleep(&MultiXactState->nextoff_cv,
-								   WAIT_EVENT_MULTIXACT_CREATION);
-			slept = true;
-			goto retry;
-		}
+			ereport(ERROR,
+					(errcode(ERRCODE_DATA_CORRUPTED),
+					 errmsg("MultiXact %u has invalid next offset",
+							multi)));
 
 		length = nextMXOffset - offset;
 	}
@@ -1491,12 +1493,7 @@ retry:
 	LWLockRelease(lock);
 	lock = NULL;
 
-	/*
-	 * If we slept above, clean up state; it's no longer needed.
-	 */
-	if (slept)
-		ConditionVariableCancelSleep();
-
+	/* read the members */
 	ptr = (MultiXactMember *) palloc(length * sizeof(MultiXactMember));
 
 	truelength = 0;
@@ -1539,7 +1536,7 @@ retry:
 
 		if (!TransactionIdIsValid(*xactptr))
 		{
-			/* Corner case 3: we must be looking at unused slot zero */
+			/* Corner case: we must be looking at unused slot zero */
 			Assert(offset == 0);
 			continue;
 		}
@@ -1986,7 +1983,6 @@ MultiXactShmemInit(void)
 
 		/* Make sure we zero out the per-backend state */
 		MemSet(MultiXactState, 0, SHARED_MULTIXACT_STATE_SIZE);
-		ConditionVariableInit(&MultiXactState->nextoff_cv);
 	}
 	else
 		Assert(found);
@@ -2132,26 +2128,35 @@ TrimMultiXact(void)
 						pageno);
 
 	/*
+	 * Set the offset of the last multixact on the offsets page.
+	 *
+	 * This is normally done in RecordNewMultiXact() of the previous
+	 * multixact, but let's be sure the next page exists, if the nextMulti was
+	 * reset with pg_resetwal, for example.
+	 *
 	 * Zero out the remainder of the current offsets page.  See notes in
 	 * TrimCLOG() for background.  Unlike CLOG, some WAL record covers every
 	 * pg_multixact SLRU mutation.  Since, also unlike CLOG, we ignore the WAL
 	 * rule "write xlog before data," nextMXact successors may carry obsolete,
-	 * nonzero offset values.  Zero those so case 2 of GetMultiXactIdMembers()
-	 * operates normally.
+	 * nonzero offset values.
 	 */
 	entryno = MultiXactIdToOffsetEntry(nextMXact);
-	if (entryno != 0)
 	{
 		int			slotno;
 		MultiXactOffset *offptr;
 		LWLock	   *lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
 
 		LWLockAcquire(lock, LW_EXCLUSIVE);
-		slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, nextMXact);
+		if (entryno == 0)
+			slotno = SimpleLruZeroPage(MultiXactOffsetCtl, pageno);
+		else
+			slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, nextMXact);
 		offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
 		offptr += entryno;
 
-		MemSet(offptr, 0, BLCKSZ - (entryno * sizeof(MultiXactOffset)));
+		*offptr = offset;
+		if (entryno != 0 && (entryno + 1) * sizeof(MultiXactOffset) != BLCKSZ)
+			MemSet(offptr + 1, 0, BLCKSZ - (entryno + 1) * sizeof(MultiXactOffset));
 
 		MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
 		LWLockRelease(lock);
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 34deb2fe5f0..3be5ecf336f 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -31,7 +31,7 @@
 /*
  * Each page of XLOG file has a header like this:
  */
-#define XLOG_PAGE_MAGIC 0xD119	/* can be used as WAL version indicator */
+#define XLOG_PAGE_MAGIC 0xD11A	/* can be used as WAL version indicator */
 
 typedef struct XLogPageHeaderData
 {
diff --git a/src/test/modules/test_slru/t/001_multixact.pl b/src/test/modules/test_slru/t/001_multixact.pl
index e2b567a603d..7837eb810f0 100644
--- a/src/test/modules/test_slru/t/001_multixact.pl
+++ b/src/test/modules/test_slru/t/001_multixact.pl
@@ -1,10 +1,6 @@
 # Copyright (c) 2024-2025, PostgreSQL Global Development Group
 
-# This test verifies edge case of reading a multixact:
-# when we have multixact that is followed by exactly one another multixact,
-# and another multixact have no offset yet, we must wait until this offset
-# becomes observable. Previously we used to wait for 1ms in a loop in this
-# case, but now we use CV for this. This test is exercising such a sleep.
+# Test multixid corner cases.
 
 use strict;
 use warnings FATAL => 'all';
@@ -19,9 +15,7 @@ if ($ENV{enable_injection_points} ne 'yes')
 	plan skip_all => 'Injection points not supported by this build';
 }
 
-my ($node, $result);
-
-$node = PostgreSQL::Test::Cluster->new('mike');
+my $node = PostgreSQL::Test::Cluster->new('main');
 $node->init;
 $node->append_conf('postgresql.conf',
 	"shared_preload_libraries = 'test_slru,injection_points'");
@@ -29,95 +23,47 @@ $node->start;
 $node->safe_psql('postgres', q(CREATE EXTENSION injection_points));
 $node->safe_psql('postgres', q(CREATE EXTENSION test_slru));
 
-# Test for Multixact generation edge case
-$node->safe_psql('postgres',
-	q{select injection_points_attach('test-multixact-read','wait')});
-$node->safe_psql('postgres',
-	q{select injection_points_attach('multixact-get-members-cv-sleep','wait')}
-);
+# This test creates three multixacts. The middle one is never
+# WAL-logged or recorded on the offsets page, because we pause the
+# backend and crash the server before that. After restart, verify that
+# the other multixacts are readable, despite the middle one being
+# lost.
 
-# This session must observe sleep on the condition variable while generating a
-# multixact.  To achieve this it first will create a multixact, then pause
-# before reading it.
-my $observer = $node->background_psql('postgres');
-
-# This query will create a multixact, and hang just before reading it.
-$observer->query_until(
-	qr/start/,
-	q{
-	\echo start
-	SELECT test_read_multixact(test_create_multixact());
-});
-$node->wait_for_event('client backend', 'test-multixact-read');
-
-# This session will create the next Multixact. This is necessary to avoid
-# multixact.c's non-sleeping edge case 1.
-my $creator = $node->background_psql('postgres');
+# Create the first multixact
+my $bg_psql = $node->background_psql('postgres');
+my $multi1 = $bg_psql->query_safe(q(SELECT test_create_multixact();));
+
+# Assign the middle multixact. Use an injection point to prevent it
+# from being fully recorded.
 $node->safe_psql('postgres',
 	q{SELECT injection_points_attach('multixact-create-from-members','wait');}
 );
 
-# We expect this query to hang in the critical section after generating new
-# multixact, but before filling its offset into SLRU.
-# Running an injection point inside a critical section requires it to be
-# loaded beforehand.
-$creator->query_until(
-	qr/start/, q{
-	\echo start
+$bg_psql->query_until(
+	qr/assigning lost multi/, q(
+\echo assigning lost multi
 	SELECT test_create_multixact();
-});
+));
 
 $node->wait_for_event('client backend', 'multixact-create-from-members');
-
-# Ensure we have the backends waiting that we expect
-is( $node->safe_psql(
-		'postgres',
-		q{SELECT string_agg(wait_event, ', ' ORDER BY wait_event)
-		FROM pg_stat_activity WHERE wait_event_type = 'InjectionPoint'}
-	),
-	'multixact-create-from-members, test-multixact-read',
-	"matching injection point waits");
-
-# Now wake observer to get it to read the initial multixact.  A subsequent
-# multixact already exists, but that one doesn't have an offset assigned, so
-# this will hit multixact.c's edge case 2.
-$node->safe_psql('postgres',
-	q{SELECT injection_points_wakeup('test-multixact-read')});
-$node->wait_for_event('client backend', 'multixact-get-members-cv-sleep');
-
-# Ensure we have the backends waiting that we expect
-is( $node->safe_psql(
-		'postgres',
-		q{SELECT string_agg(wait_event, ', ' ORDER BY wait_event)
-		FROM pg_stat_activity WHERE wait_event_type = 'InjectionPoint'}
-	),
-	'multixact-create-from-members, multixact-get-members-cv-sleep',
-	"matching injection point waits");
-
-# Now we have two backends waiting in multixact-create-from-members and
-# multixact-get-members-cv-sleep.  Also we have 3 injections points set to wait.
-# If we wakeup multixact-get-members-cv-sleep it will happen again, so we must
-# detach it first. So let's detach all injection points, then wake up all
-# backends.
-
-$node->safe_psql('postgres',
-	q{SELECT injection_points_detach('test-multixact-read')});
 $node->safe_psql('postgres',
 	q{SELECT injection_points_detach('multixact-create-from-members')});
-$node->safe_psql('postgres',
-	q{SELECT injection_points_detach('multixact-get-members-cv-sleep')});
 
-$node->safe_psql('postgres',
-	q{SELECT injection_points_wakeup('multixact-create-from-members')});
-$node->safe_psql('postgres',
-	q{SELECT injection_points_wakeup('multixact-get-members-cv-sleep')});
+# Create the third multixid
+my $multi2 = $node->safe_psql('postgres', q{SELECT test_create_multixact();});
+
+# All set and done, it's time for hard restart
+$node->stop('immediate');
+$node->start;
+$bg_psql->{run}->finish;
 
-# Background psql will now be able to read the result and disconnect.
-$observer->quit;
-$creator->quit;
+# Verify that the recorded multixids are readable
+is( $node->safe_psql('postgres', qq{SELECT test_read_multixact('$multi1');}),
+	'',
+	'first recorded multi is readable');
 
-$node->stop;
+is( $node->safe_psql('postgres', qq{SELECT test_read_multixact('$multi2');}),
+	'',
+	'second recorded multi is readable');
 
-# If we reached this point - everything is OK.
-ok(1);
 done_testing();
diff --git a/src/test/modules/test_slru/test_multixact.c b/src/test/modules/test_slru/test_multixact.c
index 6c9b0420717..8fb6c19d70f 100644
--- a/src/test/modules/test_slru/test_multixact.c
+++ b/src/test/modules/test_slru/test_multixact.c
@@ -17,7 +17,6 @@
 #include "access/multixact.h"
 #include "access/xact.h"
 #include "fmgr.h"
-#include "utils/injection_point.h"
 
 PG_FUNCTION_INFO_V1(test_create_multixact);
 PG_FUNCTION_INFO_V1(test_read_multixact);
@@ -37,8 +36,7 @@ test_create_multixact(PG_FUNCTION_ARGS)
 }
 
 /*
- * Reads given multixact after running an injection point. Discards local cache
- * to make a real read.  Tailored for multixact testing.
+ * Reads given multixact.  Discards local cache to make a real read.
  */
 Datum
 test_read_multixact(PG_FUNCTION_ARGS)
@@ -46,7 +44,6 @@ test_read_multixact(PG_FUNCTION_ARGS)
 	MultiXactId id = PG_GETARG_TRANSACTIONID(0);
 	MultiXactMember *members;
 
-	INJECTION_POINT("test-multixact-read", NULL);
 	/* discard caches */
 	AtEOXact_MultiXact();
 
-- 
2.47.3

From a4ed9e8a360de05cddc167f16dd7652242eaef5a Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <[email protected]>
Date: Mon, 1 Dec 2025 14:14:15 +0200
Subject: [PATCH v15-pg14 1/1] Set next multixid's offset when creating a new
 multixid
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

With this commit, the next multixid's offset will always be set on the
offsets page, by the time that a backend might try to read it, so we
no longer need the waiting mechanism with the condition variable. In
other words, this eliminates "corner case 2" mentioned in the
comments.

The waiting mechanism was broken in a few scenarios:

- When nextMulti was advanced without WAL-logging the next
  multixid. For example, if a later multixid was already assigned and
  WAL-logged before the previous one was WAL-logged, and then the
  server crashed. In that case the next offset would never be set in
  the offsets SLRU, and a query trying to read it would get stuck
  waiting for it. Same thing could happen if pg_resetwal was used to
  forcibly advance nextMulti.

- In hot standby mode, a deadlock could happen where one backend waits
  for the next multixid assignment record, but WAL replay is not
  advancing because of a recovery conflict with the waiting backend.

The old TAP test used carefully placed injection points to exercise
the old waiting code, but now that the waiting code is gone, much of
the old test is no longer relevant. Rewrite the test to reproduce the
IPC/MultixactCreation hang after crash recovery instead, and to verify
that previously recorded multixids stay readable.

Backpatch to all supported versions. In back-branches, we still need
to be able to read WAL that was generated before this fix, so in the
back-branches this includes a hack to initialize the next offsets page
when replaying XLOG_MULTIXACT_CREATE_ID for the last multixid on a
page. On 'master', bump XLOG_PAGE_MAGIC instead to indicate that the
WAL is not compatible.

Author: Andrey Borodin <[email protected]>
Reviewed-by: Dmitry Yurichev <[email protected]>
Reviewed-by: Álvaro Herrera <[email protected]>
Reviewed-by: Kirill Reshke <[email protected]>
Reviewed-by: Ivan Bykov <[email protected]>
Reviewed-by: Chao Li <[email protected]>
Discussion: https://www.postgresql.org/message-id/[email protected]
Backpatch-through: 14
---
 src/backend/access/transam/multixact.c | 186 +++++++++++++++++++------
 1 file changed, 145 insertions(+), 41 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index d56269d941c..7013e1b188c 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -337,6 +337,9 @@ static MemoryContext MXactContext = NULL;
 #define debug_elog6(a,b,c,d,e,f)
 #endif
 
+/* hack to deal with WAL generated with older minor versions */
+static int64 pre_initialized_offsets_page = -1;
+
 /* internal MultiXactId management */
 static void MultiXactIdSetOldestVisible(void);
 static void RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
@@ -868,13 +871,61 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
 	int			entryno;
 	int			slotno;
 	MultiXactOffset *offptr;
-	int			i;
+	MultiXactId next;
+	int64		next_pageno;
+	int			next_entryno;
+	MultiXactOffset *next_offptr;
 
 	LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
 
+	/* position of this multixid in the offsets SLRU area  */
 	pageno = MultiXactIdToOffsetPage(multi);
 	entryno = MultiXactIdToOffsetEntry(multi);
 
+	/* position of the next multixid */
+	next = multi + 1;
+	if (next < FirstMultiXactId)
+		next = FirstMultiXactId;
+	next_pageno = MultiXactIdToOffsetPage(next);
+	next_entryno = MultiXactIdToOffsetEntry(next);
+
+	/*
+	 * Older minor versions didn't set the next multixid's offset in this
+	 * function, and therefore didn't initialize the next page until the next
+	 * multixid was assigned.  If we're replaying WAL that was generated by
+	 * such a version, the next page might not be initialized yet.  Initialize
+	 * it now.
+	 */
+	if (InRecovery &&
+		next_pageno != pageno &&
+		MultiXactOffsetCtl->shared->latest_page_number == pageno)
+	{
+		elog(DEBUG1, "next offsets page is not initialized, initializing it now");
+
+		/* Create and zero the page */
+		slotno = SimpleLruZeroPage(MultiXactOffsetCtl, next_pageno);
+
+		/* Make sure it's written out */
+		SimpleLruWritePage(MultiXactOffsetCtl, slotno);
+		Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
+
+		/*
+		 * Remember that we initialized the page, so that we don't zero it
+		 * again at the XLOG_MULTIXACT_ZERO_OFF_PAGE record.
+		 */
+		pre_initialized_offsets_page = next_pageno;
+	}
+
+	/*
+	 * Set the starting offset of this multixid's members.
+	 *
+	 * In the common case, it was already be set by the previous
+	 * RecordNewMultiXact call, as this was the next multixid of the previous
+	 * multixid.  But if multiple backends are generating multixids
+	 * concurrently, we might race ahead and get called before the previous
+	 * multixid.
+	 */
+
 	/*
 	 * Note: we pass the MultiXactId to SimpleLruReadPage as the "transaction"
 	 * to complain about if there's any I/O error.  This is kinda bogus, but
@@ -886,9 +937,37 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
 	offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
 	offptr += entryno;
 
-	*offptr = offset;
+	if (*offptr != offset)
+	{
+		/* should already be set to the correct value, or not at all */
+		Assert(*offptr == 0);
+		*offptr = offset;
+		MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+	}
+
+	/*
+	 * Set the next multixid's offset to the end of this multixid's members.
+	 */
+	if (next_pageno == pageno)
+	{
+		next_offptr = offptr + 1;
+	}
+	else
+	{
+		/* must be the first entry on the page */
+		Assert(next_entryno == 0 || next == FirstMultiXactId);
+		slotno = SimpleLruReadPage(MultiXactOffsetCtl, next_pageno, true, next);
+		next_offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
+		next_offptr += next_entryno;
+	}
 
-	MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+	if (*next_offptr != offset + nmembers)
+	{
+		/* should already be set to the correct value, or not at all */
+		Assert(*next_offptr == 0);
+		*next_offptr = offset + nmembers;
+		MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+	}
 
 	/* Exchange our lock */
 	LWLockRelease(MultiXactOffsetSLRULock);
@@ -897,7 +976,7 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
 
 	prev_pageno = -1;
 
-	for (i = 0; i < nmembers; i++, offset++)
+	for (int i = 0; i < nmembers; i++, offset++)
 	{
 		TransactionId *memberptr;
 		uint32	   *flagsptr;
@@ -1072,8 +1151,11 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 			result = FirstMultiXactId;
 	}
 
-	/* Make sure there is room for the MXID in the file.  */
-	ExtendMultiXactOffset(result);
+	/*
+	 * Make sure there is room for the next MXID in the file.  Assigning this
+	 * MXID sets the next MXID's offset already.
+	 */
+	ExtendMultiXactOffset(result + 1);
 
 	/*
 	 * Reserve the members space, similarly to above.  Also, be careful not to
@@ -1314,21 +1396,14 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	 * one's.  However, there are some corner cases to worry about:
 	 *
 	 * 1. This multixact may be the latest one created, in which case there is
-	 * no next one to look at.  In this case the nextOffset value we just
-	 * saved is the correct endpoint.
+	 * no next one to look at.  The next multixact's offset should be set
+	 * already, as we set it in RecordNewMultiXact(), but we used to not do
+	 * that in older minor versions.  To cope with that case, if this
+	 * multixact is the latest one created, use the nextOffset value we read
+	 * above as the endpoint.
 	 *
-	 * 2. The next multixact may still be in process of being filled in: that
-	 * is, another process may have done GetNewMultiXactId but not yet written
-	 * the offset entry for that ID.  In that scenario, it is guaranteed that
-	 * the offset entry for that multixact exists (because GetNewMultiXactId
-	 * won't release MultiXactGenLock until it does) but contains zero
-	 * (because we are careful to pre-zero offset pages). Because
-	 * GetNewMultiXactId will never return zero as the starting offset for a
-	 * multixact, when we read zero as the next multixact's offset, we know we
-	 * have this case.  We sleep for a bit and try again.
-	 *
-	 * 3. Because GetNewMultiXactId increments offset zero to offset one to
-	 * handle case #2, there is an ambiguity near the point of offset
+	 * 2. Because GetNewMultiXactId skips over offset zero, to reserve zero
+	 * for to mean "unset", there is an ambiguity near the point of offset
 	 * wraparound.  If we see next multixact's offset is one, is that our
 	 * multixact's actual endpoint, or did it end at zero with a subsequent
 	 * increment?  We handle this using the knowledge that if the zero'th
@@ -1340,7 +1415,6 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	 * cases, so it seems better than holding the MultiXactGenLock for a long
 	 * time on every multixact creation.
 	 */
-retry:
 	LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
 
 	pageno = MultiXactIdToOffsetPage(multi);
@@ -1385,13 +1459,10 @@ retry:
 		nextMXOffset = *offptr;
 
 		if (nextMXOffset == 0)
-		{
-			/* Corner case 2: next multixact is still being filled in */
-			LWLockRelease(MultiXactOffsetSLRULock);
-			CHECK_FOR_INTERRUPTS();
-			pg_usleep(1000L);
-			goto retry;
-		}
+			ereport(ERROR,
+					(errcode(ERRCODE_DATA_CORRUPTED),
+					 errmsg("MultiXact %u has invalid next offset",
+							multi)));
 
 		length = nextMXOffset - offset;
 	}
@@ -1427,7 +1498,7 @@ retry:
 
 		if (!TransactionIdIsValid(*xactptr))
 		{
-			/* Corner case 3: we must be looking at unused slot zero */
+			/* Corner case 2: we must be looking at unused slot zero */
 			Assert(offset == 0);
 			continue;
 		}
@@ -2056,24 +2127,34 @@ TrimMultiXact(void)
 	MultiXactOffsetCtl->shared->latest_page_number = pageno;
 
 	/*
+	 * Set the offset of the last multixact on the offsets page.
+	 *
+	 * This is normally done in RecordNewMultiXact() of the previous
+	 * multixact, but we used to not do that in older minor versions.  To
+	 * ensure that the next offset is set if the binary was just upgraded from
+	 * an older minor version, do it now.
+	 *
 	 * Zero out the remainder of the current offsets page.  See notes in
 	 * TrimCLOG() for background.  Unlike CLOG, some WAL record covers every
 	 * pg_multixact SLRU mutation.  Since, also unlike CLOG, we ignore the WAL
 	 * rule "write xlog before data," nextMXact successors may carry obsolete,
-	 * nonzero offset values.  Zero those so case 2 of GetMultiXactIdMembers()
-	 * operates normally.
+	 * nonzero offset values.
 	 */
 	entryno = MultiXactIdToOffsetEntry(nextMXact);
-	if (entryno != 0)
 	{
 		int			slotno;
 		MultiXactOffset *offptr;
 
-		slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, nextMXact);
+		if (entryno == 0)
+			slotno = SimpleLruZeroPage(MultiXactOffsetCtl, pageno);
+		else
+			slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, nextMXact);
 		offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
 		offptr += entryno;
 
-		MemSet(offptr, 0, BLCKSZ - (entryno * sizeof(MultiXactOffset)));
+		*offptr = offset;
+		if (entryno != 0 && (entryno + 1) * sizeof(MultiXactOffset) != BLCKSZ)
+			MemSet(offptr + 1, 0, BLCKSZ - (entryno + 1) * sizeof(MultiXactOffset));
 
 		MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
 	}
@@ -3255,13 +3336,21 @@ multixact_redo(XLogReaderState *record)
 
 		memcpy(&pageno, XLogRecGetData(record), sizeof(int));
 
-		LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
-
-		slotno = ZeroMultiXactOffsetPage(pageno, false);
-		SimpleLruWritePage(MultiXactOffsetCtl, slotno);
-		Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
-
-		LWLockRelease(MultiXactOffsetSLRULock);
+		/*
+		 * Skip the record if we already initialized the page at the previous
+		 * XLOG_MULTIXACT_CREATE_ID record. See RecordNewMultiXact().
+		 */
+		if (pre_initialized_offsets_page != pageno)
+		{
+			LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
+			slotno = ZeroMultiXactOffsetPage(pageno, false);
+			SimpleLruWritePage(MultiXactOffsetCtl, slotno);
+			Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
+			LWLockRelease(MultiXactOffsetSLRULock);
+		}
+		else
+			elog(DEBUG1, "skipping initialization of offsets page %d because it was already initialized on multixid creation", pageno);
+		pre_initialized_offsets_page = -1;
 	}
 	else if (info == XLOG_MULTIXACT_ZERO_MEM_PAGE)
 	{
@@ -3285,6 +3374,21 @@ multixact_redo(XLogReaderState *record)
 		TransactionId max_xid;
 		int			i;
 
+		if (pre_initialized_offsets_page != -1)
+		{
+			/*
+			 * If we implicitly initialized the next offsets page while
+			 * replaying a XLOG_MULTIXACT_CREATE_ID record that was generated
+			 * with an older minor version, we still expect to see a
+			 * XLOG_MULTIXACT_ZERO_OFF_PAGE record for it before any other
+			 * XLOG_MULTIXACT_CREATE_ID records.  Therefore this case should
+			 * not happen.  If it does, we'll continue with the replay, but
+			 * log a message to note that something's funny.
+			 */
+			elog(LOG, "expected to see a XLOG_MULTIXACT_ZERO_OFF_PAGE record for page that was implicitly initialized earlier");
+			pre_initialized_offsets_page = -1;
+		}
+
 		/* Store the data back into the SLRU files */
 		RecordNewMultiXact(xlrec->mid, xlrec->moff, xlrec->nmembers,
 						   xlrec->members);
-- 
2.47.3

From 2255bed329be004b14f27af5084e7732ab3dd09b Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <[email protected]>
Date: Mon, 1 Dec 2025 14:14:05 +0200
Subject: [PATCH v15-pg15 1/1] Set next multixid's offset when creating a new
 multixid
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

With this commit, the next multixid's offset will always be set on the
offsets page, by the time that a backend might try to read it, so we
no longer need the waiting mechanism with the condition variable. In
other words, this eliminates "corner case 2" mentioned in the
comments.

The waiting mechanism was broken in a few scenarios:

- When nextMulti was advanced without WAL-logging the next
  multixid. For example, if a later multixid was already assigned and
  WAL-logged before the previous one was WAL-logged, and then the
  server crashed. In that case the next offset would never be set in
  the offsets SLRU, and a query trying to read it would get stuck
  waiting for it. Same thing could happen if pg_resetwal was used to
  forcibly advance nextMulti.

- In hot standby mode, a deadlock could happen where one backend waits
  for the next multixid assignment record, but WAL replay is not
  advancing because of a recovery conflict with the waiting backend.

The old TAP test used carefully placed injection points to exercise
the old waiting code, but now that the waiting code is gone, much of
the old test is no longer relevant. Rewrite the test to reproduce the
IPC/MultixactCreation hang after crash recovery instead, and to verify
that previously recorded multixids stay readable.

Backpatch to all supported versions. In back-branches, we still need
to be able to read WAL that was generated before this fix, so in the
back-branches this includes a hack to initialize the next offsets page
when replaying XLOG_MULTIXACT_CREATE_ID for the last multixid on a
page. On 'master', bump XLOG_PAGE_MAGIC instead to indicate that the
WAL is not compatible.

Author: Andrey Borodin <[email protected]>
Reviewed-by: Dmitry Yurichev <[email protected]>
Reviewed-by: Álvaro Herrera <[email protected]>
Reviewed-by: Kirill Reshke <[email protected]>
Reviewed-by: Ivan Bykov <[email protected]>
Reviewed-by: Chao Li <[email protected]>
Discussion: https://www.postgresql.org/message-id/[email protected]
Backpatch-through: 14
---
 src/backend/access/transam/multixact.c | 186 +++++++++++++++++++------
 1 file changed, 145 insertions(+), 41 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 136065125ea..5fc53420655 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -337,6 +337,9 @@ static MemoryContext MXactContext = NULL;
 #define debug_elog6(a,b,c,d,e,f)
 #endif
 
+/* hack to deal with WAL generated with older minor versions */
+static int64 pre_initialized_offsets_page = -1;
+
 /* internal MultiXactId management */
 static void MultiXactIdSetOldestVisible(void);
 static void RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
@@ -868,13 +871,61 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
 	int			entryno;
 	int			slotno;
 	MultiXactOffset *offptr;
-	int			i;
+	MultiXactId next;
+	int64		next_pageno;
+	int			next_entryno;
+	MultiXactOffset *next_offptr;
 
 	LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
 
+	/* position of this multixid in the offsets SLRU area  */
 	pageno = MultiXactIdToOffsetPage(multi);
 	entryno = MultiXactIdToOffsetEntry(multi);
 
+	/* position of the next multixid */
+	next = multi + 1;
+	if (next < FirstMultiXactId)
+		next = FirstMultiXactId;
+	next_pageno = MultiXactIdToOffsetPage(next);
+	next_entryno = MultiXactIdToOffsetEntry(next);
+
+	/*
+	 * Older minor versions didn't set the next multixid's offset in this
+	 * function, and therefore didn't initialize the next page until the next
+	 * multixid was assigned.  If we're replaying WAL that was generated by
+	 * such a version, the next page might not be initialized yet.  Initialize
+	 * it now.
+	 */
+	if (InRecovery &&
+		next_pageno != pageno &&
+		MultiXactOffsetCtl->shared->latest_page_number == pageno)
+	{
+		elog(DEBUG1, "next offsets page is not initialized, initializing it now");
+
+		/* Create and zero the page */
+		slotno = SimpleLruZeroPage(MultiXactOffsetCtl, next_pageno);
+
+		/* Make sure it's written out */
+		SimpleLruWritePage(MultiXactOffsetCtl, slotno);
+		Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
+
+		/*
+		 * Remember that we initialized the page, so that we don't zero it
+		 * again at the XLOG_MULTIXACT_ZERO_OFF_PAGE record.
+		 */
+		pre_initialized_offsets_page = next_pageno;
+	}
+
+	/*
+	 * Set the starting offset of this multixid's members.
+	 *
+	 * In the common case, it was already be set by the previous
+	 * RecordNewMultiXact call, as this was the next multixid of the previous
+	 * multixid.  But if multiple backends are generating multixids
+	 * concurrently, we might race ahead and get called before the previous
+	 * multixid.
+	 */
+
 	/*
 	 * Note: we pass the MultiXactId to SimpleLruReadPage as the "transaction"
 	 * to complain about if there's any I/O error.  This is kinda bogus, but
@@ -886,9 +937,37 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
 	offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
 	offptr += entryno;
 
-	*offptr = offset;
+	if (*offptr != offset)
+	{
+		/* should already be set to the correct value, or not at all */
+		Assert(*offptr == 0);
+		*offptr = offset;
+		MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+	}
+
+	/*
+	 * Set the next multixid's offset to the end of this multixid's members.
+	 */
+	if (next_pageno == pageno)
+	{
+		next_offptr = offptr + 1;
+	}
+	else
+	{
+		/* must be the first entry on the page */
+		Assert(next_entryno == 0 || next == FirstMultiXactId);
+		slotno = SimpleLruReadPage(MultiXactOffsetCtl, next_pageno, true, next);
+		next_offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
+		next_offptr += next_entryno;
+	}
 
-	MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+	if (*next_offptr != offset + nmembers)
+	{
+		/* should already be set to the correct value, or not at all */
+		Assert(*next_offptr == 0);
+		*next_offptr = offset + nmembers;
+		MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+	}
 
 	/* Exchange our lock */
 	LWLockRelease(MultiXactOffsetSLRULock);
@@ -897,7 +976,7 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
 
 	prev_pageno = -1;
 
-	for (i = 0; i < nmembers; i++, offset++)
+	for (int i = 0; i < nmembers; i++, offset++)
 	{
 		TransactionId *memberptr;
 		uint32	   *flagsptr;
@@ -1072,8 +1151,11 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 			result = FirstMultiXactId;
 	}
 
-	/* Make sure there is room for the MXID in the file.  */
-	ExtendMultiXactOffset(result);
+	/*
+	 * Make sure there is room for the next MXID in the file.  Assigning this
+	 * MXID sets the next MXID's offset already.
+	 */
+	ExtendMultiXactOffset(result + 1);
 
 	/*
 	 * Reserve the members space, similarly to above.  Also, be careful not to
@@ -1314,21 +1396,14 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	 * one's.  However, there are some corner cases to worry about:
 	 *
 	 * 1. This multixact may be the latest one created, in which case there is
-	 * no next one to look at.  In this case the nextOffset value we just
-	 * saved is the correct endpoint.
+	 * no next one to look at.  The next multixact's offset should be set
+	 * already, as we set it in RecordNewMultiXact(), but we used to not do
+	 * that in older minor versions.  To cope with that case, if this
+	 * multixact is the latest one created, use the nextOffset value we read
+	 * above as the endpoint.
 	 *
-	 * 2. The next multixact may still be in process of being filled in: that
-	 * is, another process may have done GetNewMultiXactId but not yet written
-	 * the offset entry for that ID.  In that scenario, it is guaranteed that
-	 * the offset entry for that multixact exists (because GetNewMultiXactId
-	 * won't release MultiXactGenLock until it does) but contains zero
-	 * (because we are careful to pre-zero offset pages). Because
-	 * GetNewMultiXactId will never return zero as the starting offset for a
-	 * multixact, when we read zero as the next multixact's offset, we know we
-	 * have this case.  We sleep for a bit and try again.
-	 *
-	 * 3. Because GetNewMultiXactId increments offset zero to offset one to
-	 * handle case #2, there is an ambiguity near the point of offset
+	 * 2. Because GetNewMultiXactId skips over offset zero, to reserve zero
+	 * for to mean "unset", there is an ambiguity near the point of offset
 	 * wraparound.  If we see next multixact's offset is one, is that our
 	 * multixact's actual endpoint, or did it end at zero with a subsequent
 	 * increment?  We handle this using the knowledge that if the zero'th
@@ -1340,7 +1415,6 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	 * cases, so it seems better than holding the MultiXactGenLock for a long
 	 * time on every multixact creation.
 	 */
-retry:
 	LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
 
 	pageno = MultiXactIdToOffsetPage(multi);
@@ -1385,13 +1459,10 @@ retry:
 		nextMXOffset = *offptr;
 
 		if (nextMXOffset == 0)
-		{
-			/* Corner case 2: next multixact is still being filled in */
-			LWLockRelease(MultiXactOffsetSLRULock);
-			CHECK_FOR_INTERRUPTS();
-			pg_usleep(1000L);
-			goto retry;
-		}
+			ereport(ERROR,
+					(errcode(ERRCODE_DATA_CORRUPTED),
+					 errmsg("MultiXact %u has invalid next offset",
+							multi)));
 
 		length = nextMXOffset - offset;
 	}
@@ -1427,7 +1498,7 @@ retry:
 
 		if (!TransactionIdIsValid(*xactptr))
 		{
-			/* Corner case 3: we must be looking at unused slot zero */
+			/* Corner case 2: we must be looking at unused slot zero */
 			Assert(offset == 0);
 			continue;
 		}
@@ -2056,24 +2127,34 @@ TrimMultiXact(void)
 	MultiXactOffsetCtl->shared->latest_page_number = pageno;
 
 	/*
+	 * Set the offset of the last multixact on the offsets page.
+	 *
+	 * This is normally done in RecordNewMultiXact() of the previous
+	 * multixact, but we used to not do that in older minor versions.  To
+	 * ensure that the next offset is set if the binary was just upgraded from
+	 * an older minor version, do it now.
+	 *
 	 * Zero out the remainder of the current offsets page.  See notes in
 	 * TrimCLOG() for background.  Unlike CLOG, some WAL record covers every
 	 * pg_multixact SLRU mutation.  Since, also unlike CLOG, we ignore the WAL
 	 * rule "write xlog before data," nextMXact successors may carry obsolete,
-	 * nonzero offset values.  Zero those so case 2 of GetMultiXactIdMembers()
-	 * operates normally.
+	 * nonzero offset values.
 	 */
 	entryno = MultiXactIdToOffsetEntry(nextMXact);
-	if (entryno != 0)
 	{
 		int			slotno;
 		MultiXactOffset *offptr;
 
-		slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, nextMXact);
+		if (entryno == 0)
+			slotno = SimpleLruZeroPage(MultiXactOffsetCtl, pageno);
+		else
+			slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, nextMXact);
 		offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
 		offptr += entryno;
 
-		MemSet(offptr, 0, BLCKSZ - (entryno * sizeof(MultiXactOffset)));
+		*offptr = offset;
+		if (entryno != 0 && (entryno + 1) * sizeof(MultiXactOffset) != BLCKSZ)
+			MemSet(offptr + 1, 0, BLCKSZ - (entryno + 1) * sizeof(MultiXactOffset));
 
 		MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
 	}
@@ -3255,13 +3336,21 @@ multixact_redo(XLogReaderState *record)
 
 		memcpy(&pageno, XLogRecGetData(record), sizeof(int));
 
-		LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
-
-		slotno = ZeroMultiXactOffsetPage(pageno, false);
-		SimpleLruWritePage(MultiXactOffsetCtl, slotno);
-		Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
-
-		LWLockRelease(MultiXactOffsetSLRULock);
+		/*
+		 * Skip the record if we already initialized the page at the previous
+		 * XLOG_MULTIXACT_CREATE_ID record. See RecordNewMultiXact().
+		 */
+		if (pre_initialized_offsets_page != pageno)
+		{
+			LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
+			slotno = ZeroMultiXactOffsetPage(pageno, false);
+			SimpleLruWritePage(MultiXactOffsetCtl, slotno);
+			Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
+			LWLockRelease(MultiXactOffsetSLRULock);
+		}
+		else
+			elog(DEBUG1, "skipping initialization of offsets page %d because it was already initialized on multixid creation", pageno);
+		pre_initialized_offsets_page = -1;
 	}
 	else if (info == XLOG_MULTIXACT_ZERO_MEM_PAGE)
 	{
@@ -3285,6 +3374,21 @@ multixact_redo(XLogReaderState *record)
 		TransactionId max_xid;
 		int			i;
 
+		if (pre_initialized_offsets_page != -1)
+		{
+			/*
+			 * If we implicitly initialized the next offsets page while
+			 * replaying a XLOG_MULTIXACT_CREATE_ID record that was generated
+			 * with an older minor version, we still expect to see a
+			 * XLOG_MULTIXACT_ZERO_OFF_PAGE record for it before any other
+			 * XLOG_MULTIXACT_CREATE_ID records.  Therefore this case should
+			 * not happen.  If it does, we'll continue with the replay, but
+			 * log a message to note that something's funny.
+			 */
+			elog(LOG, "expected to see a XLOG_MULTIXACT_ZERO_OFF_PAGE record for page that was implicitly initialized earlier");
+			pre_initialized_offsets_page = -1;
+		}
+
 		/* Store the data back into the SLRU files */
 		RecordNewMultiXact(xlrec->mid, xlrec->moff, xlrec->nmembers,
 						   xlrec->members);
-- 
2.47.3

From b88bc9227cb34607e5d66140cffb3a0f7a214379 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <[email protected]>
Date: Mon, 1 Dec 2025 14:13:54 +0200
Subject: [PATCH v15-pg16 1/1] Set next multixid's offset when creating a new
 multixid
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

With this commit, the next multixid's offset will always be set on the
offsets page, by the time that a backend might try to read it, so we
no longer need the waiting mechanism with the condition variable. In
other words, this eliminates "corner case 2" mentioned in the
comments.

The waiting mechanism was broken in a few scenarios:

- When nextMulti was advanced without WAL-logging the next
  multixid. For example, if a later multixid was already assigned and
  WAL-logged before the previous one was WAL-logged, and then the
  server crashed. In that case the next offset would never be set in
  the offsets SLRU, and a query trying to read it would get stuck
  waiting for it. Same thing could happen if pg_resetwal was used to
  forcibly advance nextMulti.

- In hot standby mode, a deadlock could happen where one backend waits
  for the next multixid assignment record, but WAL replay is not
  advancing because of a recovery conflict with the waiting backend.

The old TAP test used carefully placed injection points to exercise
the old waiting code, but now that the waiting code is gone, much of
the old test is no longer relevant. Rewrite the test to reproduce the
IPC/MultixactCreation hang after crash recovery instead, and to verify
that previously recorded multixids stay readable.

Backpatch to all supported versions. In back-branches, we still need
to be able to read WAL that was generated before this fix, so in the
back-branches this includes a hack to initialize the next offsets page
when replaying XLOG_MULTIXACT_CREATE_ID for the last multixid on a
page. On 'master', bump XLOG_PAGE_MAGIC instead to indicate that the
WAL is not compatible.

Author: Andrey Borodin <[email protected]>
Reviewed-by: Dmitry Yurichev <[email protected]>
Reviewed-by: Álvaro Herrera <[email protected]>
Reviewed-by: Kirill Reshke <[email protected]>
Reviewed-by: Ivan Bykov <[email protected]>
Reviewed-by: Chao Li <[email protected]>
Discussion: https://www.postgresql.org/message-id/[email protected]
Backpatch-through: 14
---
 src/backend/access/transam/multixact.c | 186 +++++++++++++++++++------
 1 file changed, 145 insertions(+), 41 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 3a2d7055c42..43fec0afcd0 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -338,6 +338,9 @@ static MemoryContext MXactContext = NULL;
 #define debug_elog6(a,b,c,d,e,f)
 #endif
 
+/* hack to deal with WAL generated with older minor versions */
+static int64 pre_initialized_offsets_page = -1;
+
 /* internal MultiXactId management */
 static void MultiXactIdSetOldestVisible(void);
 static void RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
@@ -869,13 +872,61 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
 	int			entryno;
 	int			slotno;
 	MultiXactOffset *offptr;
-	int			i;
+	MultiXactId next;
+	int64		next_pageno;
+	int			next_entryno;
+	MultiXactOffset *next_offptr;
 
 	LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
 
+	/* position of this multixid in the offsets SLRU area  */
 	pageno = MultiXactIdToOffsetPage(multi);
 	entryno = MultiXactIdToOffsetEntry(multi);
 
+	/* position of the next multixid */
+	next = multi + 1;
+	if (next < FirstMultiXactId)
+		next = FirstMultiXactId;
+	next_pageno = MultiXactIdToOffsetPage(next);
+	next_entryno = MultiXactIdToOffsetEntry(next);
+
+	/*
+	 * Older minor versions didn't set the next multixid's offset in this
+	 * function, and therefore didn't initialize the next page until the next
+	 * multixid was assigned.  If we're replaying WAL that was generated by
+	 * such a version, the next page might not be initialized yet.  Initialize
+	 * it now.
+	 */
+	if (InRecovery &&
+		next_pageno != pageno &&
+		MultiXactOffsetCtl->shared->latest_page_number == pageno)
+	{
+		elog(DEBUG1, "next offsets page is not initialized, initializing it now");
+
+		/* Create and zero the page */
+		slotno = SimpleLruZeroPage(MultiXactOffsetCtl, next_pageno);
+
+		/* Make sure it's written out */
+		SimpleLruWritePage(MultiXactOffsetCtl, slotno);
+		Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
+
+		/*
+		 * Remember that we initialized the page, so that we don't zero it
+		 * again at the XLOG_MULTIXACT_ZERO_OFF_PAGE record.
+		 */
+		pre_initialized_offsets_page = next_pageno;
+	}
+
+	/*
+	 * Set the starting offset of this multixid's members.
+	 *
+	 * In the common case, it was already be set by the previous
+	 * RecordNewMultiXact call, as this was the next multixid of the previous
+	 * multixid.  But if multiple backends are generating multixids
+	 * concurrently, we might race ahead and get called before the previous
+	 * multixid.
+	 */
+
 	/*
 	 * Note: we pass the MultiXactId to SimpleLruReadPage as the "transaction"
 	 * to complain about if there's any I/O error.  This is kinda bogus, but
@@ -887,9 +938,37 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
 	offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
 	offptr += entryno;
 
-	*offptr = offset;
+	if (*offptr != offset)
+	{
+		/* should already be set to the correct value, or not at all */
+		Assert(*offptr == 0);
+		*offptr = offset;
+		MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+	}
+
+	/*
+	 * Set the next multixid's offset to the end of this multixid's members.
+	 */
+	if (next_pageno == pageno)
+	{
+		next_offptr = offptr + 1;
+	}
+	else
+	{
+		/* must be the first entry on the page */
+		Assert(next_entryno == 0 || next == FirstMultiXactId);
+		slotno = SimpleLruReadPage(MultiXactOffsetCtl, next_pageno, true, next);
+		next_offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
+		next_offptr += next_entryno;
+	}
 
-	MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+	if (*next_offptr != offset + nmembers)
+	{
+		/* should already be set to the correct value, or not at all */
+		Assert(*next_offptr == 0);
+		*next_offptr = offset + nmembers;
+		MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+	}
 
 	/* Exchange our lock */
 	LWLockRelease(MultiXactOffsetSLRULock);
@@ -898,7 +977,7 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
 
 	prev_pageno = -1;
 
-	for (i = 0; i < nmembers; i++, offset++)
+	for (int i = 0; i < nmembers; i++, offset++)
 	{
 		TransactionId *memberptr;
 		uint32	   *flagsptr;
@@ -1073,8 +1152,11 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 			result = FirstMultiXactId;
 	}
 
-	/* Make sure there is room for the MXID in the file.  */
-	ExtendMultiXactOffset(result);
+	/*
+	 * Make sure there is room for the next MXID in the file.  Assigning this
+	 * MXID sets the next MXID's offset already.
+	 */
+	ExtendMultiXactOffset(result + 1);
 
 	/*
 	 * Reserve the members space, similarly to above.  Also, be careful not to
@@ -1315,21 +1397,14 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	 * one's.  However, there are some corner cases to worry about:
 	 *
 	 * 1. This multixact may be the latest one created, in which case there is
-	 * no next one to look at.  In this case the nextOffset value we just
-	 * saved is the correct endpoint.
+	 * no next one to look at.  The next multixact's offset should be set
+	 * already, as we set it in RecordNewMultiXact(), but we used to not do
+	 * that in older minor versions.  To cope with that case, if this
+	 * multixact is the latest one created, use the nextOffset value we read
+	 * above as the endpoint.
 	 *
-	 * 2. The next multixact may still be in process of being filled in: that
-	 * is, another process may have done GetNewMultiXactId but not yet written
-	 * the offset entry for that ID.  In that scenario, it is guaranteed that
-	 * the offset entry for that multixact exists (because GetNewMultiXactId
-	 * won't release MultiXactGenLock until it does) but contains zero
-	 * (because we are careful to pre-zero offset pages). Because
-	 * GetNewMultiXactId will never return zero as the starting offset for a
-	 * multixact, when we read zero as the next multixact's offset, we know we
-	 * have this case.  We sleep for a bit and try again.
-	 *
-	 * 3. Because GetNewMultiXactId increments offset zero to offset one to
-	 * handle case #2, there is an ambiguity near the point of offset
+	 * 2. Because GetNewMultiXactId skips over offset zero, to reserve zero
+	 * for to mean "unset", there is an ambiguity near the point of offset
 	 * wraparound.  If we see next multixact's offset is one, is that our
 	 * multixact's actual endpoint, or did it end at zero with a subsequent
 	 * increment?  We handle this using the knowledge that if the zero'th
@@ -1341,7 +1416,6 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	 * cases, so it seems better than holding the MultiXactGenLock for a long
 	 * time on every multixact creation.
 	 */
-retry:
 	LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
 
 	pageno = MultiXactIdToOffsetPage(multi);
@@ -1386,13 +1460,10 @@ retry:
 		nextMXOffset = *offptr;
 
 		if (nextMXOffset == 0)
-		{
-			/* Corner case 2: next multixact is still being filled in */
-			LWLockRelease(MultiXactOffsetSLRULock);
-			CHECK_FOR_INTERRUPTS();
-			pg_usleep(1000L);
-			goto retry;
-		}
+			ereport(ERROR,
+					(errcode(ERRCODE_DATA_CORRUPTED),
+					 errmsg("MultiXact %u has invalid next offset",
+							multi)));
 
 		length = nextMXOffset - offset;
 	}
@@ -1428,7 +1499,7 @@ retry:
 
 		if (!TransactionIdIsValid(*xactptr))
 		{
-			/* Corner case 3: we must be looking at unused slot zero */
+			/* Corner case 2: we must be looking at unused slot zero */
 			Assert(offset == 0);
 			continue;
 		}
@@ -2055,24 +2126,34 @@ TrimMultiXact(void)
 	MultiXactOffsetCtl->shared->latest_page_number = pageno;
 
 	/*
+	 * Set the offset of the last multixact on the offsets page.
+	 *
+	 * This is normally done in RecordNewMultiXact() of the previous
+	 * multixact, but we used to not do that in older minor versions.  To
+	 * ensure that the next offset is set if the binary was just upgraded from
+	 * an older minor version, do it now.
+	 *
 	 * Zero out the remainder of the current offsets page.  See notes in
 	 * TrimCLOG() for background.  Unlike CLOG, some WAL record covers every
 	 * pg_multixact SLRU mutation.  Since, also unlike CLOG, we ignore the WAL
 	 * rule "write xlog before data," nextMXact successors may carry obsolete,
-	 * nonzero offset values.  Zero those so case 2 of GetMultiXactIdMembers()
-	 * operates normally.
+	 * nonzero offset values.
 	 */
 	entryno = MultiXactIdToOffsetEntry(nextMXact);
-	if (entryno != 0)
 	{
 		int			slotno;
 		MultiXactOffset *offptr;
 
-		slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, nextMXact);
+		if (entryno == 0)
+			slotno = SimpleLruZeroPage(MultiXactOffsetCtl, pageno);
+		else
+			slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, nextMXact);
 		offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
 		offptr += entryno;
 
-		MemSet(offptr, 0, BLCKSZ - (entryno * sizeof(MultiXactOffset)));
+		*offptr = offset;
+		if (entryno != 0 && (entryno + 1) * sizeof(MultiXactOffset) != BLCKSZ)
+			MemSet(offptr + 1, 0, BLCKSZ - (entryno + 1) * sizeof(MultiXactOffset));
 
 		MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
 	}
@@ -3251,13 +3332,21 @@ multixact_redo(XLogReaderState *record)
 
 		memcpy(&pageno, XLogRecGetData(record), sizeof(int));
 
-		LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
-
-		slotno = ZeroMultiXactOffsetPage(pageno, false);
-		SimpleLruWritePage(MultiXactOffsetCtl, slotno);
-		Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
-
-		LWLockRelease(MultiXactOffsetSLRULock);
+		/*
+		 * Skip the record if we already initialized the page at the previous
+		 * XLOG_MULTIXACT_CREATE_ID record. See RecordNewMultiXact().
+		 */
+		if (pre_initialized_offsets_page != pageno)
+		{
+			LWLockAcquire(MultiXactOffsetSLRULock, LW_EXCLUSIVE);
+			slotno = ZeroMultiXactOffsetPage(pageno, false);
+			SimpleLruWritePage(MultiXactOffsetCtl, slotno);
+			Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
+			LWLockRelease(MultiXactOffsetSLRULock);
+		}
+		else
+			elog(DEBUG1, "skipping initialization of offsets page %d because it was already initialized on multixid creation", pageno);
+		pre_initialized_offsets_page = -1;
 	}
 	else if (info == XLOG_MULTIXACT_ZERO_MEM_PAGE)
 	{
@@ -3281,6 +3370,21 @@ multixact_redo(XLogReaderState *record)
 		TransactionId max_xid;
 		int			i;
 
+		if (pre_initialized_offsets_page != -1)
+		{
+			/*
+			 * If we implicitly initialized the next offsets page while
+			 * replaying a XLOG_MULTIXACT_CREATE_ID record that was generated
+			 * with an older minor version, we still expect to see a
+			 * XLOG_MULTIXACT_ZERO_OFF_PAGE record for it before any other
+			 * XLOG_MULTIXACT_CREATE_ID records.  Therefore this case should
+			 * not happen.  If it does, we'll continue with the replay, but
+			 * log a message to note that something's funny.
+			 */
+			elog(LOG, "expected to see a XLOG_MULTIXACT_ZERO_OFF_PAGE record for page that was implicitly initialized earlier");
+			pre_initialized_offsets_page = -1;
+		}
+
 		/* Store the data back into the SLRU files */
 		RecordNewMultiXact(xlrec->mid, xlrec->moff, xlrec->nmembers,
 						   xlrec->members);
-- 
2.47.3

From b12c8cba2d0b5071bdb25dc3353ff6fd7b55f8c5 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <[email protected]>
Date: Mon, 1 Dec 2025 14:13:40 +0200
Subject: [PATCH v15-pg17 1/1] Set next multixid's offset when creating a new
 multixid
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

With this commit, the next multixid's offset will always be set on the
offsets page, by the time that a backend might try to read it, so we
no longer need the waiting mechanism with the condition variable. In
other words, this eliminates "corner case 2" mentioned in the
comments.

The waiting mechanism was broken in a few scenarios:

- When nextMulti was advanced without WAL-logging the next
  multixid. For example, if a later multixid was already assigned and
  WAL-logged before the previous one was WAL-logged, and then the
  server crashed. In that case the next offset would never be set in
  the offsets SLRU, and a query trying to read it would get stuck
  waiting for it. Same thing could happen if pg_resetwal was used to
  forcibly advance nextMulti.

- In hot standby mode, a deadlock could happen where one backend waits
  for the next multixid assignment record, but WAL replay is not
  advancing because of a recovery conflict with the waiting backend.

The old TAP test used carefully placed injection points to exercise
the old waiting code, but now that the waiting code is gone, much of
the old test is no longer relevant. Rewrite the test to reproduce the
IPC/MultixactCreation hang after crash recovery instead, and to verify
that previously recorded multixids stay readable.

Backpatch to all supported versions. In back-branches, we still need
to be able to read WAL that was generated before this fix, so in the
back-branches this includes a hack to initialize the next offsets page
when replaying XLOG_MULTIXACT_CREATE_ID for the last multixid on a
page. On 'master', bump XLOG_PAGE_MAGIC instead to indicate that the
WAL is not compatible.

Author: Andrey Borodin <[email protected]>
Reviewed-by: Dmitry Yurichev <[email protected]>
Reviewed-by: Álvaro Herrera <[email protected]>
Reviewed-by: Kirill Reshke <[email protected]>
Reviewed-by: Ivan Bykov <[email protected]>
Reviewed-by: Chao Li <[email protected]>
Discussion: https://www.postgresql.org/message-id/[email protected]
Backpatch-through: 14
---
 src/backend/access/transam/multixact.c | 224 ++++++++++++++++++-------
 1 file changed, 159 insertions(+), 65 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index b7b47ef076a..2f37ad88190 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -274,12 +274,6 @@ typedef struct MultiXactStateData
 	/* support for members anti-wraparound measures */
 	MultiXactOffset offsetStopLimit;	/* known if oldestOffsetKnown */
 
-	/*
-	 * This is used to sleep until a multixact offset is written when we want
-	 * to create the next one.
-	 */
-	ConditionVariable nextoff_cv;
-
 	/*
 	 * Per-backend data starts here.  We have two arrays stored in the area
 	 * immediately following the MultiXactStateData struct. Each is indexed by
@@ -384,6 +378,9 @@ static MemoryContext MXactContext = NULL;
 #define debug_elog6(a,b,c,d,e,f)
 #endif
 
+/* hack to deal with WAL generated with older minor versions */
+static int64 pre_initialized_offsets_page = -1;
+
 /* internal MultiXactId management */
 static void MultiXactIdSetOldestVisible(void);
 static void RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
@@ -915,13 +912,68 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
 	int			entryno;
 	int			slotno;
 	MultiXactOffset *offptr;
-	int			i;
+	MultiXactId next;
+	int64		next_pageno;
+	int			next_entryno;
+	MultiXactOffset *next_offptr;
 	LWLock	   *lock;
 	LWLock	   *prevlock = NULL;
 
+	/* position of this multixid in the offsets SLRU area  */
 	pageno = MultiXactIdToOffsetPage(multi);
 	entryno = MultiXactIdToOffsetEntry(multi);
 
+	/* position of the next multixid */
+	next = multi + 1;
+	if (next < FirstMultiXactId)
+		next = FirstMultiXactId;
+	next_pageno = MultiXactIdToOffsetPage(next);
+	next_entryno = MultiXactIdToOffsetEntry(next);
+
+	/*
+	 * Older minor versions didn't set the next multixid's offset in this
+	 * function, and therefore didn't initialize the next page until the next
+	 * multixid was assigned.  If we're replaying WAL that was generated by
+	 * such a version, the next page might not be initialized yet.  Initialize
+	 * it now.
+	 */
+	if (InRecovery &&
+		next_pageno != pageno &&
+		pg_atomic_read_u64(&MultiXactOffsetCtl->shared->latest_page_number) == pageno)
+	{
+		int			slotno;
+		LWLock	   *lock;
+
+		elog(DEBUG1, "next offsets page is not initialized, initializing it now");
+
+		lock = SimpleLruGetBankLock(MultiXactOffsetCtl, next_pageno);
+		LWLockAcquire(lock, LW_EXCLUSIVE);
+
+		/* Create and zero the page */
+		slotno = SimpleLruZeroPage(MultiXactOffsetCtl, next_pageno);
+
+		/* Make sure it's written out */
+		SimpleLruWritePage(MultiXactOffsetCtl, slotno);
+		Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
+
+		LWLockRelease(lock);
+
+		/*
+		 * Remember that we initialized the page, so that we don't zero it
+		 * again at the XLOG_MULTIXACT_ZERO_OFF_PAGE record.
+		 */
+		pre_initialized_offsets_page = next_pageno;
+	}
+
+	/*
+	 * Set the starting offset of this multixid's members.
+	 *
+	 * In the common case, it was already be set by the previous
+	 * RecordNewMultiXact call, as this was the next multixid of the previous
+	 * multixid.  But if multiple backends are generating multixids
+	 * concurrently, we might race ahead and get called before the previous
+	 * multixid.
+	 */
 	lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
 	LWLockAcquire(lock, LW_EXCLUSIVE);
 
@@ -936,22 +988,50 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
 	offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
 	offptr += entryno;
 
-	*offptr = offset;
+	if (*offptr != offset)
+	{
+		/* should already be set to the correct value, or not at all */
+		Assert(*offptr == 0);
+		*offptr = offset;
+		MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+	}
+
+	/*
+	 * Set the next multixid's offset to the end of this multixid's members.
+	 */
+	if (next_pageno == pageno)
+	{
+		next_offptr = offptr + 1;
+	}
+	else
+	{
+		/* must be the first entry on the page */
+		Assert(next_entryno == 0 || next == FirstMultiXactId);
 
-	MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+		/* Swap the lock for a lock on the next page */
+		LWLockRelease(lock);
+		lock = SimpleLruGetBankLock(MultiXactOffsetCtl, next_pageno);
+		LWLockAcquire(lock, LW_EXCLUSIVE);
+
+		slotno = SimpleLruReadPage(MultiXactOffsetCtl, next_pageno, true, next);
+		next_offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
+		next_offptr += next_entryno;
+	}
+
+	if (*next_offptr != offset + nmembers)
+	{
+		/* should already be set to the correct value, or not at all */
+		Assert(*next_offptr == 0);
+		*next_offptr = offset + nmembers;
+		MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+	}
 
 	/* Release MultiXactOffset SLRU lock. */
 	LWLockRelease(lock);
 
-	/*
-	 * If anybody was waiting to know the offset of this multixact ID we just
-	 * wrote, they can read it now, so wake them up.
-	 */
-	ConditionVariableBroadcast(&MultiXactState->nextoff_cv);
-
 	prev_pageno = -1;
 
-	for (i = 0; i < nmembers; i++, offset++)
+	for (int i = 0; i < nmembers; i++, offset++)
 	{
 		TransactionId *memberptr;
 		uint32	   *flagsptr;
@@ -1141,8 +1221,11 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 			result = FirstMultiXactId;
 	}
 
-	/* Make sure there is room for the MXID in the file.  */
-	ExtendMultiXactOffset(result);
+	/*
+	 * Make sure there is room for the next MXID in the file.  Assigning this
+	 * MXID sets the next MXID's offset already.
+	 */
+	ExtendMultiXactOffset(result + 1);
 
 	/*
 	 * Reserve the members space, similarly to above.  Also, be careful not to
@@ -1307,7 +1390,6 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	MultiXactOffset nextOffset;
 	MultiXactMember *ptr;
 	LWLock	   *lock;
-	bool		slept = false;
 
 	debug_elog3(DEBUG2, "GetMembers: asked for %u", multi);
 
@@ -1384,23 +1466,14 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	 * one's.  However, there are some corner cases to worry about:
 	 *
 	 * 1. This multixact may be the latest one created, in which case there is
-	 * no next one to look at.  In this case the nextOffset value we just
-	 * saved is the correct endpoint.
-	 *
-	 * 2. The next multixact may still be in process of being filled in: that
-	 * is, another process may have done GetNewMultiXactId but not yet written
-	 * the offset entry for that ID.  In that scenario, it is guaranteed that
-	 * the offset entry for that multixact exists (because GetNewMultiXactId
-	 * won't release MultiXactGenLock until it does) but contains zero
-	 * (because we are careful to pre-zero offset pages). Because
-	 * GetNewMultiXactId will never return zero as the starting offset for a
-	 * multixact, when we read zero as the next multixact's offset, we know we
-	 * have this case.  We handle this by sleeping on the condition variable
-	 * we have just for this; the process in charge will signal the CV as soon
-	 * as it has finished writing the multixact offset.
+	 * no next one to look at.  The next multixact's offset should be set
+	 * already, as we set it in RecordNewMultiXact(), but we used to not do
+	 * that in older minor versions.  To cope with that case, if this
+	 * multixact is the latest one created, use the nextOffset value we read
+	 * above as the endpoint.
 	 *
-	 * 3. Because GetNewMultiXactId increments offset zero to offset one to
-	 * handle case #2, there is an ambiguity near the point of offset
+	 * 2. Because GetNewMultiXactId skips over offset zero, to reserve zero
+	 * for to mean "unset", there is an ambiguity near the point of offset
 	 * wraparound.  If we see next multixact's offset is one, is that our
 	 * multixact's actual endpoint, or did it end at zero with a subsequent
 	 * increment?  We handle this using the knowledge that if the zero'th
@@ -1412,7 +1485,6 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	 * cases, so it seems better than holding the MultiXactGenLock for a long
 	 * time on every multixact creation.
 	 */
-retry:
 	pageno = MultiXactIdToOffsetPage(multi);
 	entryno = MultiXactIdToOffsetEntry(multi);
 
@@ -1475,16 +1547,10 @@ retry:
 		nextMXOffset = *offptr;
 
 		if (nextMXOffset == 0)
-		{
-			/* Corner case 2: next multixact is still being filled in */
-			LWLockRelease(lock);
-			CHECK_FOR_INTERRUPTS();
-
-			ConditionVariableSleep(&MultiXactState->nextoff_cv,
-								   WAIT_EVENT_MULTIXACT_CREATION);
-			slept = true;
-			goto retry;
-		}
+			ereport(ERROR,
+					(errcode(ERRCODE_DATA_CORRUPTED),
+					 errmsg("MultiXact %u has invalid next offset",
+							multi)));
 
 		length = nextMXOffset - offset;
 	}
@@ -1492,12 +1558,6 @@ retry:
 	LWLockRelease(lock);
 	lock = NULL;
 
-	/*
-	 * If we slept above, clean up state; it's no longer needed.
-	 */
-	if (slept)
-		ConditionVariableCancelSleep();
-
 	ptr = (MultiXactMember *) palloc(length * sizeof(MultiXactMember));
 
 	truelength = 0;
@@ -1540,7 +1600,7 @@ retry:
 
 		if (!TransactionIdIsValid(*xactptr))
 		{
-			/* Corner case 3: we must be looking at unused slot zero */
+			/* Corner case 2: we must be looking at unused slot zero */
 			Assert(offset == 0);
 			continue;
 		}
@@ -1987,7 +2047,6 @@ MultiXactShmemInit(void)
 
 		/* Make sure we zero out the per-backend state */
 		MemSet(MultiXactState, 0, SHARED_MULTIXACT_STATE_SIZE);
-		ConditionVariableInit(&MultiXactState->nextoff_cv);
 	}
 	else
 		Assert(found);
@@ -2194,26 +2253,36 @@ TrimMultiXact(void)
 						pageno);
 
 	/*
+	 * Set the offset of the last multixact on the offsets page.
+	 *
+	 * This is normally done in RecordNewMultiXact() of the previous
+	 * multixact, but we used to not do that in older minor versions.  To
+	 * ensure that the next offset is set if the binary was just upgraded from
+	 * an older minor version, do it now.
+	 *
 	 * Zero out the remainder of the current offsets page.  See notes in
 	 * TrimCLOG() for background.  Unlike CLOG, some WAL record covers every
 	 * pg_multixact SLRU mutation.  Since, also unlike CLOG, we ignore the WAL
 	 * rule "write xlog before data," nextMXact successors may carry obsolete,
-	 * nonzero offset values.  Zero those so case 2 of GetMultiXactIdMembers()
-	 * operates normally.
+	 * nonzero offset values.
 	 */
 	entryno = MultiXactIdToOffsetEntry(nextMXact);
-	if (entryno != 0)
 	{
 		int			slotno;
 		MultiXactOffset *offptr;
 		LWLock	   *lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
 
 		LWLockAcquire(lock, LW_EXCLUSIVE);
-		slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, nextMXact);
+		if (entryno == 0)
+			slotno = SimpleLruZeroPage(MultiXactOffsetCtl, pageno);
+		else
+			slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, nextMXact);
 		offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
 		offptr += entryno;
 
-		MemSet(offptr, 0, BLCKSZ - (entryno * sizeof(MultiXactOffset)));
+		*offptr = offset;
+		if (entryno != 0 && (entryno + 1) * sizeof(MultiXactOffset) != BLCKSZ)
+			MemSet(offptr + 1, 0, BLCKSZ - (entryno + 1) * sizeof(MultiXactOffset));
 
 		MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
 		LWLockRelease(lock);
@@ -3398,14 +3467,24 @@ multixact_redo(XLogReaderState *record)
 
 		memcpy(&pageno, XLogRecGetData(record), sizeof(pageno));
 
-		lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
-		LWLockAcquire(lock, LW_EXCLUSIVE);
+		/*
+		 * Skip the record if we already initialized the page at the previous
+		 * XLOG_MULTIXACT_CREATE_ID record. See RecordNewMultiXact().
+		 */
+		if (pre_initialized_offsets_page != pageno)
+		{
+			lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
+			LWLockAcquire(lock, LW_EXCLUSIVE);
 
-		slotno = ZeroMultiXactOffsetPage(pageno, false);
-		SimpleLruWritePage(MultiXactOffsetCtl, slotno);
-		Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
+			slotno = ZeroMultiXactOffsetPage(pageno, false);
+			SimpleLruWritePage(MultiXactOffsetCtl, slotno);
+			Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
 
-		LWLockRelease(lock);
+			LWLockRelease(lock);
+		}
+		else
+			elog(DEBUG1, "skipping initialization of offsets page " INT64_FORMAT " because it was already initialized on multixid creation", pageno);
+		pre_initialized_offsets_page = -1;
 	}
 	else if (info == XLOG_MULTIXACT_ZERO_MEM_PAGE)
 	{
@@ -3431,6 +3510,21 @@ multixact_redo(XLogReaderState *record)
 		TransactionId max_xid;
 		int			i;
 
+		if (pre_initialized_offsets_page != -1)
+		{
+			/*
+			 * If we implicitly initialized the next offsets page while
+			 * replaying a XLOG_MULTIXACT_CREATE_ID record that was generated
+			 * with an older minor version, we still expect to see a
+			 * XLOG_MULTIXACT_ZERO_OFF_PAGE record for it before any other
+			 * XLOG_MULTIXACT_CREATE_ID records.  Therefore this case should
+			 * not happen.  If it does, we'll continue with the replay, but
+			 * log a message to note that something's funny.
+			 */
+			elog(LOG, "expected to see a XLOG_MULTIXACT_ZERO_OFF_PAGE record for page that was implicitly initialized earlier");
+			pre_initialized_offsets_page = -1;
+		}
+
 		/* Store the data back into the SLRU files */
 		RecordNewMultiXact(xlrec->mid, xlrec->moff, xlrec->nmembers,
 						   xlrec->members);
-- 
2.47.3

From 17efc1fef539672e064c30ea2c809b4f06942354 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <[email protected]>
Date: Mon, 1 Dec 2025 14:13:09 +0200
Subject: [PATCH v15-pg18 1/1] Set next multixid's offset when creating a new
 multixid
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

With this commit, the next multixid's offset will always be set on the
offsets page, by the time that a backend might try to read it, so we
no longer need the waiting mechanism with the condition variable. In
other words, this eliminates "corner case 2" mentioned in the
comments.

The waiting mechanism was broken in a few scenarios:

- When nextMulti was advanced without WAL-logging the next
  multixid. For example, if a later multixid was already assigned and
  WAL-logged before the previous one was WAL-logged, and then the
  server crashed. In that case the next offset would never be set in
  the offsets SLRU, and a query trying to read it would get stuck
  waiting for it. Same thing could happen if pg_resetwal was used to
  forcibly advance nextMulti.

- In hot standby mode, a deadlock could happen where one backend waits
  for the next multixid assignment record, but WAL replay is not
  advancing because of a recovery conflict with the waiting backend.

The old TAP test used carefully placed injection points to exercise
the old waiting code, but now that the waiting code is gone, much of
the old test is no longer relevant. Rewrite the test to reproduce the
IPC/MultixactCreation hang after crash recovery instead, and to verify
that previously recorded multixids stay readable.

Backpatch to all supported versions. In back-branches, we still need
to be able to read WAL that was generated before this fix, so in the
back-branches this includes a hack to initialize the next offsets page
when replaying XLOG_MULTIXACT_CREATE_ID for the last multixid on a
page. On 'master', bump XLOG_PAGE_MAGIC instead to indicate that the
WAL is not compatible.

Author: Andrey Borodin <[email protected]>
Reviewed-by: Dmitry Yurichev <[email protected]>
Reviewed-by: Álvaro Herrera <[email protected]>
Reviewed-by: Kirill Reshke <[email protected]>
Reviewed-by: Ivan Bykov <[email protected]>
Reviewed-by: Chao Li <[email protected]>
Discussion: https://www.postgresql.org/message-id/[email protected]
Backpatch-through: 14
---
 src/backend/access/transam/multixact.c        | 227 ++++++++++++------
 src/test/modules/test_slru/t/001_multixact.pl | 116 +++------
 src/test/modules/test_slru/test_multixact.c   |   5 +-
 3 files changed, 191 insertions(+), 157 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index f94445bdd07..e69ce955bec 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -84,7 +84,6 @@
 #include "pg_trace.h"
 #include "pgstat.h"
 #include "postmaster/autovacuum.h"
-#include "storage/condition_variable.h"
 #include "storage/pmsignal.h"
 #include "storage/proc.h"
 #include "storage/procarray.h"
@@ -276,12 +275,6 @@ typedef struct MultiXactStateData
 	/* support for members anti-wraparound measures */
 	MultiXactOffset offsetStopLimit;	/* known if oldestOffsetKnown */
 
-	/*
-	 * This is used to sleep until a multixact offset is written when we want
-	 * to create the next one.
-	 */
-	ConditionVariable nextoff_cv;
-
 	/*
 	 * Per-backend data starts here.  We have two arrays stored in the area
 	 * immediately following the MultiXactStateData struct. Each is indexed by
@@ -386,6 +379,9 @@ static MemoryContext MXactContext = NULL;
 #define debug_elog6(a,b,c,d,e,f)
 #endif
 
+/* hack to deal with WAL generated with older minor versions */
+static int64 pre_initialized_offsets_page = -1;
+
 /* internal MultiXactId management */
 static void MultiXactIdSetOldestVisible(void);
 static void RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
@@ -922,13 +918,68 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
 	int			entryno;
 	int			slotno;
 	MultiXactOffset *offptr;
-	int			i;
+	MultiXactId next;
+	int64		next_pageno;
+	int			next_entryno;
+	MultiXactOffset *next_offptr;
 	LWLock	   *lock;
 	LWLock	   *prevlock = NULL;
 
+	/* position of this multixid in the offsets SLRU area  */
 	pageno = MultiXactIdToOffsetPage(multi);
 	entryno = MultiXactIdToOffsetEntry(multi);
 
+	/* position of the next multixid */
+	next = multi + 1;
+	if (next < FirstMultiXactId)
+		next = FirstMultiXactId;
+	next_pageno = MultiXactIdToOffsetPage(next);
+	next_entryno = MultiXactIdToOffsetEntry(next);
+
+	/*
+	 * Older minor versions didn't set the next multixid's offset in this
+	 * function, and therefore didn't initialize the next page until the next
+	 * multixid was assigned.  If we're replaying WAL that was generated by
+	 * such a version, the next page might not be initialized yet.  Initialize
+	 * it now.
+	 */
+	if (InRecovery &&
+		next_pageno != pageno &&
+		pg_atomic_read_u64(&MultiXactOffsetCtl->shared->latest_page_number) == pageno)
+	{
+		int			slotno;
+		LWLock	   *lock;
+
+		elog(DEBUG1, "next offsets page is not initialized, initializing it now");
+
+		lock = SimpleLruGetBankLock(MultiXactOffsetCtl, next_pageno);
+		LWLockAcquire(lock, LW_EXCLUSIVE);
+
+		/* Create and zero the page */
+		slotno = SimpleLruZeroPage(MultiXactOffsetCtl, next_pageno);
+
+		/* Make sure it's written out */
+		SimpleLruWritePage(MultiXactOffsetCtl, slotno);
+		Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
+
+		LWLockRelease(lock);
+
+		/*
+		 * Remember that we initialized the page, so that we don't zero it
+		 * again at the XLOG_MULTIXACT_ZERO_OFF_PAGE record.
+		 */
+		pre_initialized_offsets_page = next_pageno;
+	}
+
+	/*
+	 * Set the starting offset of this multixid's members.
+	 *
+	 * In the common case, it was already be set by the previous
+	 * RecordNewMultiXact call, as this was the next multixid of the previous
+	 * multixid.  But if multiple backends are generating multixids
+	 * concurrently, we might race ahead and get called before the previous
+	 * multixid.
+	 */
 	lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
 	LWLockAcquire(lock, LW_EXCLUSIVE);
 
@@ -943,22 +994,50 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
 	offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
 	offptr += entryno;
 
-	*offptr = offset;
+	if (*offptr != offset)
+	{
+		/* should already be set to the correct value, or not at all */
+		Assert(*offptr == 0);
+		*offptr = offset;
+		MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+	}
+
+	/*
+	 * Set the next multixid's offset to the end of this multixid's members.
+	 */
+	if (next_pageno == pageno)
+	{
+		next_offptr = offptr + 1;
+	}
+	else
+	{
+		/* must be the first entry on the page */
+		Assert(next_entryno == 0 || next == FirstMultiXactId);
+
+		/* Swap the lock for a lock on the next page */
+		LWLockRelease(lock);
+		lock = SimpleLruGetBankLock(MultiXactOffsetCtl, next_pageno);
+		LWLockAcquire(lock, LW_EXCLUSIVE);
+
+		slotno = SimpleLruReadPage(MultiXactOffsetCtl, next_pageno, true, next);
+		next_offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
+		next_offptr += next_entryno;
+	}
 
-	MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+	if (*next_offptr != offset + nmembers)
+	{
+		/* should already be set to the correct value, or not at all */
+		Assert(*next_offptr == 0);
+		*next_offptr = offset + nmembers;
+		MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
+	}
 
 	/* Release MultiXactOffset SLRU lock. */
 	LWLockRelease(lock);
 
-	/*
-	 * If anybody was waiting to know the offset of this multixact ID we just
-	 * wrote, they can read it now, so wake them up.
-	 */
-	ConditionVariableBroadcast(&MultiXactState->nextoff_cv);
-
 	prev_pageno = -1;
 
-	for (i = 0; i < nmembers; i++, offset++)
+	for (int i = 0; i < nmembers; i++, offset++)
 	{
 		TransactionId *memberptr;
 		uint32	   *flagsptr;
@@ -1148,8 +1227,11 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 			result = FirstMultiXactId;
 	}
 
-	/* Make sure there is room for the MXID in the file.  */
-	ExtendMultiXactOffset(result);
+	/*
+	 * Make sure there is room for the next MXID in the file.  Assigning this
+	 * MXID sets the next MXID's offset already.
+	 */
+	ExtendMultiXactOffset(result + 1);
 
 	/*
 	 * Reserve the members space, similarly to above.  Also, be careful not to
@@ -1314,7 +1396,6 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	MultiXactOffset nextOffset;
 	MultiXactMember *ptr;
 	LWLock	   *lock;
-	bool		slept = false;
 
 	debug_elog3(DEBUG2, "GetMembers: asked for %u", multi);
 
@@ -1391,23 +1472,14 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	 * one's.  However, there are some corner cases to worry about:
 	 *
 	 * 1. This multixact may be the latest one created, in which case there is
-	 * no next one to look at.  In this case the nextOffset value we just
-	 * saved is the correct endpoint.
-	 *
-	 * 2. The next multixact may still be in process of being filled in: that
-	 * is, another process may have done GetNewMultiXactId but not yet written
-	 * the offset entry for that ID.  In that scenario, it is guaranteed that
-	 * the offset entry for that multixact exists (because GetNewMultiXactId
-	 * won't release MultiXactGenLock until it does) but contains zero
-	 * (because we are careful to pre-zero offset pages). Because
-	 * GetNewMultiXactId will never return zero as the starting offset for a
-	 * multixact, when we read zero as the next multixact's offset, we know we
-	 * have this case.  We handle this by sleeping on the condition variable
-	 * we have just for this; the process in charge will signal the CV as soon
-	 * as it has finished writing the multixact offset.
+	 * no next one to look at.  The next multixact's offset should be set
+	 * already, as we set it in RecordNewMultiXact(), but we used to not do
+	 * that in older minor versions.  To cope with that case, if this
+	 * multixact is the latest one created, use the nextOffset value we read
+	 * above as the endpoint.
 	 *
-	 * 3. Because GetNewMultiXactId increments offset zero to offset one to
-	 * handle case #2, there is an ambiguity near the point of offset
+	 * 2. Because GetNewMultiXactId skips over offset zero, to reserve zero
+	 * for to mean "unset", there is an ambiguity near the point of offset
 	 * wraparound.  If we see next multixact's offset is one, is that our
 	 * multixact's actual endpoint, or did it end at zero with a subsequent
 	 * increment?  We handle this using the knowledge that if the zero'th
@@ -1419,7 +1491,6 @@ GetMultiXactIdMembers(MultiXactId multi, MultiXactMember **members,
 	 * cases, so it seems better than holding the MultiXactGenLock for a long
 	 * time on every multixact creation.
 	 */
-retry:
 	pageno = MultiXactIdToOffsetPage(multi);
 	entryno = MultiXactIdToOffsetEntry(multi);
 
@@ -1482,18 +1553,10 @@ retry:
 		nextMXOffset = *offptr;
 
 		if (nextMXOffset == 0)
-		{
-			/* Corner case 2: next multixact is still being filled in */
-			LWLockRelease(lock);
-			CHECK_FOR_INTERRUPTS();
-
-			INJECTION_POINT("multixact-get-members-cv-sleep", NULL);
-
-			ConditionVariableSleep(&MultiXactState->nextoff_cv,
-								   WAIT_EVENT_MULTIXACT_CREATION);
-			slept = true;
-			goto retry;
-		}
+			ereport(ERROR,
+					(errcode(ERRCODE_DATA_CORRUPTED),
+					 errmsg("MultiXact %u has invalid next offset",
+							multi)));
 
 		length = nextMXOffset - offset;
 	}
@@ -1501,12 +1564,6 @@ retry:
 	LWLockRelease(lock);
 	lock = NULL;
 
-	/*
-	 * If we slept above, clean up state; it's no longer needed.
-	 */
-	if (slept)
-		ConditionVariableCancelSleep();
-
 	ptr = (MultiXactMember *) palloc(length * sizeof(MultiXactMember));
 
 	truelength = 0;
@@ -1549,7 +1606,7 @@ retry:
 
 		if (!TransactionIdIsValid(*xactptr))
 		{
-			/* Corner case 3: we must be looking at unused slot zero */
+			/* Corner case 2: we must be looking at unused slot zero */
 			Assert(offset == 0);
 			continue;
 		}
@@ -1996,7 +2053,6 @@ MultiXactShmemInit(void)
 
 		/* Make sure we zero out the per-backend state */
 		MemSet(MultiXactState, 0, SHARED_MULTIXACT_STATE_SIZE);
-		ConditionVariableInit(&MultiXactState->nextoff_cv);
 	}
 	else
 		Assert(found);
@@ -2203,26 +2259,36 @@ TrimMultiXact(void)
 						pageno);
 
 	/*
+	 * Set the offset of the last multixact on the offsets page.
+	 *
+	 * This is normally done in RecordNewMultiXact() of the previous
+	 * multixact, but we used to not do that in older minor versions.  To
+	 * ensure that the next offset is set if the binary was just upgraded from
+	 * an older minor version, do it now.
+	 *
 	 * Zero out the remainder of the current offsets page.  See notes in
 	 * TrimCLOG() for background.  Unlike CLOG, some WAL record covers every
 	 * pg_multixact SLRU mutation.  Since, also unlike CLOG, we ignore the WAL
 	 * rule "write xlog before data," nextMXact successors may carry obsolete,
-	 * nonzero offset values.  Zero those so case 2 of GetMultiXactIdMembers()
-	 * operates normally.
+	 * nonzero offset values.
 	 */
 	entryno = MultiXactIdToOffsetEntry(nextMXact);
-	if (entryno != 0)
 	{
 		int			slotno;
 		MultiXactOffset *offptr;
 		LWLock	   *lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
 
 		LWLockAcquire(lock, LW_EXCLUSIVE);
-		slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, nextMXact);
+		if (entryno == 0)
+			slotno = SimpleLruZeroPage(MultiXactOffsetCtl, pageno);
+		else
+			slotno = SimpleLruReadPage(MultiXactOffsetCtl, pageno, true, nextMXact);
 		offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
 		offptr += entryno;
 
-		MemSet(offptr, 0, BLCKSZ - (entryno * sizeof(MultiXactOffset)));
+		*offptr = offset;
+		if (entryno != 0 && (entryno + 1) * sizeof(MultiXactOffset) != BLCKSZ)
+			MemSet(offptr + 1, 0, BLCKSZ - (entryno + 1) * sizeof(MultiXactOffset));
 
 		MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
 		LWLockRelease(lock);
@@ -3407,14 +3473,24 @@ multixact_redo(XLogReaderState *record)
 
 		memcpy(&pageno, XLogRecGetData(record), sizeof(pageno));
 
-		lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
-		LWLockAcquire(lock, LW_EXCLUSIVE);
+		/*
+		 * Skip the record if we already initialized the page at the previous
+		 * XLOG_MULTIXACT_CREATE_ID record. See RecordNewMultiXact().
+		 */
+		if (pre_initialized_offsets_page != pageno)
+		{
+			lock = SimpleLruGetBankLock(MultiXactOffsetCtl, pageno);
+			LWLockAcquire(lock, LW_EXCLUSIVE);
 
-		slotno = ZeroMultiXactOffsetPage(pageno, false);
-		SimpleLruWritePage(MultiXactOffsetCtl, slotno);
-		Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
+			slotno = ZeroMultiXactOffsetPage(pageno, false);
+			SimpleLruWritePage(MultiXactOffsetCtl, slotno);
+			Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
 
-		LWLockRelease(lock);
+			LWLockRelease(lock);
+		}
+		else
+			elog(DEBUG1, "skipping initialization of offsets page " INT64_FORMAT " because it was already initialized on multixid creation", pageno);
+		pre_initialized_offsets_page = -1;
 	}
 	else if (info == XLOG_MULTIXACT_ZERO_MEM_PAGE)
 	{
@@ -3440,6 +3516,21 @@ multixact_redo(XLogReaderState *record)
 		TransactionId max_xid;
 		int			i;
 
+		if (pre_initialized_offsets_page != -1)
+		{
+			/*
+			 * If we implicitly initialized the next offsets page while
+			 * replaying a XLOG_MULTIXACT_CREATE_ID record that was generated
+			 * with an older minor version, we still expect to see a
+			 * XLOG_MULTIXACT_ZERO_OFF_PAGE record for it before any other
+			 * XLOG_MULTIXACT_CREATE_ID records.  Therefore this case should
+			 * not happen.  If it does, we'll continue with the replay, but
+			 * log a message to note that something's funny.
+			 */
+			elog(LOG, "expected to see a XLOG_MULTIXACT_ZERO_OFF_PAGE record for page that was implicitly initialized earlier");
+			pre_initialized_offsets_page = -1;
+		}
+
 		/* Store the data back into the SLRU files */
 		RecordNewMultiXact(xlrec->mid, xlrec->moff, xlrec->nmembers,
 						   xlrec->members);
diff --git a/src/test/modules/test_slru/t/001_multixact.pl b/src/test/modules/test_slru/t/001_multixact.pl
index e2b567a603d..7837eb810f0 100644
--- a/src/test/modules/test_slru/t/001_multixact.pl
+++ b/src/test/modules/test_slru/t/001_multixact.pl
@@ -1,10 +1,6 @@
 # Copyright (c) 2024-2025, PostgreSQL Global Development Group
 
-# This test verifies edge case of reading a multixact:
-# when we have multixact that is followed by exactly one another multixact,
-# and another multixact have no offset yet, we must wait until this offset
-# becomes observable. Previously we used to wait for 1ms in a loop in this
-# case, but now we use CV for this. This test is exercising such a sleep.
+# Test multixid corner cases.
 
 use strict;
 use warnings FATAL => 'all';
@@ -19,9 +15,7 @@ if ($ENV{enable_injection_points} ne 'yes')
 	plan skip_all => 'Injection points not supported by this build';
 }
 
-my ($node, $result);
-
-$node = PostgreSQL::Test::Cluster->new('mike');
+my $node = PostgreSQL::Test::Cluster->new('main');
 $node->init;
 $node->append_conf('postgresql.conf',
 	"shared_preload_libraries = 'test_slru,injection_points'");
@@ -29,95 +23,47 @@ $node->start;
 $node->safe_psql('postgres', q(CREATE EXTENSION injection_points));
 $node->safe_psql('postgres', q(CREATE EXTENSION test_slru));
 
-# Test for Multixact generation edge case
-$node->safe_psql('postgres',
-	q{select injection_points_attach('test-multixact-read','wait')});
-$node->safe_psql('postgres',
-	q{select injection_points_attach('multixact-get-members-cv-sleep','wait')}
-);
+# This test creates three multixacts. The middle one is never
+# WAL-logged or recorded on the offsets page, because we pause the
+# backend and crash the server before that. After restart, verify that
+# the other multixacts are readable, despite the middle one being
+# lost.
 
-# This session must observe sleep on the condition variable while generating a
-# multixact.  To achieve this it first will create a multixact, then pause
-# before reading it.
-my $observer = $node->background_psql('postgres');
-
-# This query will create a multixact, and hang just before reading it.
-$observer->query_until(
-	qr/start/,
-	q{
-	\echo start
-	SELECT test_read_multixact(test_create_multixact());
-});
-$node->wait_for_event('client backend', 'test-multixact-read');
-
-# This session will create the next Multixact. This is necessary to avoid
-# multixact.c's non-sleeping edge case 1.
-my $creator = $node->background_psql('postgres');
+# Create the first multixact
+my $bg_psql = $node->background_psql('postgres');
+my $multi1 = $bg_psql->query_safe(q(SELECT test_create_multixact();));
+
+# Assign the middle multixact. Use an injection point to prevent it
+# from being fully recorded.
 $node->safe_psql('postgres',
 	q{SELECT injection_points_attach('multixact-create-from-members','wait');}
 );
 
-# We expect this query to hang in the critical section after generating new
-# multixact, but before filling its offset into SLRU.
-# Running an injection point inside a critical section requires it to be
-# loaded beforehand.
-$creator->query_until(
-	qr/start/, q{
-	\echo start
+$bg_psql->query_until(
+	qr/assigning lost multi/, q(
+\echo assigning lost multi
 	SELECT test_create_multixact();
-});
+));
 
 $node->wait_for_event('client backend', 'multixact-create-from-members');
-
-# Ensure we have the backends waiting that we expect
-is( $node->safe_psql(
-		'postgres',
-		q{SELECT string_agg(wait_event, ', ' ORDER BY wait_event)
-		FROM pg_stat_activity WHERE wait_event_type = 'InjectionPoint'}
-	),
-	'multixact-create-from-members, test-multixact-read',
-	"matching injection point waits");
-
-# Now wake observer to get it to read the initial multixact.  A subsequent
-# multixact already exists, but that one doesn't have an offset assigned, so
-# this will hit multixact.c's edge case 2.
-$node->safe_psql('postgres',
-	q{SELECT injection_points_wakeup('test-multixact-read')});
-$node->wait_for_event('client backend', 'multixact-get-members-cv-sleep');
-
-# Ensure we have the backends waiting that we expect
-is( $node->safe_psql(
-		'postgres',
-		q{SELECT string_agg(wait_event, ', ' ORDER BY wait_event)
-		FROM pg_stat_activity WHERE wait_event_type = 'InjectionPoint'}
-	),
-	'multixact-create-from-members, multixact-get-members-cv-sleep',
-	"matching injection point waits");
-
-# Now we have two backends waiting in multixact-create-from-members and
-# multixact-get-members-cv-sleep.  Also we have 3 injections points set to wait.
-# If we wakeup multixact-get-members-cv-sleep it will happen again, so we must
-# detach it first. So let's detach all injection points, then wake up all
-# backends.
-
-$node->safe_psql('postgres',
-	q{SELECT injection_points_detach('test-multixact-read')});
 $node->safe_psql('postgres',
 	q{SELECT injection_points_detach('multixact-create-from-members')});
-$node->safe_psql('postgres',
-	q{SELECT injection_points_detach('multixact-get-members-cv-sleep')});
 
-$node->safe_psql('postgres',
-	q{SELECT injection_points_wakeup('multixact-create-from-members')});
-$node->safe_psql('postgres',
-	q{SELECT injection_points_wakeup('multixact-get-members-cv-sleep')});
+# Create the third multixid
+my $multi2 = $node->safe_psql('postgres', q{SELECT test_create_multixact();});
+
+# All set and done, it's time for hard restart
+$node->stop('immediate');
+$node->start;
+$bg_psql->{run}->finish;
 
-# Background psql will now be able to read the result and disconnect.
-$observer->quit;
-$creator->quit;
+# Verify that the recorded multixids are readable
+is( $node->safe_psql('postgres', qq{SELECT test_read_multixact('$multi1');}),
+	'',
+	'first recorded multi is readable');
 
-$node->stop;
+is( $node->safe_psql('postgres', qq{SELECT test_read_multixact('$multi2');}),
+	'',
+	'second recorded multi is readable');
 
-# If we reached this point - everything is OK.
-ok(1);
 done_testing();
diff --git a/src/test/modules/test_slru/test_multixact.c b/src/test/modules/test_slru/test_multixact.c
index 6c9b0420717..8fb6c19d70f 100644
--- a/src/test/modules/test_slru/test_multixact.c
+++ b/src/test/modules/test_slru/test_multixact.c
@@ -17,7 +17,6 @@
 #include "access/multixact.h"
 #include "access/xact.h"
 #include "fmgr.h"
-#include "utils/injection_point.h"
 
 PG_FUNCTION_INFO_V1(test_create_multixact);
 PG_FUNCTION_INFO_V1(test_read_multixact);
@@ -37,8 +36,7 @@ test_create_multixact(PG_FUNCTION_ARGS)
 }
 
 /*
- * Reads given multixact after running an injection point. Discards local cache
- * to make a real read.  Tailored for multixact testing.
+ * Reads given multixact.  Discards local cache to make a real read.
  */
 Datum
 test_read_multixact(PG_FUNCTION_ARGS)
@@ -46,7 +44,6 @@ test_read_multixact(PG_FUNCTION_ARGS)
 	MultiXactId id = PG_GETARG_TRANSACTIONID(0);
 	MultiXactMember *members;
 
-	INJECTION_POINT("test-multixact-read", NULL);
 	/* discard caches */
 	AtEOXact_MultiXact();
 
-- 
2.47.3

Re: IPC/MultixactCreation on the Standby server

Reply via email to