On Fri, Jun 30, 2023 at 3:39 AM Andres Freund <and...@anarazel.de> wrote:
> I am wondering if we don't want something more generic than stashing this in
> rd_amcache. Don't want to end up duplicating relevant code across the uses of
> rd_amcache in every AM.

I suppose we could try to track hot pages automatically.  Ants Aasma
mentioned that he was working on a tiny SIMD-based LRU that could be
useful for something like that, without being too slow.  Just for
fun/experimentation, here's a simple attempt to use a very stupid
stand-in LRU to cache the most recent 8 lookups for each fork of each
relation.  Obviously that will miss every time for many workloads, so
it'd have to be pretty close to free, and this code probably isn't
good enough, but perhaps Ants knows how to sprinkle the right magic
fairy dust on it.  It should automatically discover things like root
pages and the heap target block during repeated inserts, visibility
map pages should stick around, and perhaps a few more btree pages
would stay cached where a key range is very hot (but not with
pgbench).
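
Just to make the lookup side concrete, here's a rough sketch (not part
of the attached patches; the names, struct layout and use of
__builtin_ctz are all hypothetical) of how an 8-entry block-number
probe could be vectorised with SSE2, in the spirit of the SIMD approach
Ants alluded to.  The move-to-front bookkeeping would stay scalar:

#include <emmintrin.h>
#include <stdint.h>

/*
 * Hypothetical layout: eight search keys packed together, so the whole
 * 64-byte entry set fits in one typical cache line.
 */
typedef struct TinyLru
{
	uint32_t	blocks[8];		/* block numbers (search keys) */
	int32_t		buffers[8];		/* corresponding buffer IDs */
} TinyLru;

/* Return the slot holding blockNum, or -1 if it isn't cached. */
static inline int
tiny_lru_lookup(const TinyLru *lru, uint32_t blockNum)
{
	__m128i		key = _mm_set1_epi32((int) blockNum);
	__m128i		lo = _mm_loadu_si128((const __m128i *) &lru->blocks[0]);
	__m128i		hi = _mm_loadu_si128((const __m128i *) &lru->blocks[4]);
	uint32_t	mask;

	/* Each matching 32-bit lane contributes four set bits to the mask. */
	mask = (uint32_t) _mm_movemask_epi8(_mm_cmpeq_epi32(lo, key)) |
		((uint32_t) _mm_movemask_epi8(_mm_cmpeq_epi32(hi, key)) << 16);
	if (mask == 0)
		return -1;
	return __builtin_ctz(mask) / 4;	/* first matching byte -> lane index */
}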

> I do wonder if we should have an unlocked pre-check for a) the buffer being
> valid and b) BufferTagsEqual() matching.  With such a pre-check the race for
> increasing the usage count of the wrong buffer is quite small, without the
> pre-check it doesn't seem that small anymore.

Yeah, that makes sense.  Done in this version.
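
As a toy, self-contained illustration of that pattern (unlocked
pre-check, pin via compare-exchange only if valid, then re-check under
the pin), here's a sketch using C11 atomics.  The names and the slot
layout are invented for the example; it isn't PostgreSQL's buffer
header, just the same shape of logic:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SLOT_VALID	0x80000000u		/* high bit: contents are valid */

typedef struct Slot
{
	_Atomic uint32_t state;			/* valid flag + pin count in low bits */
	_Atomic uint64_t tag;			/* which "block" the slot holds */
} Slot;

/*
 * Try to pin 'slot' if it currently holds 'tag'.  Returns true with the
 * slot pinned on success; false (and no pin taken) otherwise.
 */
static bool
try_pin_recent(Slot *slot, uint64_t tag)
{
	uint32_t	old;

	/* Unlocked pre-check: cheaply reject stale hints. */
	if (atomic_load_explicit(&slot->tag, memory_order_relaxed) != tag)
		return false;

	/* Pin with compare-exchange, but only if the slot is valid. */
	old = atomic_load_explicit(&slot->state, memory_order_relaxed);
	do
	{
		if (!(old & SLOT_VALID))
			return false;			/* the skip_if_not_valid behaviour */
	} while (!atomic_compare_exchange_weak(&slot->state, &old, old + 1));

	/* Re-check under the pin: the slot can't be recycled while pinned. */
	if (atomic_load(&slot->tag) == tag)
		return true;

	atomic_fetch_sub(&slot->state, 1);	/* unpin and report a miss */
	return false;
}
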
From d5b9f345961e2adb31213c645e40037f15ba6a83 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Thu, 29 Jun 2023 10:52:56 +1200
Subject: [PATCH v2 1/2] Improve ReadRecentBuffer() scalability.

While testing a new potential use for ReadRecentBuffer(), Andres
reported that it scales badly when called concurrently for the same
buffer by many backends.  Instead of a naive (but wrong) coding with
PinBuffer(), it used the buffer header spinlock, so that it could be
careful to pin the buffer only if it was valid and held the expected
block, to avoid breaking invariants in e.g. GetVictimBuffer().
Unfortunately that made it less scalable than PinBuffer(), which uses
compare-exchange instead.

We can fix that by giving PinBuffer() a new skip_if_not_valid mode that
refuses to pin invalid buffers.  Because the header flags are read
without locking, it might occasionally skip when it shouldn't, but
that's unlikely and perfectly acceptable for an opportunistic
optimisation routine; thanks to the compare-exchange loop, it can only
succeed when it really should.

XXX This also fixes ReadRecentBuffer()'s failure to bump the usage
count.  Fix separately or back-patch this?

Reported-by: Andres Freund <and...@anarazel.de>
Reviewed-by: Andres Freund <and...@anarazel.de>
Discussion: https://postgr.es/m/20230627020546.t6z4tntmj7wmjrfh%40awork3.anarazel.de
---
 src/backend/storage/buffer/bufmgr.c | 63 +++++++++++++----------------
 1 file changed, 29 insertions(+), 34 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3c59bbd04e..f7a8e0d576 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -467,7 +467,8 @@ static BlockNumber ExtendBufferedRelShared(ExtendBufferedWhat eb,
 										   BlockNumber extend_upto,
 										   Buffer *buffers,
 										   uint32 *extended_by);
-static bool PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy);
+static bool PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy,
+					  bool skip_if_not_valid);
 static void PinBuffer_Locked(BufferDesc *buf);
 static void UnpinBuffer(BufferDesc *buf);
 static void BufferSync(int flags);
@@ -635,7 +636,6 @@ ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockN
 	BufferDesc *bufHdr;
 	BufferTag	tag;
 	uint32		buf_state;
-	bool		have_private_ref;
 
 	Assert(BufferIsValid(recent_buffer));
 
@@ -663,38 +663,24 @@ ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockN
 	else
 	{
 		bufHdr = GetBufferDescriptor(recent_buffer - 1);
-		have_private_ref = GetPrivateRefCount(recent_buffer) > 0;
 
 		/*
-		 * Do we already have this buffer pinned with a private reference?  If
-		 * so, it must be valid and it is safe to check the tag without
-		 * locking.  If not, we have to lock the header first and then check.
+		 * Is it still valid and holding the right tag?  We do an unlocked
+		 * tag comparison first, to make it unlikely that we'll increment the
+		 * usage counter of the wrong buffer, if someone calls us with a very
+		 * out of date recent_buffer.  Then we'll check it again if we get the
+		 * pin.
 		 */
-		if (have_private_ref)
-			buf_state = pg_atomic_read_u32(&bufHdr->state);
-		else
-			buf_state = LockBufHdr(bufHdr);
-
-		if ((buf_state & BM_VALID) && BufferTagsEqual(&tag, &bufHdr->tag))
+		if (BufferTagsEqual(&tag, &bufHdr->tag) &&
+			PinBuffer(bufHdr, NULL, true))
 		{
-			/*
-			 * It's now safe to pin the buffer.  We can't pin first and ask
-			 * questions later, because it might confuse code paths like
-			 * InvalidateBuffer() if we pinned a random non-matching buffer.
-			 */
-			if (have_private_ref)
-				PinBuffer(bufHdr, NULL);	/* bump pin count */
-			else
-				PinBuffer_Locked(bufHdr);	/* pin for first time */
-
-			pgBufferUsage.shared_blks_hit++;
-
-			return true;
+			if (BufferTagsEqual(&tag, &bufHdr->tag))
+			{
+				pgBufferUsage.shared_blks_hit++;
+				return true;
+			}
+			UnpinBuffer(bufHdr);
 		}
-
-		/* If we locked the header above, now unlock. */
-		if (!have_private_ref)
-			UnlockBufHdr(bufHdr, buf_state);
 	}
 
 	return false;
@@ -1252,7 +1238,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		 */
 		buf = GetBufferDescriptor(existing_buf_id);
 
-		valid = PinBuffer(buf, strategy);
+		valid = PinBuffer(buf, strategy, false);
 
 		/* Can release the mapping lock as soon as we've pinned it */
 		LWLockRelease(newPartitionLock);
@@ -1330,7 +1316,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 		existing_buf_hdr = GetBufferDescriptor(existing_buf_id);
 
-		valid = PinBuffer(existing_buf_hdr, strategy);
+		valid = PinBuffer(existing_buf_hdr, strategy, false);
 
 		/* Can release the mapping lock as soon as we've pinned it */
 		LWLockRelease(newPartitionLock);
@@ -1979,7 +1965,7 @@ ExtendBufferedRelShared(ExtendBufferedWhat eb,
 			 * Pin the existing buffer before releasing the partition lock,
 			 * preventing it from being evicted.
 			 */
-			valid = PinBuffer(existing_hdr, strategy);
+			valid = PinBuffer(existing_hdr, strategy, false);
 
 			LWLockRelease(partition_lock);
 
@@ -2225,10 +2211,13 @@ ReleaseAndReadBuffer(Buffer buffer,
  * Note that ResourceOwnerEnlargeBuffers must have been done already.
  *
  * Returns true if buffer is BM_VALID, else false.  This provision allows
- * some callers to avoid an extra spinlock cycle.
+ * some callers to avoid an extra spinlock cycle.  If skip_if_not_valid is
+ * true, then a false return value also indicates that the buffer was
+ * (recently) invalid and has not been pinned.
  */
 static bool
-PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
+PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy,
+		  bool skip_if_not_valid)
 {
 	Buffer		b = BufferDescriptorGetBuffer(buf);
 	bool		result;
@@ -2252,6 +2241,12 @@ PinBuffer(BufferDesc *buf, BufferAccessStrategy strategy)
 			if (old_buf_state & BM_LOCKED)
 				old_buf_state = WaitBufHdrUnlocked(buf);
 
+			if (unlikely(skip_if_not_valid && !(old_buf_state & BM_VALID)))
+			{
+				ForgetPrivateRefCountEntry(ref);
+				return false;
+			}
+
 			buf_state = old_buf_state;
 
 			/* increase refcount */
-- 
2.40.1

From 3cf3729c181aeb5b2770940200ec43656977ac97 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.mu...@gmail.com>
Date: Fri, 30 Jun 2023 12:25:26 +1200
Subject: [PATCH v2 2/2] Introduce a tiny LRU cache for buffer mapping.

Remember which buffer held the last N blocks we accessed for each fork
of each relation, so we can try to skip the shared memory buffer mapping
table (and more importantly its partition locks).

XXX This is a very dumb and slow way of coding an LRU mapping table to
show the concept, with the theory that a clever and fast coding might be
possible?
---
 src/backend/storage/buffer/bufmgr.c | 102 ++++++++++++++++++++++++++++
 src/backend/storage/smgr/smgr.c     |   4 ++
 src/include/storage/smgr.h          |  13 ++++
 3 files changed, 119 insertions(+)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f7a8e0d576..34114f6188 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -969,6 +969,87 @@ ExtendBufferedRelTo(ExtendBufferedWhat eb,
 	return buffer;
 }
 
+/*
+ * Try to find a buffer using our tiny LRU cache, to avoid a trip through the
+ * buffer mapping table.  Only for non-local RBM_NORMAL reads.
+ */
+static Buffer
+ReadBuffer_try_recent(SMgrRelation smgr, ForkNumber forkNum,
+					  BlockNumber blockNum)
+{
+	SMgrBufferLruEntry *buffer_lru = &smgr->recent_buffer_lru[forkNum][0];
+
+	Assert(BlockNumberIsValid(blockNum));
+	for (int i = 0; i < SMGR_BUFFER_LRU_SIZE; ++i)
+	{
+		if (buffer_lru[i].block == blockNum)
+		{
+			SMgrBufferLruEntry found = buffer_lru[i];
+
+			if (ReadRecentBuffer(smgr->smgr_rlocator.locator,
+								 forkNum,
+								 blockNum,
+								 found.buffer))
+			{
+				/* Move to front. */
+				if (i > 0)
+				{
+					memmove(&buffer_lru[1],
+							&buffer_lru[0],
+							sizeof(buffer_lru[0]) * i);
+					buffer_lru[0] = found;
+				}
+				return found.buffer;
+			}
+
+			/* Kill this entry and give up. */
+			if (i < SMGR_BUFFER_LRU_SIZE - 1)
+				memmove(&buffer_lru[i],
+						&buffer_lru[i + 1],
+						sizeof(buffer_lru[0]) * (SMGR_BUFFER_LRU_SIZE - i - 1));
+			buffer_lru[SMGR_BUFFER_LRU_SIZE - 1].block = InvalidBlockNumber;
+			break;
+		}
+	}
+
+	return InvalidBuffer;
+}
+
+/*
+ * Remember which buffer a block is in, for later lookups by
+ * ReadBuffer_try_recent().
+ */
+static void
+RememberRecentBuffer(SMgrRelation smgr, ForkNumber forkNum,
+					 BlockNumber blockNum, Buffer buffer)
+{
+	SMgrBufferLruEntry *buffer_lru = &smgr->recent_buffer_lru[forkNum][0];
+
+	/* If it's already there, move to front and update. */
+	for (int i = 0; i < SMGR_BUFFER_LRU_SIZE; ++i)
+	{
+		if (buffer_lru[i].block == blockNum)
+		{
+			if (i > 0)
+			{
+				memmove(&buffer_lru[1],
+						&buffer_lru[0],
+						sizeof(buffer_lru[0]) * i);
+				buffer_lru[0].block = blockNum;
+			}
+			buffer_lru[0].buffer = buffer;
+			return;
+		}
+	}
+
+	/* Otherwise insert at front. */
+	memmove(&buffer_lru[1],
+			&buffer_lru[0],
+			sizeof(buffer_lru[0]) * (SMGR_BUFFER_LRU_SIZE - 1));
+	buffer_lru[0].block = blockNum;
+	buffer_lru[0].buffer = buffer;
+}
+
 /*
  * ReadBuffer_common -- common logic for all ReadBuffer variants
  *
@@ -985,6 +1066,7 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	IOContext	io_context;
 	IOObject	io_object;
 	bool		isLocalBuf = SmgrIsTemp(smgr);
+	Buffer		recent_buffer;
 
 	*hit = false;
 
@@ -1035,6 +1117,20 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 				 mode == RBM_ZERO_ON_ERROR)
 			pgBufferUsage.local_blks_read++;
 	}
+	else if (mode == RBM_NORMAL &&
+			 BufferIsValid((recent_buffer = ReadBuffer_try_recent(smgr,
+																  forkNum,
+																  blockNum))))
+	{
+		/*
+		 * Pinned buffer without having to look it up in the shared buffer
+		 * mapping table.
+		 */
+		found = true;
+		bufHdr = GetBufferDescriptor(recent_buffer - 1);
+		io_context = IOCONTEXT_NORMAL;
+		io_object = IOOBJECT_RELATION;
+	}
 	else
 	{
 		/*
@@ -1050,6 +1146,9 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
 				 mode == RBM_ZERO_ON_ERROR)
 			pgBufferUsage.shared_blks_read++;
+
+		RememberRecentBuffer(smgr, forkNum, blockNum,
+							 BufferDescriptorGetBuffer(bufHdr));
 	}
 
 	/* At this point we do NOT hold any locks. */
@@ -1163,6 +1262,9 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	{
 		/* Set BM_VALID, terminate IO, and wake up any waiters */
 		TerminateBufferIO(bufHdr, false, BM_VALID);
+
+		RememberRecentBuffer(smgr, forkNum, blockNum,
+							 BufferDescriptorGetBuffer(bufHdr));
 	}
 
 	VacuumPageMiss++;
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index f76c4605db..cd04647a54 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -179,7 +179,11 @@ smgropen(RelFileLocator rlocator, BackendId backend)
 		reln->smgr_owner = NULL;
 		reln->smgr_targblock = InvalidBlockNumber;
 		for (int i = 0; i <= MAX_FORKNUM; ++i)
+		{
 			reln->smgr_cached_nblocks[i] = InvalidBlockNumber;
+			for (int j = 0; j < SMGR_BUFFER_LRU_SIZE; ++j)
+				reln->recent_buffer_lru[i][j].block = InvalidBlockNumber;
+		}
 		reln->smgr_which = 0;	/* we only have md.c at present */
 
 		/* implementation-specific initialization */
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index a9a179aaba..6d4b0fc0b4 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -16,8 +16,18 @@
 
 #include "lib/ilist.h"
 #include "storage/block.h"
+#include "storage/buf.h"
 #include "storage/relfilelocator.h"
 
+/* For each fork, remember one (typical) cache line's worth of recent blocks. */
+#define SMGR_BUFFER_LRU_SIZE 8
+
+typedef struct SMgrBufferLruEntry
+{
+	BlockNumber	block;
+	Buffer		buffer;
+} SMgrBufferLruEntry;
+
 /*
  * smgr.c maintains a table of SMgrRelation objects, which are essentially
  * cached file handles.  An SMgrRelation is created (if not already present)
@@ -61,6 +71,9 @@ typedef struct SMgrRelationData
 	 */
 	int			smgr_which;		/* storage manager selector */
 
+	/* for bufmgr.c; cached recently accessed buffers */
+	SMgrBufferLruEntry recent_buffer_lru[MAX_FORKNUM + 1][SMGR_BUFFER_LRU_SIZE];
+
 	/*
 	 * for md.c; per-fork arrays of the number of open segments
 	 * (md_num_open_segs) and the segments themselves (md_seg_fds).
-- 
2.40.1
