On Wed, Jan 14, 2026 at 2:49 AM Chao Li <[email protected]> wrote:
>
>
> I went through the patch set again today, and paid special attention to 0001 
> and 0008, which I don't seem to have reviewed before. Here are the comments I 
> have today:

Thanks! v13 attached.

> --- a/src/include/storage/buf_internals.h
> +++ b/src/include/storage/buf_internals.h
> @@ -15,6 +15,7 @@
>  #ifndef BUFMGR_INTERNALS_H
>  #define BUFMGR_INTERNALS_H
>
> +#include "access/xlogdefs.h"
>
> I tried to build without adding this include, and the build still passed. I think 
> that’s because there is an include path: storage/bufmgr.h -> storage/bufpage.h 
> -> access/xlogdefs.h.
>
> So, maybe we can remove this include.

Generally, at least for new code, we try to avoid relying on transitive includes.

> + * The buffer must be pinned and content locked and the buffer header 
> spinlock
> + * must not be held.
> + *
>   * Returns true if buffer manager should ask for a new victim, and false
>   * if this buffer should be written and re-used.
>   */
>  bool
>  StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool 
> from_ring)
>  {
> +       XLogRecPtr      lsn;
> +
> +       if (!strategy)
> +               return false;
> ```
>
> As the new comment states, the buffer must be pinned, so maybe we can add an 
> assert to ensure that:
> ```
> Assert(BufferIsPinned(buffer));
> ```
>
> Similarly, maybe we can also assert the locks are not held:
> ```
> Assert(BufferDescriptorGetContentLock(buffer));
> Assert(!pg_atomic_read_u32(&buf->state) & BM_LOCKED);

I've added something like this.
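
Concretely, in v13 StrategyRejectBuffer() ends up with:

```
Assert(BufferIsLockedByMe(BufferDescriptorGetBuffer(buf)));
Assert(!(pg_atomic_read_u32(&buf->state) & BM_LOCKED));
```

(Note the extra parentheses around the BM_LOCKED test.)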

>
> 3 - 0001 - bufmgr.c - BufferNeedsWALFlush()
> ```
> +       buffer = BufferDescriptorGetBuffer(bufdesc);
> +       page = BufferGetPage(buffer);
> +
> +       Assert(BufferIsValid(buffer));
> ```
>
> I think the Assert should be moved to before "page = BufferGetPage(buffer);”.

I just deleted the assert -- BufferGetPage() already does it.
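
(For reference, BufferGetPage() reduces to BufferGetBlock(buffer), and
BufferGetBlock() in bufmgr.h starts out roughly like this:

```
static inline Block
BufferGetBlock(Buffer buffer)
{
	Assert(BufferIsValid(buffer));
	...
}
```

so the explicit assert was redundant.)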

> +bool
> +BufferNeedsWALFlush(BufferDesc *bufdesc, bool exclusive, XLogRecPtr *lsn)
> ```
>
> I think the “exclusive" parameter is a bit subtle. The parameter is not 
> explicitly explained in the header comment, though there is a paragraph that 
> explains the different behaviors when the caller does and does not hold the 
> content lock. Maybe we can rename it to a more direct name: 
> hold_exclusive_content_lock, or a shorter one, hold_content_lock.

I've done this.
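
So the declaration in 0001 ended up as:

```
extern bool BufferNeedsWALFlush(BufferDesc *bufdesc, bool exclusive_locked);
```

I went with "exclusive_locked" rather than "hold_content_lock" because a
shared content lock is also okay -- it just means we take the buffer
header spinlock to read the LSN.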

I've also gone and fixed the various typos in code and commit messages
you mentioned.

> +/*
> + * Prepare the buffer with bufdesc for writing. Returns true if the buffer
> + * actually needs writing and false otherwise. lsn returns the buffer's LSN 
> if
> + * the table is logged.
> + */
> +static bool
> +PrepareFlushBuffer(BufferDesc *bufdesc, XLogRecPtr *lsn)
> +{
>         uint32          buf_state;
>
>         /*
> @@ -4425,42 +4445,16 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, 
> IOObject io_object,
>          * someone else flushed the buffer before we could, so we need not do
>          * anything.
>          */
> -       if (!StartBufferIO(buf, false, false))
> -               return;
>
> The header comment says “lsn returns the buffer's LSN if the table is 
> logged”, which looks inaccurate, because if StartBufferIO() returns false, the 
> function returns early without setting *lsn.

Yes, good point. I should initialize it to InvalidXLogRecPtr before
calling StartBufferIO() and document it appropriately. I've done that.
While looking at this, I realized that PrepareFlushBuffer() becomes
useless once I make normal (non-strategy) buffer flushes do write
combining. This inspired a refactor which I think simplifies the code.
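
Concretely, PrepareFlushBuffer() now starts with:

```
*lsn = InvalidXLogRecPtr;

/*
 * Try to start an I/O operation.  If StartBufferIO returns false, then
 * someone else flushed the buffer before we could, so we need not do
 * anything.
 */
if (!StartBufferIO(bufdesc, false, false))
	return false;
```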

Unfortunately, it means this patch set introduces PrepareFlushBuffer()
in 0002 and then deletes it in 0008. I'm not really sure what to do
about that, but maybe it will come to me later.

> +       BlockNumber max_batch_size = 3; /* we only look for two successors */
> ```
>
> Using type BlockNumber for a batch size seems odd. I would suggest changing it 
> to uint32.

Right, at the least, I wasn't consistent with which one to use. I've changed it.
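
The batch size now comes from a uint32 helper, MaxWriteBatchSize() (see
0004):

```
uint32		max_batch_size = MaxWriteBatchSize(strategy);
```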

> 11 - 0008 - WriteBatchInit()
> ```
> +       LockBufHdr(batch_start);
> +       batch->max_lsn = BufferGetLSN(batch_start);
> +       UnlockBufHdr(batch_start);
> ```
>
> Should we check for an unlogged buffer before assigning max_lsn? In previous 
> commits, we have done that in many places.

Good point. When fixing this, I realized that I was getting the page
LSN more times than I needed to for the victim buffer. I restructured
it in a way that I think is clearer.
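
In v13 that spot now reads:

```
buf_state = LockBufHdr(batch_start);
batch->max_lsn = buf_state & BM_PERMANENT ?
	BufferGetLSN(batch_start) : InvalidXLogRecPtr;
UnlockBufHdr(batch_start);
```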

On a separate note, I've made some other changes -- like getting rid
of the BufferBlockRequirements struct I added, because I realized that
was just a BufferTag.
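
PrepareOrRejectEagerFlushBuffer() now just takes an optional BufferTag:

```
static BufferDesc *PrepareOrRejectEagerFlushBuffer(BufferAccessStrategy strategy,
												   Buffer bufnum,
												   BufferTag *require,
												   XLogRecPtr *lsn);
```

Callers that need a particular block fill in "require", and the buffer is
rejected if its tag doesn't match.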

And I found a few places where I hadn't documented my expectations
around the buffer header spinlock. Because I'm checking fields in the
buffer header for a heuristic, I'm bending the rules about concurrency
control. Each time I look at the patch, I find a new mistake in how
I'm handling access to the buffer header.

Speaking of which, periodically I see a test failure related to a
block having invalid contents (in 027_stream_regress). It's not
reproducible, so I haven't investigated much yet. There's at least one
bug in this patch set that needs shaking out, but I think that will
have to wait for another day. I'll probably want to add some tests.

- Melanie
From 5b216371de7d37f88d51fdb330775f2ece1d021f Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Wed, 7 Jan 2026 13:32:18 -0500
Subject: [PATCH v13 1/8] Streamline buffer rejection for bulkreads of unlogged
 tables

Bulk-read buffer access strategies reject reusing a buffer from the
buffer access strategy ring if reusing it would require flushing WAL.
Unlogged relations never require WAL flushes, so this check can be
skipped. This avoids taking the buffer header lock unnecessarily.

Refactor the WAL-flush check into StrategyRejectBuffer() itself, which also
avoids LSN checking for non-bulkread buffer access strategies.

Author: Melanie Plageman <[email protected]>
Reviewed-by: Chao Li <[email protected]>
Discussion: https://postgr.es/m/flat/0198DBB9-4A76-49E4-87F8-43D46DD0FD76%40gmail.com#1d8677fc75dc8b39f0eb5dd6bbafb65d
---
 src/backend/storage/buffer/bufmgr.c   | 60 ++++++++++++++++++++-------
 src/backend/storage/buffer/freelist.c | 14 ++++++-
 src/include/storage/buf_internals.h   |  2 +
 3 files changed, 59 insertions(+), 17 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index a036c2aa275..67f9a210872 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2483,25 +2483,14 @@ again:
 		 * If using a nondefault strategy, and writing the buffer would
 		 * require a WAL flush, let the strategy decide whether to go ahead
 		 * and write/reuse the buffer or to choose another victim.  We need a
-		 * lock to inspect the page LSN, so this can't be done inside
+		 * content lock to inspect the page LSN, so this can't be done inside
 		 * StrategyGetBuffer.
 		 */
-		if (strategy != NULL)
+		if (StrategyRejectBuffer(strategy, buf_hdr, from_ring))
 		{
-			XLogRecPtr	lsn;
-
-			/* Read the LSN while holding buffer header lock */
-			buf_state = LockBufHdr(buf_hdr);
-			lsn = BufferGetLSN(buf_hdr);
-			UnlockBufHdr(buf_hdr);
-
-			if (XLogNeedsFlush(lsn)
-				&& StrategyRejectBuffer(strategy, buf_hdr, from_ring))
-			{
-				LWLockRelease(content_lock);
-				UnpinBuffer(buf_hdr);
-				goto again;
-			}
+			LWLockRelease(content_lock);
+			UnpinBuffer(buf_hdr);
+			goto again;
 		}
 
 		/* OK, do the I/O */
@@ -3416,6 +3405,45 @@ TrackNewBufferPin(Buffer buf)
 							  BLCKSZ);
 }
 
+/*
+ * Returns true if the buffer needs WAL flushed before it can be written out.
+ *
+ * If the result is required to be correct, the caller must hold a buffer
+ * content lock. If they only hold a shared content lock, we'll need to
+ * acquire the buffer header spinlock, so they must not already hold it.
+ */
+bool
+BufferNeedsWALFlush(BufferDesc *bufdesc, bool exclusive_locked)
+{
+	uint32		buf_state = pg_atomic_read_u32(&bufdesc->state);
+	Buffer		buffer;
+	char	   *page;
+	XLogRecPtr	lsn;
+
+	/*
+	 * Unlogged buffers can't need WAL flush. See FlushBuffer() for more
+	 * details on unlogged relations with LSNs.
+	 */
+	if (!(buf_state & BM_PERMANENT))
+		return false;
+
+	buffer = BufferDescriptorGetBuffer(bufdesc);
+	page = BufferGetPage(buffer);
+
+	if (!XLogHintBitIsNeeded() || BufferIsLocal(buffer) || exclusive_locked)
+		lsn = PageGetLSN(page);
+	else
+	{
+		/* Buffer is either share locked or not locked */
+		LockBufHdr(bufdesc);
+		lsn = PageGetLSN(page);
+		UnlockBufHdr(bufdesc);
+	}
+
+	return XLogNeedsFlush(lsn);
+}
+
+
 #define ST_SORT sort_checkpoint_bufferids
 #define ST_ELEMENT_TYPE CkptSortItem
 #define ST_COMPARE(a, b) ckpt_buforder_comparator(a, b)
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 9a93fb335fc..570b933ddb3 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -780,12 +780,18 @@ IOContextForStrategy(BufferAccessStrategy strategy)
  * be written out and doing so would require flushing WAL too.  This gives us
  * a chance to choose a different victim.
  *
+ * The buffer must be pinned and content locked and the buffer header spinlock
+ * must not be held.
+ *
  * Returns true if buffer manager should ask for a new victim, and false
  * if this buffer should be written and re-used.
  */
 bool
 StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_ring)
 {
+	if (!strategy)
+		return false;
+
 	/* We only do this in bulkread mode */
 	if (strategy->btype != BAS_BULKREAD)
 		return false;
@@ -795,8 +801,14 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_r
 		strategy->buffers[strategy->current] != BufferDescriptorGetBuffer(buf))
 		return false;
 
+	Assert(BufferIsLockedByMe(BufferDescriptorGetBuffer(buf)));
+	Assert(!(pg_atomic_read_u32(&buf->state) & BM_LOCKED));
+
+	if (!BufferNeedsWALFlush(buf, false))
+		return false;
+
 	/*
-	 * Remove the dirty buffer from the ring; necessary to prevent infinite
+	 * Remove the dirty buffer from the ring; necessary to prevent an infinite
 	 * loop if all ring members are dirty.
 	 */
 	strategy->buffers[strategy->current] = InvalidBuffer;
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 2f607ea2ac5..916be941a41 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -15,6 +15,7 @@
 #ifndef BUFMGR_INTERNALS_H
 #define BUFMGR_INTERNALS_H
 
+#include "access/xlogdefs.h"
 #include "pgstat.h"
 #include "port/atomics.h"
 #include "storage/aio_types.h"
@@ -522,6 +523,7 @@ extern void ScheduleBufferTagForWriteback(WritebackContext *wb_context,
 										  IOContext io_context, BufferTag *tag);
 
 extern void TrackNewBufferPin(Buffer buf);
+extern bool BufferNeedsWALFlush(BufferDesc *bufdesc, bool exclusive_locked);
 
 /* solely to make it easier to write tests */
 extern bool StartBufferIO(BufferDesc *buf, bool forInput, bool nowait);
-- 
2.43.0

From fdfa439c26493493c0e8c2cb0df1a82bf9920e34 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Wed, 15 Oct 2025 10:54:19 -0400
Subject: [PATCH v13 2/8] Split FlushBuffer() into two parts

Before adding write combining to write a batch of blocks when flushing
dirty buffers, refactor FlushBuffer() into the preparatory step and
actual buffer flushing step. This separation provides symmetry with
future code for batch flushing which necessarily separates these steps,
as it must prepare multiple buffers before flushing them together.

These steps are moved into new helpers, PrepareFlushBuffer() and
DoFlushBuffer(); FlushBuffer() becomes a thin wrapper around them, and
GetVictimBuffer() calls them directly so that future commits can add the
batch flushing path alongside the single-buffer one.

Author: Melanie Plageman <[email protected]>
Reviewed-by: Chao Li <[email protected]>
Reviewed-by: Nazir Bilal Yavuz <[email protected]>
Discussion: https://postgr.es/m/2FA0BAC7-5413-4ABD-94CA-4398FE77750D%40gmail.com
Discussion: https://postgr.es/m/flat/CAAKRu_Yjn4mvN9NBxtmsCQSGwup45CoA4e05nhR7ADP-v0WCig%40mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c | 134 ++++++++++++++++++----------
 1 file changed, 87 insertions(+), 47 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 67f9a210872..0d90b56ec9b 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -631,6 +631,10 @@ static void FlushUnlockedBuffer(BufferDesc *buf, SMgrRelation reln,
 								IOObject io_object, IOContext io_context);
 static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
 						IOObject io_object, IOContext io_context);
+static bool PrepareFlushBuffer(BufferDesc *bufdesc, XLogRecPtr *lsn);
+static void DoFlushBuffer(BufferDesc *buf, SMgrRelation reln,
+						  IOObject io_object, IOContext io_context,
+						  XLogRecPtr buffer_lsn);
 static void FindAndDropRelationBuffers(RelFileLocator rlocator,
 									   ForkNumber forkNum,
 									   BlockNumber nForkBlock,
@@ -2450,6 +2454,7 @@ again:
 	if (buf_state & BM_DIRTY)
 	{
 		LWLock	   *content_lock;
+		XLogRecPtr	victim_lsn;
 
 		Assert(buf_state & BM_TAG_VALID);
 		Assert(buf_state & BM_VALID);
@@ -2493,12 +2498,18 @@ again:
 			goto again;
 		}
 
-		/* OK, do the I/O */
-		FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
-		LWLockRelease(content_lock);
-
-		ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
-									  &buf_hdr->tag);
+		if (!PrepareFlushBuffer(buf_hdr, &victim_lsn))
+		{
+			/* May be nothing to do if buffer was cleaned */
+			LWLockRelease(BufferDescriptorGetContentLock(buf_hdr));
+		}
+		else
+		{
+			DoFlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context, victim_lsn);
+			LWLockRelease(BufferDescriptorGetContentLock(buf_hdr));
+			ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
+										  &buf_hdr->tag);
+		}
 	}
 
 
@@ -3421,8 +3432,8 @@ BufferNeedsWALFlush(BufferDesc *bufdesc, bool exclusive_locked)
 	XLogRecPtr	lsn;
 
 	/*
-	 * Unlogged buffers can't need WAL flush. See FlushBuffer() for more
-	 * details on unlogged relations with LSNs.
+	 * Unlogged buffers can't need WAL flush. See PrepareFlushBuffer() for
+	 * more details on unlogged relations with LSNs.
 	 */
 	if (!(buf_state & BM_PERMANENT))
 		return false;
@@ -4406,54 +4417,38 @@ static void
 FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 			IOContext io_context)
 {
-	XLogRecPtr	recptr;
-	ErrorContextCallback errcallback;
-	instr_time	io_start;
-	Block		bufBlock;
-	char	   *bufToWrite;
+	XLogRecPtr	lsn;
+
+	if (PrepareFlushBuffer(buf, &lsn))
+		DoFlushBuffer(buf, reln, io_object, io_context, lsn);
+}
+
+/*
+ * Prepare the buffer with bufdesc for writing. Returns true if the buffer
+ * actually needs writing and false otherwise. lsn returns the buffer's LSN if
+ * the table is logged and still needs flushing.
+ */
+static bool
+PrepareFlushBuffer(BufferDesc *bufdesc, XLogRecPtr *lsn)
+{
 	uint32		buf_state;
 
+	*lsn = InvalidXLogRecPtr;
+
 	/*
 	 * Try to start an I/O operation.  If StartBufferIO returns false, then
 	 * someone else flushed the buffer before we could, so we need not do
 	 * anything.
 	 */
-	if (!StartBufferIO(buf, false, false))
-		return;
-
-	/* Setup error traceback support for ereport() */
-	errcallback.callback = shared_buffer_write_error_callback;
-	errcallback.arg = buf;
-	errcallback.previous = error_context_stack;
-	error_context_stack = &errcallback;
-
-	/* Find smgr relation for buffer */
-	if (reln == NULL)
-		reln = smgropen(BufTagGetRelFileLocator(&buf->tag), INVALID_PROC_NUMBER);
-
-	TRACE_POSTGRESQL_BUFFER_FLUSH_START(BufTagGetForkNum(&buf->tag),
-										buf->tag.blockNum,
-										reln->smgr_rlocator.locator.spcOid,
-										reln->smgr_rlocator.locator.dbOid,
-										reln->smgr_rlocator.locator.relNumber);
-
-	buf_state = LockBufHdr(buf);
-
-	/*
-	 * Run PageGetLSN while holding header lock, since we don't have the
-	 * buffer locked exclusively in all cases.
-	 */
-	recptr = BufferGetLSN(buf);
+	if (!StartBufferIO(bufdesc, false, false))
+		return false;
 
-	/* To check if block content changes while flushing. - vadim 01/17/97 */
-	UnlockBufHdrExt(buf, buf_state,
-					0, BM_JUST_DIRTIED,
-					0);
+	buf_state = LockBufHdr(bufdesc);
 
 	/*
-	 * Force XLOG flush up to buffer's LSN.  This implements the basic WAL
-	 * rule that log updates must hit disk before any of the data-file changes
-	 * they describe do.
+	 * Record the buffer's LSN. We will force XLOG flush up to buffer's LSN.
+	 * This implements the basic WAL rule that log updates must hit disk
+	 * before any of the data-file changes they describe do.
 	 *
 	 * However, this rule does not apply to unlogged relations, which will be
 	 * lost after a crash anyway.  Most unlogged relation pages do not bear
@@ -4466,9 +4461,54 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 * happen, attempting to flush WAL through that location would fail, with
 	 * disastrous system-wide consequences.  To make sure that can't happen,
 	 * skip the flush if the buffer isn't permanent.
+	 *
+	 * We must hold the buffer header lock when examining the page LSN since
+	 * we don't have buffer exclusively locked in all cases.
 	 */
 	if (buf_state & BM_PERMANENT)
-		XLogFlush(recptr);
+		*lsn = BufferGetLSN(bufdesc);
+
+	/* To check if block content changes while flushing. - vadim 01/17/97 */
+	UnlockBufHdrExt(bufdesc, buf_state,
+					0, BM_JUST_DIRTIED,
+					0);
+	return true;
+}
+
+/*
+ * Actually do the write I/O to clean a buffer. buf and reln may be modified.
+ */
+static void
+DoFlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
+			  IOContext io_context, XLogRecPtr buffer_lsn)
+{
+	ErrorContextCallback errcallback;
+	instr_time	io_start;
+	Block		bufBlock;
+	char	   *bufToWrite;
+
+	/* Setup error traceback support for ereport() */
+	errcallback.callback = shared_buffer_write_error_callback;
+	errcallback.arg = buf;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* Find smgr relation for buffer */
+	if (reln == NULL)
+		reln = smgropen(BufTagGetRelFileLocator(&buf->tag), INVALID_PROC_NUMBER);
+
+	TRACE_POSTGRESQL_BUFFER_FLUSH_START(BufTagGetForkNum(&buf->tag),
+										buf->tag.blockNum,
+										reln->smgr_rlocator.locator.spcOid,
+										reln->smgr_rlocator.locator.dbOid,
+										reln->smgr_rlocator.locator.relNumber);
+
+	/* Force XLOG flush up to buffer's LSN */
+	if (XLogRecPtrIsValid(buffer_lsn))
+	{
+		Assert(pg_atomic_read_u32(&buf->state) & BM_PERMANENT);
+		XLogFlush(buffer_lsn);
+	}
 
 	/*
 	 * Now it's safe to write the buffer to disk. Note that no one else should
-- 
2.43.0

From 3f2845d0bcbc48234f99583b4e2ff1c5220a6fe6 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Wed, 7 Jan 2026 14:56:49 -0500
Subject: [PATCH v13 3/8] Eagerly flush bulkwrite strategy ring

Operations using BAS_BULKWRITE (COPY FROM and createdb) will inevitably
need to flush buffers in the strategy ring in order to reuse them. By
eagerly flushing the buffers in a larger run, we encourage larger writes
at the kernel level and less interleaving of WAL flushes and data file
writes. The effect is mainly noticeable with multiple parallel COPY
FROMs. In this case, client backends achieve higher write throughput and
spend less time waiting to acquire the lock to flush WAL. Larger writes
also mean less time spent waiting on flush operations at the kernel level.

The heuristic for eager flushing is to only flush buffers in the
strategy ring that do not require a WAL flush.

This patch is also a step toward AIO writes, as it lines up multiple
buffers that can be issued asynchronously once the infrastructure
exists.

Author: Melanie Plageman <[email protected]>
Reviewed-by: Chao Li <[email protected]>
Reviewed-by: Nazir Bilal Yavuz <[email protected]>
Earlier version Reviewed-by: Kirill Reshke <[email protected]>
Discussion: https://postgr.es/m/2FA0BAC7-5413-4ABD-94CA-4398FE77750D%40gmail.com
Discussion: https://postgr.es/m/flat/CAAKRu_Yjn4mvN9NBxtmsCQSGwup45CoA4e05nhR7ADP-v0WCig%40mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c   | 153 ++++++++++++++++++++++++++
 src/backend/storage/buffer/freelist.c |  48 ++++++++
 src/include/storage/buf_internals.h   |   4 +
 3 files changed, 205 insertions(+)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 0d90b56ec9b..cab552bbc56 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -631,6 +631,10 @@ static void FlushUnlockedBuffer(BufferDesc *buf, SMgrRelation reln,
 								IOObject io_object, IOContext io_context);
 static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
 						IOObject io_object, IOContext io_context);
+
+static BufferDesc *PrepareOrRejectEagerFlushBuffer(BufferAccessStrategy strategy,
+												   Buffer bufnum,
+												   XLogRecPtr *lsn);
 static bool PrepareFlushBuffer(BufferDesc *bufdesc, XLogRecPtr *lsn);
 static void DoFlushBuffer(BufferDesc *buf, SMgrRelation reln,
 						  IOObject io_object, IOContext io_context,
@@ -2503,6 +2507,59 @@ again:
 			/* May be nothing to do if buffer was cleaned */
 			LWLockRelease(BufferDescriptorGetContentLock(buf_hdr));
 		}
+		else if (from_ring && StrategySupportsEagerFlush(strategy))
+		{
+			Buffer		sweep_end = buf;
+			int			cursor = StrategyGetCurrentIndex(strategy);
+			bool		first_buffer = true;
+			BufferDesc *next_bufdesc = buf_hdr;
+
+			/*
+			 * Flush the victim buffer and then loop around strategy ring one
+			 * time eagerly flushing all of the eligible buffers.
+			 */
+			for (;;)
+			{
+				Buffer		next_buf;
+
+				if (next_bufdesc)
+				{
+					DoFlushBuffer(next_bufdesc, NULL, IOOBJECT_RELATION, io_context, victim_lsn);
+					LWLockRelease(BufferDescriptorGetContentLock(next_bufdesc));
+					ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
+												  &next_bufdesc->tag);
+					/* We leave the first buffer pinned for the caller */
+					if (!first_buffer)
+						UnpinBuffer(next_bufdesc);
+					first_buffer = false;
+				}
+
+				next_buf = StrategyNextBuffer(strategy, &cursor);
+
+				/* Completed one sweep of the strategy ring */
+				if (next_buf == sweep_end)
+					break;
+
+				/*
+				 * For strategies currently supporting eager flush
+				 * (BAS_BULKWRITE, eventually BAS_VACUUM), once you hit an
+				 * InvalidBuffer, the remaining buffers in the ring will be
+				 * invalid. If BAS_BULKREAD is someday supported, this logic
+				 * will have to change.
+				 */
+				if (!BufferIsValid(next_buf))
+					break;
+
+				/*
+				 * Check buffer eager flush eligibility. If the buffer is
+				 * ineligible, we'll keep looking until we complete one full
+				 * sweep around the ring.
+				 */
+				next_bufdesc = PrepareOrRejectEagerFlushBuffer(strategy,
+															   next_buf,
+															   &victim_lsn);
+			}
+		}
 		else
 		{
 			DoFlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context, victim_lsn);
@@ -4423,6 +4480,102 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 		DoFlushBuffer(buf, reln, io_object, io_context, lsn);
 }
 
+/*
+ * Prepare bufdesc for eager flushing.
+ *
+ * Given bufnum, return the buffer descriptor of the buffer to eagerly flush,
+ * pinned and locked and with BM_IO_IN_PROGRESS set, or NULL if this buffer
+ * does not contain a block that should be flushed.
+ *
+ * If returning a buffer, also return its LSN.
+ */
+static BufferDesc *
+PrepareOrRejectEagerFlushBuffer(BufferAccessStrategy strategy,
+								Buffer bufnum,
+								XLogRecPtr *lsn)
+{
+	BufferDesc *bufdesc;
+	uint32		buf_state;
+	LWLock	   *content_lock;
+
+	*lsn = InvalidXLogRecPtr;
+
+	if (!BufferIsValid(bufnum))
+		goto reject_buffer;
+
+	Assert(!BufferIsLocal(bufnum));
+
+	bufdesc = GetBufferDescriptor(bufnum - 1);
+	buf_state = pg_atomic_read_u32(&bufdesc->state);
+
+	/*
+	 * Quick racy check to see if the buffer is clean, in which case we don't
+	 * need to flush it. We'll recheck if it is dirty again later before
+	 * actually setting BM_IO_IN_PROGRESS.
+	 */
+	if (!(buf_state & BM_DIRTY))
+		goto reject_buffer;
+
+	/*
+	 * Quick check to see if the buffer is pinned, in which case it is more
+	 * likely to be dirtied again soon, and we don't want to eagerly flush it.
+	 * We don't care if it has a non-zero usage count because we don't need to
+	 * reuse it right away and a non-zero usage count doesn't necessarily mean
+	 * it will be dirtied again soon.
+	 */
+	if (BUF_STATE_GET_REFCOUNT(buf_state) > 0)
+		goto reject_buffer;
+
+	/*
+	 * Don't eagerly flush buffers requiring WAL flush. We must check this
+	 * again later while holding the buffer content lock for correctness.
+	 */
+	if (BufferNeedsWALFlush(bufdesc, false))
+		goto reject_buffer;
+
+	/*
+	 * Ensure that there's a free refcount entry and resource owner slot for
+	 * the pin before pinning the buffer. While this may leak a refcount and
+	 * slot if we return without a buffer, that slot will be reused.
+	 */
+	ResourceOwnerEnlarge(CurrentResourceOwner);
+	ReservePrivateRefCountEntry();
+
+	/* There is no need to flush the buffer if it is not BM_VALID */
+	if (!PinBuffer(bufdesc, strategy, /* skip_if_not_valid */ true))
+		goto reject_buffer;
+
+	CheckBufferIsPinnedOnce(bufnum);
+
+	content_lock = BufferDescriptorGetContentLock(bufdesc);
+	if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
+		goto reject_buffer_unpin;
+
+	/* Now that we have the lock, recheck if it needs WAL flush */
+	if (BufferNeedsWALFlush(bufdesc, false))
+		goto reject_buffer_unlock;
+
+	/* Try to start an I/O operation */
+	if (!StartBufferIO(bufdesc, false, true))
+		goto reject_buffer_unlock;
+
+	/* Need the buffer header spinlock to read the page LSN */
+	buf_state = LockBufHdr(bufdesc);
+	*lsn = BufferGetLSN(bufdesc);
+	UnlockBufHdrExt(bufdesc, buf_state, 0, BM_JUST_DIRTIED, 0);
+
+	return bufdesc;
+
+reject_buffer_unlock:
+	LWLockRelease(content_lock);
+
+reject_buffer_unpin:
+	UnpinBuffer(bufdesc);
+
+reject_buffer:
+	return NULL;
+}
+
 /*
  * Prepare the buffer with bufdesc for writing. Returns true if the buffer
  * actually needs writing and false otherwise. lsn returns the buffer's LSN if
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 570b933ddb3..cb804fc43ce 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -155,6 +155,31 @@ ClockSweepTick(void)
 	return victim;
 }
 
+/*
+ * Some BufferAccessStrategies support eager flushing -- which is flushing
+ * buffers in the ring before they are needed. This can lead to better I/O
+ * patterns than lazily flushing buffers immediately before reusing them.
+ */
+bool
+StrategySupportsEagerFlush(BufferAccessStrategy strategy)
+{
+	Assert(strategy);
+
+	switch (strategy->btype)
+	{
+		case BAS_BULKWRITE:
+			return true;
+		case BAS_VACUUM:
+		case BAS_NORMAL:
+		case BAS_BULKREAD:
+			return false;
+		default:
+			elog(ERROR, "unrecognized buffer access strategy: %d",
+				 (int) strategy->btype);
+			return false;
+	}
+}
+
 /*
  * StrategyGetBuffer
  *
@@ -306,6 +331,29 @@ StrategyGetBuffer(BufferAccessStrategy strategy, uint32 *buf_state, bool *from_r
 	}
 }
 
+/*
+ * Returns the next buffer in the ring after the one at cursor and increments
+ * cursor.
+ */
+Buffer
+StrategyNextBuffer(BufferAccessStrategy strategy, int *cursor)
+{
+	if (++(*cursor) >= strategy->nbuffers)
+		*cursor = 0;
+
+	return strategy->buffers[*cursor];
+}
+
+/*
+ * Return the current slot in the strategy ring.
+ */
+int
+StrategyGetCurrentIndex(BufferAccessStrategy strategy)
+{
+	return strategy->current;
+}
+
+
 /*
  * StrategySyncStart -- tell BgBufferSync where to start syncing
  *
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 916be941a41..fef5bc5382b 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -532,6 +532,10 @@ extern void TerminateBufferIO(BufferDesc *buf, bool clear_dirty, uint32 set_flag
 
 
 /* freelist.c */
+extern bool StrategySupportsEagerFlush(BufferAccessStrategy strategy);
+extern Buffer StrategyNextBuffer(BufferAccessStrategy strategy,
+								 int *cursor);
+extern int	StrategyGetCurrentIndex(BufferAccessStrategy strategy);
 extern IOContext IOContextForStrategy(BufferAccessStrategy strategy);
 extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
 									 uint32 *buf_state, bool *from_ring);
-- 
2.43.0

From 622f55505ed2eac23cfa3ce4ece0dc5936fc880b Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Wed, 15 Oct 2025 13:42:47 -0400
Subject: [PATCH v13 4/8] Write combining for BAS_BULKWRITE

Implement write combining for users of the bulkwrite buffer access
strategy (e.g. COPY FROM). When the buffer access strategy needs to
clean a buffer for reuse, it already opportunistically flushes some
other buffers. Now, combine any contiguous blocks from the same relation
into larger writes and issue them with smgrwritev().

The performance benefit for COPY FROM is mostly noticeable for multiple
concurrent COPY FROMs because a single COPY FROM is either CPU bound or
bound by WAL writes.

The infrastructure for flushing larger batches of IOs will be reused by
the checkpointer and other processes that write out dirty data.

XXX: Because this sets in-place checksums for batches, it is not
committable until additional infrastructure is in place.

Author: Melanie Plageman <[email protected]>
Reviewed-by: Chao Li <[email protected]>
Discussion: https://postgr.es/m/flat/CAAKRu_bcWRvRwZUop_d9vzF9nHAiT%2B-uPzkJ%3DS3ShZ1GqeAYOw%40mail.gmail.com
---
 src/backend/storage/buffer/bufmgr.c | 315 ++++++++++++++++++++++++++--
 src/backend/storage/page/bufpage.c  |  20 ++
 src/backend/utils/probes.d          |   2 +
 src/include/storage/buf_internals.h |  30 +++
 src/include/storage/bufpage.h       |   2 +
 src/tools/pgindent/typedefs.list    |   1 +
 6 files changed, 354 insertions(+), 16 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index cab552bbc56..8a1a1b92739 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -626,15 +626,24 @@ static inline BufferDesc *BufferAlloc(SMgrRelation smgr,
 									  bool *foundPtr, IOContext io_context);
 static bool AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress);
 static void CheckReadBuffersOperation(ReadBuffersOperation *operation, bool is_complete);
+static uint32 MaxWriteBatchSize(BufferAccessStrategy strategy);
 static Buffer GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context);
 static void FlushUnlockedBuffer(BufferDesc *buf, SMgrRelation reln,
 								IOObject io_object, IOContext io_context);
 static void FlushBuffer(BufferDesc *buf, SMgrRelation reln,
 						IOObject io_object, IOContext io_context);
 
+static BlockNumber WriteBatchInit(BufferDesc *batch_start, uint32 max_batch_size,
+								  BufferWriteBatch *batch);
 static BufferDesc *PrepareOrRejectEagerFlushBuffer(BufferAccessStrategy strategy,
 												   Buffer bufnum,
+												   BufferTag *require,
 												   XLogRecPtr *lsn);
+static void FindStrategyFlushAdjacents(BufferAccessStrategy strategy, Buffer sweep_end,
+									   BufferDesc *batch_start,
+									   uint32 max_batch_size,
+									   BufferWriteBatch *batch,
+									   int *sweep_cursor);
 static bool PrepareFlushBuffer(BufferDesc *bufdesc, XLogRecPtr *lsn);
 static void DoFlushBuffer(BufferDesc *buf, SMgrRelation reln,
 						  IOObject io_object, IOContext io_context,
@@ -2418,6 +2427,34 @@ InvalidateVictimBuffer(BufferDesc *buf_hdr)
 	return true;
 }
 
+/*
+ * Determine the largest IO we can assemble given strategy-specific and global
+ * constraints on the number of pinned buffers and max IO size. Currently only
+ * a single write is inflight at a time, so the batch can consume all the
+ * pinned buffers this backend is allowed. Only for batches of shared
+ * (non-local) relations.
+ */
+static uint32
+MaxWriteBatchSize(BufferAccessStrategy strategy)
+{
+	uint32		result = io_combine_limit;
+	uint32		strategy_pin_limit;
+	uint32		max_pin_limit = GetPinLimit();
+
+	/* Apply pin limits */
+	result = Min(result, max_pin_limit);
+	if (strategy)
+	{
+		strategy_pin_limit = GetAccessStrategyPinLimit(strategy);
+		result = Min(result, strategy_pin_limit);
+	}
+
+	/* Ensure forward progress */
+	result = Max(result, 1);
+
+	return result;
+}
+
 static Buffer
 GetVictimBuffer(BufferAccessStrategy strategy, IOContext io_context)
 {
@@ -2511,27 +2548,48 @@ again:
 		{
 			Buffer		sweep_end = buf;
 			int			cursor = StrategyGetCurrentIndex(strategy);
-			bool		first_buffer = true;
+			uint32		max_batch_size = MaxWriteBatchSize(strategy);
 			BufferDesc *next_bufdesc = buf_hdr;
 
+			/* Pin victim again so it stays ours even after batch released */
+			ReservePrivateRefCountEntry();
+			ResourceOwnerEnlarge(CurrentResourceOwner);
+			IncrBufferRefCount(BufferDescriptorGetBuffer(buf_hdr));
+
 			/*
 			 * Flush the victim buffer and then loop around strategy ring one
-			 * time eagerly flushing all of the eligible buffers.
+			 * time eagerly flushing all of the eligible buffers. IO
+			 * concurrency only needs to be taken into account if AIO writes
+			 * are added in the future.
 			 */
 			for (;;)
 			{
 				Buffer		next_buf;
+				XLogRecPtr	next_buf_lsn;	/* unused */
 
 				if (next_bufdesc)
 				{
-					DoFlushBuffer(next_bufdesc, NULL, IOOBJECT_RELATION, io_context, victim_lsn);
-					LWLockRelease(BufferDescriptorGetContentLock(next_bufdesc));
-					ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
-												  &next_bufdesc->tag);
-					/* We leave the first buffer pinned for the caller */
-					if (!first_buffer)
-						UnpinBuffer(next_bufdesc);
-					first_buffer = false;
+					BufferWriteBatch batch;
+					BlockNumber limit;
+
+					/*
+					 * After finding an eligible buffer, if we are allowed
+					 * more pins and there are more blocks in the relation,
+					 * identify any of the buffers following it which are also
+					 * eligible and combine them into a batch.
+					 *
+					 * The cursor will be advanced through the ring such that
+					 * the next write batch will start at the next eligible
+					 * buffer after the current batch ends.
+					 */
+					limit = WriteBatchInit(next_bufdesc, max_batch_size, &batch);
+					if (limit > 1)
+						FindStrategyFlushAdjacents(strategy, sweep_end,
+												   next_bufdesc,
+												   limit, &batch, &cursor);
+					FlushBufferBatch(&batch, io_context);
+					/* Pins and locks released inside CompleteWriteBatchIO */
+					CompleteWriteBatchIO(&batch, io_context, &BackendWritebackContext);
 				}
 
 				next_buf = StrategyNextBuffer(strategy, &cursor);
@@ -2551,13 +2609,14 @@ again:
 					break;
 
 				/*
-				 * Check buffer eager flush eligibility. If the buffer is
-				 * ineligible, we'll keep looking until we complete one full
-				 * sweep around the ring.
+				 * If the buffer is eligible for eager flushing, it will be
+				 * the start of a new batch.  Otherwise, we'll keep looking
+				 * until we complete one full sweep around the ring.
 				 */
 				next_bufdesc = PrepareOrRejectEagerFlushBuffer(strategy,
 															   next_buf,
-															   &victim_lsn);
+															   NULL,
+															   &next_buf_lsn);
 			}
 		}
 		else
@@ -4480,6 +4539,41 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 		DoFlushBuffer(buf, reln, io_object, io_context, lsn);
 }
 
+/*
+ * Given the starting buffer of a batch, initialize the batch structure. The
+ * starting buffer must be ready to flush.
+ */
+static BlockNumber
+WriteBatchInit(BufferDesc *batch_start, uint32 max_batch_size,
+			   BufferWriteBatch *batch)
+{
+	BlockNumber limit;
+	uint32		buf_state;
+
+	Assert(batch_start);
+	batch->bufdescs[0] = batch_start;
+
+	Assert(BufferIsLockedByMe(BufferDescriptorGetBuffer(batch_start)));
+	buf_state = LockBufHdr(batch_start);
+	batch->max_lsn = buf_state & BM_PERMANENT ?
+		BufferGetLSN(batch_start) : InvalidXLogRecPtr;
+	UnlockBufHdr(batch_start);
+
+	batch->start = batch->bufdescs[0]->tag.blockNum;
+	Assert(BlockNumberIsValid(batch->start));
+	batch->forkno = BufTagGetForkNum(&batch->bufdescs[0]->tag);
+	batch->reln = smgropen(BufTagGetRelFileLocator(&batch->bufdescs[0]->tag),
+						   INVALID_PROC_NUMBER);
+
+	batch->n = 1;
+
+	limit = smgrmaxcombine(batch->reln, batch->forkno, batch->start);
+	limit = Min(max_batch_size, limit);
+	limit = Min(GetAdditionalPinLimit(), limit);
+
+	return limit;
+}
+
 /*
  * Prepare bufdesc for eager flushing.
  *
@@ -4487,11 +4581,16 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
  * pinned and locked and with BM_IO_IN_PROGRESS set, or NULL if this buffer
  * does not contain a block that should be flushed.
  *
+ * If the caller requires a particular block to be in the buffer in order to
+ * accept it, they will provide the required block number and its
+ * RelFileLocator and fork.
+ *
  * If returning a buffer, also return its LSN.
  */
 static BufferDesc *
 PrepareOrRejectEagerFlushBuffer(BufferAccessStrategy strategy,
 								Buffer bufnum,
+								BufferTag *require,
 								XLogRecPtr *lsn)
 {
 	BufferDesc *bufdesc;
@@ -4506,11 +4605,19 @@ PrepareOrRejectEagerFlushBuffer(BufferAccessStrategy strategy,
 	Assert(!BufferIsLocal(bufnum));
 
 	bufdesc = GetBufferDescriptor(bufnum - 1);
+
+	/*
+	 * Quick, unsafe checks to see if buffer even possibly contains a block
+	 * meeting our requirements. We'll recheck it all again after getting a
+	 * pin.
+	 */
+	if (require && !BufferTagsEqual(require, &bufdesc->tag))
+		goto reject_buffer;
+
 	buf_state = pg_atomic_read_u32(&bufdesc->state);
 
 	/*
-	 * Quick racy check to see if the buffer is clean, in which case we don't
-	 * need to flush it. We'll recheck if it is dirty again later before
+	 * We'll recheck if it is dirty later, when we have a pin and lock, before
 	 * actually setting BM_IO_IN_PROGRESS.
 	 */
 	if (!(buf_state & BM_DIRTY))
@@ -4547,6 +4654,10 @@ PrepareOrRejectEagerFlushBuffer(BufferAccessStrategy strategy,
 
 	CheckBufferIsPinnedOnce(bufnum);
 
+	/* Now that we have the buffer pinned, recheck it's got the right block */
+	if (require && !BufferTagsEqual(require, &bufdesc->tag))
+		goto reject_buffer_unpin;
+
 	content_lock = BufferDescriptorGetContentLock(bufdesc);
 	if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
 		goto reject_buffer_unpin;
@@ -4576,6 +4687,137 @@ reject_buffer:
 	return NULL;
 }
 
+/*
+ * Given a starting buffer descriptor from a strategy ring that supports eager
+ * flushing, find additional buffers from the ring that can be combined into a
+ * single write batch with the starting buffer.
+ *
+ * This function will pin and content lock all of the buffers that it
+ * assembles for the IO batch. The caller is responsible for issuing the IO.
+ *
+ * batch_limit is the largest batch we are allowed to construct given the
+ * remaining blocks in the table, the number of available pins, and the
+ * current configuration.
+ *
+ * batch is an output parameter that this function will fill with the needed
+ * information to issue this IO.
+ */
+static void
+FindStrategyFlushAdjacents(BufferAccessStrategy strategy,
+						   Buffer sweep_end,
+						   BufferDesc *batch_start,
+						   uint32 batch_limit,
+						   BufferWriteBatch *batch,
+						   int *sweep_cursor)
+{
+	BufferTag	require;
+
+	Assert(batch_limit > 1);
+
+	InitBufferTag(&require, &batch->reln->smgr_rlocator.locator,
+				  batch->forkno, InvalidBlockNumber);
+
+	/* Now assemble a run of blocks to write out. */
+	for (; batch->n < batch_limit; batch->n++)
+	{
+		Buffer		bufnum;
+		XLogRecPtr	lsn;
+
+		if ((bufnum =
+			 StrategyNextBuffer(strategy, sweep_cursor)) == sweep_end)
+			break;
+
+		/*
+		 * For BAS_BULKWRITE, once you hit an InvalidBuffer, the remaining
+		 * buffers in the ring will be invalid.
+		 */
+		if (!BufferIsValid(bufnum))
+			break;
+
+		require.blockNum = batch->start + batch->n;
+
+		batch->bufdescs[batch->n] =
+			PrepareOrRejectEagerFlushBuffer(strategy, bufnum,
+											&require,
+											&lsn);
+
+		/*
+		 * Because we don't eagerly flush buffers that need WAL flushed, this
+		 * buffer's LSN should only be greater than the victim buffer LSN if
+		 * the victim doesn't need WAL flushing either -- in which case, we
+		 * don't really need to update max_lsn. But, it seems better to keep
+		 * the max_lsn honest -- especially since doing so is cheap.
+		 */
+		if (lsn > batch->max_lsn)
+			batch->max_lsn = lsn;
+
+		/* Stop when we encounter a buffer that will break the run */
+		if (batch->bufdescs[batch->n] == NULL)
+			break;
+
+	}
+}
+
+/*
+ * Given a prepared batch of buffers write them out as a vector.
+ */
+void
+FlushBufferBatch(BufferWriteBatch *batch,
+				 IOContext io_context)
+{
+	BlockNumber blknums[MAX_IO_COMBINE_LIMIT];
+	Block		blocks[MAX_IO_COMBINE_LIMIT];
+	instr_time	io_start;
+	ErrorContextCallback errcallback =
+	{
+		.callback = shared_buffer_write_error_callback,
+		.previous = error_context_stack,
+	};
+
+	error_context_stack = &errcallback;
+
+	if (XLogRecPtrIsValid(batch->max_lsn))
+		XLogFlush(batch->max_lsn);
+
+	/* Should have been opened when initializing the batch */
+	Assert(batch->reln);
+
+#ifdef USE_ASSERT_CHECKING
+	for (uint32 i = 0; i < batch->n; i++)
+		Assert(!BufferNeedsWALFlush(batch->bufdescs[i], false));
+#endif
+
+	TRACE_POSTGRESQL_BUFFER_BATCH_FLUSH_START(batch->forkno,
+											  batch->reln->smgr_rlocator.locator.spcOid,
+											  batch->reln->smgr_rlocator.locator.dbOid,
+											  batch->reln->smgr_rlocator.locator.relNumber,
+											  batch->reln->smgr_rlocator.backend,
+											  batch->n);
+
+	/*
+	 * XXX: All blocks should be copied and then checksummed but doing so
+	 * takes a lot of extra memory and a future patch will eliminate this
+	 * requirement.
+	 */
+	for (BlockNumber i = 0; i < batch->n; i++)
+	{
+		blknums[i] = batch->start + i;
+		blocks[i] = BufHdrGetBlock(batch->bufdescs[i]);
+	}
+
+	PageSetBatchChecksumInplace((Page *) blocks, blknums, batch->n);
+
+	io_start = pgstat_prepare_io_time(track_io_timing);
+
+	smgrwritev(batch->reln, batch->forkno,
+			   batch->start, (const void **) blocks, batch->n, false);
+
+	pgstat_count_io_op_time(IOOBJECT_RELATION, io_context, IOOP_WRITE,
+							io_start, batch->n, BLCKSZ);
+
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * Prepare the buffer with bufdesc for writing. Returns true if the buffer
  * actually needs writing and false otherwise. lsn returns the buffer's LSN if
@@ -4741,6 +4983,47 @@ FlushUnlockedBuffer(BufferDesc *buf, SMgrRelation reln,
 	LWLockRelease(BufferDescriptorGetContentLock(buf));
 }
 
+/*
+ * Given a previously initialized batch with buffers that have already been
+ * flushed, terminate the IO on each buffer and then unlock and unpin them.
+ * This assumes all the buffers were locked and pinned. wb_context will be
+ * modified.
+ */
+void
+CompleteWriteBatchIO(BufferWriteBatch *batch, IOContext io_context,
+					 WritebackContext *wb_context)
+{
+	ErrorContextCallback errcallback =
+	{
+		.callback = shared_buffer_write_error_callback,
+		.previous = error_context_stack,
+	};
+
+	error_context_stack = &errcallback;
+	pgBufferUsage.shared_blks_written += batch->n;
+
+	for (uint32 i = 0; i < batch->n; i++)
+	{
+		Buffer		buffer = BufferDescriptorGetBuffer(batch->bufdescs[i]);
+
+		errcallback.arg = batch->bufdescs[i];
+
+		/* Mark the buffer as clean and end the BM_IO_IN_PROGRESS state. */
+		TerminateBufferIO(batch->bufdescs[i], true, 0, true, false);
+		UnlockReleaseBuffer(buffer);
+		ScheduleBufferTagForWriteback(wb_context, io_context,
+									  &batch->bufdescs[i]->tag);
+	}
+
+	TRACE_POSTGRESQL_BUFFER_BATCH_FLUSH_DONE(batch->forkno,
+											 batch->reln->smgr_rlocator.locator.spcOid,
+											 batch->reln->smgr_rlocator.locator.dbOid,
+											 batch->reln->smgr_rlocator.locator.relNumber,
+											 batch->reln->smgr_rlocator.backend,
+											 batch->n, batch->start);
+	error_context_stack = errcallback.previous;
+}
+
 /*
  * RelationGetNumberOfBlocksInFork
  *		Determines the current number of pages in the specified relation fork.
diff --git a/src/backend/storage/page/bufpage.c b/src/backend/storage/page/bufpage.c
index de85911e3ac..4d0f1883a26 100644
--- a/src/backend/storage/page/bufpage.c
+++ b/src/backend/storage/page/bufpage.c
@@ -1546,3 +1546,23 @@ PageSetChecksumInplace(Page page, BlockNumber blkno)
 
 	((PageHeader) page)->pd_checksum = pg_checksum_page(page, blkno);
 }
+
+/*
+ * A helper to set multiple blocks' checksums
+ */
+void
+PageSetBatchChecksumInplace(Page *pages, const BlockNumber *blknos, uint32 length)
+{
+	/* If we don't need a checksum, just return */
+	if (!DataChecksumsEnabled())
+		return;
+
+	for (uint32 i = 0; i < length; i++)
+	{
+		Page		page = pages[i];
+
+		if (PageIsNew(page))
+			continue;
+		((PageHeader) page)->pd_checksum = pg_checksum_page(page, blknos[i]);
+	}
+}
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index 1929521c6a5..e0f48c6d2d9 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -61,6 +61,8 @@ provider postgresql {
 	probe buffer__flush__done(ForkNumber, BlockNumber, Oid, Oid, Oid);
 	probe buffer__extend__start(ForkNumber, Oid, Oid, Oid, int, unsigned int);
 	probe buffer__extend__done(ForkNumber, Oid, Oid, Oid, int, unsigned int, BlockNumber);
+	probe buffer__batch__flush__start(ForkNumber, Oid, Oid, Oid, int, unsigned int);
+	probe buffer__batch__flush__done(ForkNumber, Oid, Oid, Oid, int, unsigned int, BlockNumber);
 
 	probe buffer__checkpoint__start(int);
 	probe buffer__checkpoint__sync__start();
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index fef5bc5382b..e9d8f3e6810 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -513,6 +513,33 @@ ResourceOwnerForgetBufferIO(ResourceOwner owner, Buffer buffer)
 	ResourceOwnerForget(owner, Int32GetDatum(buffer), &buffer_io_resowner_desc);
 }
 
+/*
+ * Used to write out multiple blocks at a time in a combined IO. bufdescs
+ * contains buffer descriptors for buffers containing adjacent blocks of the
+ * same fork of the same relation.
+ */
+typedef struct BufferWriteBatch
+{
+	ForkNumber	forkno;
+	SMgrRelation reln;
+
+	/*
+	 * The BlockNumber of the first block in the run of contiguous blocks to
+	 * be written out as a single IO.
+	 */
+	BlockNumber start;
+
+	/*
+	 * While assembling the buffers, we keep track of the maximum LSN so that
+	 * we can flush WAL through this LSN before flushing the buffers.
+	 */
+	XLogRecPtr	max_lsn;
+
+	/* The number of valid buffers in bufdescs */
+	uint32		n;
+	BufferDesc *bufdescs[MAX_IO_COMBINE_LIMIT];
+} BufferWriteBatch;
+
 /*
  * Internal buffer management routines
  */
@@ -521,6 +548,7 @@ extern void WritebackContextInit(WritebackContext *context, int *max_pending);
 extern void IssuePendingWritebacks(WritebackContext *wb_context, IOContext io_context);
 extern void ScheduleBufferTagForWriteback(WritebackContext *wb_context,
 										  IOContext io_context, BufferTag *tag);
+extern void FlushBufferBatch(BufferWriteBatch *batch, IOContext io_context);
 
 extern void TrackNewBufferPin(Buffer buf);
 extern bool BufferNeedsWALFlush(BufferDesc *bufdesc, bool exclusive_locked);
@@ -536,6 +564,8 @@ extern bool StrategySupportsEagerFlush(BufferAccessStrategy strategy);
 extern Buffer StrategyNextBuffer(BufferAccessStrategy strategy,
 								 int *cursor);
 extern int	StrategyGetCurrentIndex(BufferAccessStrategy strategy);
+extern void CompleteWriteBatchIO(BufferWriteBatch *batch, IOContext io_context,
+								 WritebackContext *wb_context);
 extern IOContext IOContextForStrategy(BufferAccessStrategy strategy);
 extern BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
 									 uint32 *buf_state, bool *from_ring);
diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
index ae3725b3b81..baadfc6c313 100644
--- a/src/include/storage/bufpage.h
+++ b/src/include/storage/bufpage.h
@@ -506,5 +506,7 @@ extern bool PageIndexTupleOverwrite(Page page, OffsetNumber offnum,
 									const void *newtup, Size newsize);
 extern char *PageSetChecksumCopy(Page page, BlockNumber blkno);
 extern void PageSetChecksumInplace(Page page, BlockNumber blkno);
+extern void PageSetBatchChecksumInplace(Page *pages, const BlockNumber *blknos,
+										uint32 length);
 
 #endif							/* BUFPAGE_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 14dec2d49c1..bfdc6689fbf 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -359,6 +359,7 @@ BufferManagerRelation
 BufferStrategyControl
 BufferTag
 BufferUsage
+BufferWriteBatch
 BuildAccumulator
 BuiltinScript
 BulkInsertState
-- 
2.43.0

From a1ca838809821283240e2ba574a7a02b7150b8d3 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Tue, 2 Sep 2025 15:22:11 -0400
Subject: [PATCH v13 5/8] Add database Oid to CkptSortItem

Its absence was an oversight that currently isn't causing harm. However,
the database Oid is required for checkpointer write combining, which will
be added in a future commit.

Author: Melanie Plageman <[email protected]>
Reviewed-by: Chao Li <[email protected]>
Discussion: https://postgr.es/m/2FA0BAC7-5413-4ABD-94CA-4398FE77750D%40gmail.com
---
 src/backend/storage/buffer/bufmgr.c | 8 ++++++++
 src/include/storage/buf_internals.h | 1 +
 2 files changed, 9 insertions(+)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 8a1a1b92739..fc8c9b366a2 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3650,6 +3650,7 @@ BufferSync(int flags)
 			item = &CkptBufferIds[num_to_scan++];
 			item->buf_id = buf_id;
 			item->tsId = bufHdr->tag.spcOid;
+			item->dbId = bufHdr->tag.dbOid;
 			item->relNumber = BufTagGetRelNumber(&bufHdr->tag);
 			item->forkNum = BufTagGetForkNum(&bufHdr->tag);
 			item->blockNum = bufHdr->tag.blockNum;
@@ -6973,6 +6974,13 @@ ckpt_buforder_comparator(const CkptSortItem *a, const CkptSortItem *b)
 		return -1;
 	else if (a->tsId > b->tsId)
 		return 1;
+
+	/* compare database */
+	if (a->dbId < b->dbId)
+		return -1;
+	else if (a->dbId > b->dbId)
+		return 1;
+
 	/* compare relation */
 	if (a->relNumber < b->relNumber)
 		return -1;
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index e9d8f3e6810..90281258518 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -479,6 +479,7 @@ extern uint32 WaitBufHdrUnlocked(BufferDesc *buf);
 typedef struct CkptSortItem
 {
 	Oid			tsId;
+	Oid			dbId;
 	RelFileNumber relNumber;
 	ForkNumber	forkNum;
 	BlockNumber blockNum;
-- 
2.43.0

From beb4f90881d3d05f3c1d8c1fbde6a03aa46cb9e1 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Wed, 15 Oct 2025 15:23:16 -0400
Subject: [PATCH v13 6/8] Implement checkpointer data write combining

When the checkpointer writes out dirty buffers, writing multiple
contiguous blocks as a single IO is a substantial performance
improvement. The checkpointer is usually bottlenecked on IO, so issuing
larger IOs leads to increased write throughput and faster checkpoints.

Author: Melanie Plageman <[email protected]>
Reviewed-by: Chao Li <[email protected]>
Reviewed-by: Soumya <[email protected]>
Discussion: https://postgr.es/m/2FA0BAC7-5413-4ABD-94CA-4398FE77750D%40gmail.com
---
 src/backend/storage/buffer/bufmgr.c | 232 ++++++++++++++++++++++++----
 src/backend/utils/probes.d          |   2 +-
 2 files changed, 204 insertions(+), 30 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index fc8c9b366a2..a08a0afe2c1 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3591,8 +3591,6 @@ BufferNeedsWALFlush(BufferDesc *bufdesc, bool exclusive_locked)
 static void
 BufferSync(int flags)
 {
-	uint32		buf_state;
-	int			buf_id;
 	int			num_to_scan;
 	int			num_spaces;
 	int			num_processed;
@@ -3603,6 +3601,8 @@ BufferSync(int flags)
 	int			i;
 	uint32		mask = BM_DIRTY;
 	WritebackContext wb_context;
+	uint32		max_batch_size;
+	BufferWriteBatch batch;
 
 	/*
 	 * Unless this is a shutdown checkpoint or we have been explicitly told,
@@ -3630,10 +3630,11 @@ BufferSync(int flags)
 	 * certainly need to be written for the next checkpoint attempt, too.
 	 */
 	num_to_scan = 0;
-	for (buf_id = 0; buf_id < NBuffers; buf_id++)
+	for (int buf_id = 0; buf_id < NBuffers; buf_id++)
 	{
 		BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
 		uint32		set_bits = 0;
+		uint32		buf_state;
 
 		/*
 		 * Header spinlock is enough to examine BM_DIRTY, see comment in
@@ -3776,48 +3777,221 @@ BufferSync(int flags)
 	 */
 	num_processed = 0;
 	num_written = 0;
+	max_batch_size = MaxWriteBatchSize(NULL);
 	while (!binaryheap_empty(ts_heap))
 	{
+		BlockNumber limit = max_batch_size;
 		BufferDesc *bufHdr = NULL;
 		CkptTsStatus *ts_stat = (CkptTsStatus *)
 			DatumGetPointer(binaryheap_first(ts_heap));
+		int			ts_end = ts_stat->index - ts_stat->num_scanned + ts_stat->num_to_scan;
+		int			processed = 0;
 
-		buf_id = CkptBufferIds[ts_stat->index].buf_id;
-		Assert(buf_id != -1);
+		batch.start = InvalidBlockNumber;
+		batch.max_lsn = InvalidXLogRecPtr;
+		batch.n = 0;
 
-		bufHdr = GetBufferDescriptor(buf_id);
+		while (batch.n < limit)
+		{
+			uint32		buf_state;
+			XLogRecPtr	lsn = InvalidXLogRecPtr;
+			LWLock	   *content_lock;
+			CkptSortItem item;
+			Buffer		buffer;
 
-		num_processed++;
+			if (ProcSignalBarrierPending)
+				ProcessProcSignalBarrier();
 
-		/*
-		 * We don't need to acquire the lock here, because we're only looking
-		 * at a single bit. It's possible that someone else writes the buffer
-		 * and clears the flag right after we check, but that doesn't matter
-		 * since SyncOneBuffer will then do nothing.  However, there is a
-		 * further race condition: it's conceivable that between the time we
-		 * examine the bit here and the time SyncOneBuffer acquires the lock,
-		 * someone else not only wrote the buffer but replaced it with another
-		 * page and dirtied it.  In that improbable case, SyncOneBuffer will
-		 * write the buffer though we didn't need to.  It doesn't seem worth
-		 * guarding against this, though.
-		 */
-		if (pg_atomic_read_u32(&bufHdr->state) & BM_CHECKPOINT_NEEDED)
-		{
-			if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
+			/* Check if we are done with this tablespace */
+			if (ts_stat->index + processed >= ts_end)
+				break;
+
+			item = CkptBufferIds[ts_stat->index + processed];
+
+			Assert(item.buf_id != -1);
+
+			bufHdr = GetBufferDescriptor(item.buf_id);
+			buffer = BufferDescriptorGetBuffer(bufHdr);
+
+			/*
+			 * If this is the first block of the batch, then check if we need
+			 * to open a new relation. Open the relation now because we have
+			 * to determine the maximum IO size based on how many blocks
+			 * remain in the file.
+			 */
+			if (!BlockNumberIsValid(batch.start))
+			{
+				RelFileLocator rlocator = {
+					.spcOid = item.tsId,
+					.dbOid = item.dbId,
+					.relNumber = item.relNumber
+				};
+
+				Assert(batch.max_lsn == InvalidXLogRecPtr && batch.n == 0);
+				batch.forkno = item.forkNum;
+				batch.start = item.blockNum;
+				batch.reln = smgropen(rlocator, INVALID_PROC_NUMBER);
+				limit = smgrmaxcombine(batch.reln, batch.forkno, batch.start);
+				limit = Min(max_batch_size, limit);
+				limit = Min(GetAdditionalPinLimit(), limit);
+				/* Guarantee progress */
+				limit = Max(limit, 1);
+			}
+
+			/*
+			 * Once we hit blocks from the next relation or fork of the
+			 * relation, break out of the loop and issue the IO we've built up
+			 * so far. It is important that we don't increment processed
+			 * because we want to start the next IO with this item.
+			 */
+			if (item.dbId != batch.reln->smgr_rlocator.locator.dbOid ||
+				item.relNumber != batch.reln->smgr_rlocator.locator.relNumber ||
+				item.forkNum != batch.forkno)
+				break;
+
+			Assert(item.tsId == batch.reln->smgr_rlocator.locator.spcOid);
+
+			/*
+			 * If the next block is not contiguous, we can't include it in the
+			 * IO we will issue. Break out of the loop and issue what we have
+			 * so far. Do not count this item as processed -- otherwise we
+			 * will end up skipping it.
+			 */
+			if (item.blockNum != batch.start + batch.n)
+				break;
+
+			/*
+			 * We don't need to acquire the lock here, because we're only
+			 * looking at a few bits. It's possible that someone else writes
+			 * the buffer and clears the flag right after we check, but that
+			 * doesn't matter since StartBufferIO will then return false.
+			 *
+			 * If the buffer no longer needs checkpointing, or is no longer
+			 * valid and dirty, there is nothing to flush for this item: count
+			 * it as processed and break out of the loop to issue the IO we
+			 * have built up so far.
+			 */
+			buf_state = pg_atomic_read_u32(&bufHdr->state);
+			if ((buf_state & (BM_CHECKPOINT_NEEDED | BM_VALID | BM_DIRTY)) !=
+				(BM_CHECKPOINT_NEEDED | BM_VALID | BM_DIRTY))
+			{
+				processed++;
+				break;
+			}
+
+			ReservePrivateRefCountEntry();
+			ResourceOwnerEnlarge(CurrentResourceOwner);
+
+			/* If the buffer is not BM_VALID, nothing to do on this buffer */
+			if (!PinBuffer(bufHdr, NULL, true))
+			{
+				processed++;
+				break;
+			}
+
+			/*
+			 * Now that we have a pin, we must recheck that the buffer
+			 * contains the specified block. Someone may have replaced the
+			 * block in the buffer with a different block. In that case, count
+			 * it as processed and issue the IO so far. These fields won't
+			 * change as long as we hold a pin, so we don't need a spinlock to
+			 * read them.
+			 */
+			if (!BufTagMatchesRelFileLocator(&bufHdr->tag,
+											 &batch.reln->smgr_rlocator.locator) ||
+				BufTagGetForkNum(&bufHdr->tag) != batch.forkno ||
+				BufferGetBlockNumber(buffer) != batch.start + batch.n)
 			{
-				TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
-				PendingCheckpointerStats.buffers_written++;
-				num_written++;
+				UnpinBuffer(bufHdr);
+				processed++;
+				break;
 			}
+
+			/*
+			 * It's conceivable that, between the time we examined the buffer
+			 * header for BM_CHECKPOINT_NEEDED above and the time we acquire
+			 * the content lock below, someone else wrote the buffer out. In
+			 * that improbable case, we will write the buffer though we didn't
+			 * need to. It doesn't seem worth guarding against this, though.
+			 */
+			content_lock = BufferDescriptorGetContentLock(bufHdr);
+
+			/*
+			 * We are willing to wait for the content lock on the first buffer
+			 * in the batch. For subsequent buffers, though, waiting while
+			 * already holding other content locks could lead to deadlock. We
+			 * still have to flush all eligible buffers eventually, so if we
+			 * fail to acquire the lock on a subsequent buffer, we break out
+			 * and issue the IO we've built up so far, then come back and
+			 * start a new IO with that buffer as the start. As such, we must
+			 * not count the item as processed if we fail to acquire the lock.
+			 */
+			if (batch.n == 0)
+				LWLockAcquire(content_lock, LW_SHARED);
+			else if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
+			{
+				UnpinBuffer(bufHdr);
+				break;
+			}
+
+			/*
+			 * If the buffer doesn't need IO, count the item as processed,
+			 * release the buffer, and break out of the loop to issue the IO
+			 * we have built up so far.
+			 */
+			if (!StartBufferIO(bufHdr, false, true))
+			{
+				LWLockRelease(content_lock);
+				UnpinBuffer(bufHdr);
+				processed++;
+				break;
+			}
+
+			/*
+			 * Take the buffer header lock before examining the LSN because we
+			 * only hold a shared content lock on the buffer.
+			 */
+			buf_state = LockBufHdr(bufHdr);
+			lsn = BufferGetLSN(bufHdr);
+			UnlockBufHdrExt(bufHdr, buf_state, 0, BM_JUST_DIRTIED, 0);
+
+			/*
+			 * Keep track of the max LSN so that we can be sure to flush
+			 * enough WAL before flushing data from the buffers. See comment
+			 * in DoFlushBuffer() for more on why we don't consider the LSNs
+			 * of unlogged relations.
+			 */
+			if ((buf_state & BM_PERMANENT) && lsn > batch.max_lsn)
+				batch.max_lsn = lsn;
+
+			batch.bufdescs[batch.n++] = bufHdr;
+			processed++;
 		}
 
 		/*
 		 * Measure progress independent of actually having to flush the buffer
-		 * - otherwise writing become unbalanced.
+		 * - otherwise writing becomes unbalanced.
 		 */
-		ts_stat->progress += ts_stat->progress_slice;
-		ts_stat->num_scanned++;
-		ts_stat->index++;
+		num_processed += processed;
+		ts_stat->progress += ts_stat->progress_slice * processed;
+		ts_stat->num_scanned += processed;
+		ts_stat->index += processed;
+
+		/*
+		 * If we built up an IO, issue it. We may not have found any items
+		 * referencing buffers that needed flushing this time, but since we
+		 * examined and processed items above, we still need to check below
+		 * whether the heap should be updated.
+		 */
+		if (batch.n > 0)
+		{
+			FlushBufferBatch(&batch, IOCONTEXT_NORMAL);
+			CompleteWriteBatchIO(&batch, IOCONTEXT_NORMAL, &wb_context);
+
+			TRACE_POSTGRESQL_BUFFER_BATCH_SYNC_WRITTEN(batch.n);
+			PendingCheckpointerStats.buffers_written += batch.n;
+			num_written += batch.n;
+		}
 
 		/* Have all the buffers from the tablespace been processed? */
 		if (ts_stat->num_scanned == ts_stat->num_to_scan)
diff --git a/src/backend/utils/probes.d b/src/backend/utils/probes.d
index e0f48c6d2d9..90169c92c26 100644
--- a/src/backend/utils/probes.d
+++ b/src/backend/utils/probes.d
@@ -68,7 +68,7 @@ provider postgresql {
 	probe buffer__checkpoint__sync__start();
 	probe buffer__checkpoint__done();
 	probe buffer__sync__start(int, int);
-	probe buffer__sync__written(int);
+	probe buffer__batch__sync__written(BlockNumber);
 	probe buffer__sync__done(int, int, int);
 
 	probe deadlock__found();
-- 
2.43.0

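A note on the shape of the new BufferSync() loop, since it replaces a very
simple per-buffer loop: it groups the sorted checkpoint items into runs of
contiguous blocks from the same relation fork and issues one write per run.
Here is a standalone, simplified sketch of just that grouping step -- the
SortItem type and the MAX_BATCH constant are made up for illustration and
are not the patch's actual API:

#include <stdio.h>

#define MAX_BATCH 4				/* stand-in for MaxWriteBatchSize() */

typedef struct SortItem
{
	unsigned int relNumber;
	int			forkNum;
	unsigned int blockNum;
} SortItem;

int
main(void)
{
	SortItem	items[] = {
		{1000, 0, 10}, {1000, 0, 11}, {1000, 0, 12},	/* contiguous run */
		{1000, 0, 20},			/* gap: starts a new run */
		{1001, 0, 5}, {1001, 0, 6},	/* different relation: new run */
	};
	int			nitems = sizeof(items) / sizeof(items[0]);
	int			i = 0;

	while (i < nitems)
	{
		SortItem	start = items[i];
		int			n = 1;

		/*
		 * Extend the run while the next item is the immediately following
		 * block of the same relation fork and we are under the batch limit.
		 */
		while (i + n < nitems && n < MAX_BATCH &&
			   items[i + n].relNumber == start.relNumber &&
			   items[i + n].forkNum == start.forkNum &&
			   items[i + n].blockNum == start.blockNum + n)
			n++;

		printf("batch: rel %u fork %d blocks %u..%u (%d blocks)\n",
			   start.relNumber, start.forkNum,
			   start.blockNum, start.blockNum + n - 1, n);

		/* the next run starts at the first item that broke this one */
		i += n;
	}

	return 0;
}

The real loop additionally caps each run by smgrmaxcombine(), the remaining
pin limit, and the configured maximum batch size, and it drops items whose
buffers turn out not to need flushing by the time they are examined.
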
From 974fff42b4ad9860e673e593d03dc1b971ecd615 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Wed, 15 Oct 2025 16:16:58 -0400
Subject: [PATCH v13 7/8] Refactor SyncOneBuffer for bgwriter use only

Since xxx, only bgwriter uses SyncOneBuffer, so we can remove the
skip_recently_used parameter and make that behavior the default.

While we are at it, adopt the pattern introduced in 5e89985928795f243 of
using a CAS loop instead of locking the buffer header and then calling
PinBuffer_Locked(). Doing so in SyncOneBuffer() lets us avoid taking the
buffer header spinlock in the common case that the buffer is recently
used.

Author: Melanie Plageman <[email protected]>
Reviewed-by: Chao Li <[email protected]>
Discussion: https://postgr.es/m/2FA0BAC7-5413-4ABD-94CA-4398FE77750D%40gmail.com
---
 src/backend/storage/buffer/bufmgr.c | 96 +++++++++++++++++------------
 1 file changed, 56 insertions(+), 40 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index a08a0afe2c1..dd75deb13c1 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -612,8 +612,7 @@ static void PinBuffer_Locked(BufferDesc *buf);
 static void UnpinBuffer(BufferDesc *buf);
 static void UnpinBufferNoOwner(BufferDesc *buf);
 static void BufferSync(int flags);
-static int	SyncOneBuffer(int buf_id, bool skip_recently_used,
-						  WritebackContext *wb_context);
+static int	SyncOneBuffer(int buf_id, WritebackContext *wb_context);
 static void WaitIO(BufferDesc *buf);
 static void AbortBufferIO(Buffer buffer);
 static void shared_buffer_write_error_callback(void *arg);
@@ -4265,8 +4264,7 @@ BgBufferSync(WritebackContext *wb_context)
 	/* Execute the LRU scan */
 	while (num_to_scan > 0 && reusable_buffers < upcoming_alloc_est)
 	{
-		int			sync_state = SyncOneBuffer(next_to_clean, true,
-											   wb_context);
+		int			sync_state = SyncOneBuffer(next_to_clean, wb_context);
 
 		if (++next_to_clean >= NBuffers)
 		{
@@ -4329,8 +4327,8 @@ BgBufferSync(WritebackContext *wb_context)
 /*
  * SyncOneBuffer -- process a single buffer during syncing.
  *
- * If skip_recently_used is true, we don't write currently-pinned buffers, nor
- * buffers marked recently used, as these are not replacement candidates.
+ * We don't write currently-pinned buffers, nor buffers marked recently used,
+ * as these are not replacement candidates.
  *
  * Returns a bitmask containing the following flag bits:
  *	BUF_WRITTEN: we wrote the buffer.
@@ -4341,53 +4339,71 @@ BgBufferSync(WritebackContext *wb_context)
  * after locking it, but we don't care all that much.)
  */
 static int
-SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
+SyncOneBuffer(int buf_id, WritebackContext *wb_context)
 {
 	BufferDesc *bufHdr = GetBufferDescriptor(buf_id);
 	int			result = 0;
+	uint32		old_buf_state;
 	uint32		buf_state;
 	BufferTag	tag;
 
-	/* Make sure we can handle the pin */
-	ReservePrivateRefCountEntry();
-	ResourceOwnerEnlarge(CurrentResourceOwner);
-
 	/*
-	 * Check whether buffer needs writing.
-	 *
-	 * We can make this check without taking the buffer content lock so long
-	 * as we mark pages dirty in access methods *before* logging changes with
-	 * XLogInsert(): if someone marks the buffer dirty just after our check we
-	 * don't worry because our checkpoint.redo points before log record for
-	 * upcoming changes and so we are not required to write such dirty buffer.
+	 * Check whether the buffer can be used and pin it if so. Do this using a
+	 * CAS loop, to avoid having to lock the buffer header.
 	 */
-	buf_state = LockBufHdr(bufHdr);
-
-	if (BUF_STATE_GET_REFCOUNT(buf_state) == 0 &&
-		BUF_STATE_GET_USAGECOUNT(buf_state) == 0)
+	old_buf_state = pg_atomic_read_u32(&bufHdr->state);
+	for (;;)
 	{
+		buf_state = old_buf_state;
+
+		/*
+		 * We can make these checks without taking the buffer content lock so
+		 * long as we mark pages dirty in access methods *before* logging
+		 * changes with XLogInsert(): if someone marks the buffer dirty just
+		 * after our check we don't worry because our checkpoint.redo points
+		 * before log record for upcoming changes and so we are not required
+		 * to write such dirty buffer.
+		 */
+		if (BUF_STATE_GET_REFCOUNT(buf_state) != 0 ||
+			BUF_STATE_GET_USAGECOUNT(buf_state) != 0)
+		{
+			/* Don't write pinned or recently-used buffers */
+			return result;
+		}
+
 		result |= BUF_REUSABLE;
-	}
-	else if (skip_recently_used)
-	{
-		/* Caller told us not to write recently-used buffers */
-		UnlockBufHdr(bufHdr);
-		return result;
-	}
 
-	if (!(buf_state & BM_VALID) || !(buf_state & BM_DIRTY))
-	{
-		/* It's clean, so nothing to do */
-		UnlockBufHdr(bufHdr);
-		return result;
+		if (!(buf_state & BM_VALID) || !(buf_state & BM_DIRTY))
+		{
+			/* It's clean, so nothing to do */
+			return result;
+		}
+
+		if (unlikely(buf_state & BM_LOCKED))
+		{
+			old_buf_state = WaitBufHdrUnlocked(bufHdr);
+			continue;
+		}
+
+		/* Make sure we can handle the pin */
+		ReservePrivateRefCountEntry();
+		ResourceOwnerEnlarge(CurrentResourceOwner);
+
+		/* pin the buffer if the CAS succeeds */
+		buf_state += BUF_REFCOUNT_ONE;
+
+		if (pg_atomic_compare_exchange_u32(&bufHdr->state, &old_buf_state,
+										   buf_state))
+		{
+			TrackNewBufferPin(BufferDescriptorGetBuffer(bufHdr));
+			break;
+		}
 	}
 
 	/*
-	 * Pin it, share-lock it, write it.  (FlushBuffer will do nothing if the
-	 * buffer is clean by the time we've locked it.)
+	 * Share-lock and write it out.  (FlushBuffer will do nothing if the
+	 * buffer is clean by the time we've locked it.)
 	 */
-	PinBuffer_Locked(bufHdr);
-
 	FlushUnlockedBuffer(bufHdr, NULL, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
 
 	tag = bufHdr->tag;
@@ -4395,8 +4411,8 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
 	UnpinBuffer(bufHdr);
 
 	/*
-	 * SyncOneBuffer() is only called by checkpointer and bgwriter, so
-	 * IOContext will always be IOCONTEXT_NORMAL.
+	 * SyncOneBuffer() is only called by bgwriter, so IOContext will always be
+	 * IOCONTEXT_NORMAL.
 	 */
 	ScheduleBufferTagForWriteback(wb_context, IOCONTEXT_NORMAL, &tag);
 
-- 
2.43.0

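For anyone reviewing the CAS change in isolation: the pattern is "read the
state word, decide from the observed bits whether to bail out, and only
then try to install the pinned state with a compare-and-swap". A minimal,
self-contained illustration using C11 atomics rather than the pg_atomic_*
API -- the bit layout and the helper name are invented for the example:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define ST_LOCKED	(1u << 31)	/* stand-in for BM_LOCKED */
#define ST_DIRTY	(1u << 30)	/* stand-in for BM_DIRTY */
#define REFCOUNT_ONE 1u			/* refcount lives in the low bits here */

static bool
try_pin_if_clean_candidate(_Atomic uint32_t *state)
{
	uint32_t	old_state = atomic_load(state);

	for (;;)
	{
		uint32_t	new_state = old_state;

		/* skip buffers that are pinned (nonzero refcount) */
		if ((old_state & 0xffff) != 0)
			return false;

		/* nothing to do if the buffer isn't dirty */
		if (!(old_state & ST_DIRTY))
			return false;

		/* someone holds the header lock: reread the state and retry */
		if (old_state & ST_LOCKED)
		{
			old_state = atomic_load(state);
			continue;
		}

		new_state += REFCOUNT_ONE;	/* take one pin */

		/* on failure, old_state is refreshed and we go around again */
		if (atomic_compare_exchange_weak(state, &old_state, new_state))
			return true;
	}
}

int
main(void)
{
	_Atomic uint32_t state = ST_DIRTY;

	printf("pinned: %d\n", try_pin_if_clean_candidate(&state));
	/* second attempt fails: the refcount is now nonzero */
	printf("pinned again: %d\n", try_pin_if_clean_candidate(&state));
	return 0;
}

In the real SyncOneBuffer(), the early returns (pinned, recently used, or
clean) happen before any atomic read-modify-write, so the common path never
takes the buffer header spinlock at all.
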
From a41017e3d44ffe0aca3562a6ff729b2ea8ea9220 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <[email protected]>
Date: Mon, 12 Jan 2026 11:49:41 -0500
Subject: [PATCH v13 8/8] Eagerly flush buffer successors

When flushing a dirty victim buffer, check whether the two blocks
following it are present in shared buffers and dirty. If they are, flush
them together with the victim buffer.

Author: Melanie Plageman <[email protected]>
Reviewed-by: Chao Li <[email protected]>
Discussion: https://postgr.es/m/flat/0198DBB9-4A76-49E4-87F8-43D46DD0FD76%40gmail.com#1d8677fc75dc8b39f0eb5dd6bbafb65d
---
 src/backend/storage/buffer/bufmgr.c | 412 ++++++++++++++++------------
 1 file changed, 238 insertions(+), 174 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index dd75deb13c1..c185928cb39 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -637,16 +637,16 @@ static BlockNumber WriteBatchInit(BufferDesc *batch_start, uint32 max_batch_size
 static BufferDesc *PrepareOrRejectEagerFlushBuffer(BufferAccessStrategy strategy,
 												   Buffer bufnum,
 												   BufferTag *require,
+												   LWLock *buftable_lock,
 												   XLogRecPtr *lsn);
+static void FindFlushAdjacents(BufferAccessStrategy strategy, BufferDesc *batch_start,
+							   uint32 batch_limit,
+							   BufferWriteBatch *batch);
 static void FindStrategyFlushAdjacents(BufferAccessStrategy strategy, Buffer sweep_end,
 									   BufferDesc *batch_start,
 									   uint32 max_batch_size,
 									   BufferWriteBatch *batch,
 									   int *sweep_cursor);
-static bool PrepareFlushBuffer(BufferDesc *bufdesc, XLogRecPtr *lsn);
-static void DoFlushBuffer(BufferDesc *buf, SMgrRelation reln,
-						  IOObject io_object, IOContext io_context,
-						  XLogRecPtr buffer_lsn);
 static void FindAndDropRelationBuffers(RelFileLocator rlocator,
 									   ForkNumber forkNum,
 									   BlockNumber nForkBlock,
@@ -2494,7 +2494,6 @@ again:
 	if (buf_state & BM_DIRTY)
 	{
 		LWLock	   *content_lock;
-		XLogRecPtr	victim_lsn;
 
 		Assert(buf_state & BM_TAG_VALID);
 		Assert(buf_state & BM_VALID);
@@ -2538,7 +2537,7 @@ again:
 			goto again;
 		}
 
-		if (!PrepareFlushBuffer(buf_hdr, &victim_lsn))
+		if (!StartBufferIO(buf_hdr, false, false))
 		{
 			/* May be nothing to do if buffer was cleaned */
 			LWLockRelease(BufferDescriptorGetContentLock(buf_hdr));
@@ -2571,6 +2570,9 @@ again:
 					BufferWriteBatch batch;
 					BlockNumber limit;
 
+					/* So we can detect block content changes while flushing */
+					pg_atomic_fetch_and_u32(&next_bufdesc->state, ~BM_JUST_DIRTIED);
+
 					/*
 					 * After finding an eligible buffer, if we are allowed
 					 * more pins and there are more blocks in the relation,
@@ -2615,15 +2617,34 @@ again:
 				next_bufdesc = PrepareOrRejectEagerFlushBuffer(strategy,
 															   next_buf,
 															   NULL,
+															   NULL,
 															   &next_buf_lsn);
 			}
 		}
 		else
 		{
-			DoFlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context, victim_lsn);
-			LWLockRelease(BufferDescriptorGetContentLock(buf_hdr));
-			ScheduleBufferTagForWriteback(&BackendWritebackContext, io_context,
-										  &buf_hdr->tag);
+			BufferWriteBatch batch;
+			BlockNumber limit;
+			uint32		max_batch_size = 3; /* we only look for two successors */
+
+			/* Pin victim again so it stays ours even after batch released */
+			ReservePrivateRefCountEntry();
+			ResourceOwnerEnlarge(CurrentResourceOwner);
+			IncrBufferRefCount(BufferDescriptorGetBuffer(buf_hdr));
+
+			/* So we can detect block content changes while flushing */
+			pg_atomic_fetch_and_u32(&buf_hdr->state, ~BM_JUST_DIRTIED);
+
+			/*
+			 * If we are allowed more pins and there are more blocks in the
+			 * relation and the victim buffer's block's successors are
+			 * eligible for eager flushing, combine them into a batch.
+			 */
+			limit = WriteBatchInit(buf_hdr, max_batch_size, &batch);
+			if (limit > 1)
+				FindFlushAdjacents(strategy, buf_hdr, limit, &batch);
+			FlushBufferBatch(&batch, io_context);
+			CompleteWriteBatchIO(&batch, io_context, &BackendWritebackContext);
 		}
 	}
 
@@ -3547,8 +3568,8 @@ BufferNeedsWALFlush(BufferDesc *bufdesc, bool exclusive_locked)
 	XLogRecPtr	lsn;
 
 	/*
-	 * Unlogged buffers can't need WAL flush. See PrepareFlushBuffer() for
-	 * more details on unlogged relations with LSNs.
+	 * Unlogged buffers can't need WAL flush. See FlushBuffer() for more
+	 * details on unlogged relations with LSNs.
 	 */
 	if (!(buf_state & BM_PERMANENT))
 		return false;
@@ -4724,10 +4745,133 @@ static void
 FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 			IOContext io_context)
 {
-	XLogRecPtr	lsn;
+	XLogRecPtr	recptr;
+	ErrorContextCallback errcallback;
+	instr_time	io_start;
+	Block		bufBlock;
+	char	   *bufToWrite;
+	uint32		buf_state;
+
+	/*
+	 * Try to start an I/O operation.  If StartBufferIO returns false, then
+	 * someone else flushed the buffer before we could, so we need not do
+	 * anything.
+	 */
+	if (!StartBufferIO(buf, false, false))
+		return;
+
+	/* Setup error traceback support for ereport() */
+	errcallback.callback = shared_buffer_write_error_callback;
+	errcallback.arg = buf;
+	errcallback.previous = error_context_stack;
+	error_context_stack = &errcallback;
+
+	/* Find smgr relation for buffer */
+	if (reln == NULL)
+		reln = smgropen(BufTagGetRelFileLocator(&buf->tag), INVALID_PROC_NUMBER);
+
+	TRACE_POSTGRESQL_BUFFER_FLUSH_START(BufTagGetForkNum(&buf->tag),
+										buf->tag.blockNum,
+										reln->smgr_rlocator.locator.spcOid,
+										reln->smgr_rlocator.locator.dbOid,
+										reln->smgr_rlocator.locator.relNumber);
+
+	buf_state = LockBufHdr(buf);
+
+	/*
+	 * Run PageGetLSN while holding header lock, since we don't have the
+	 * buffer locked exclusively in all cases.
+	 */
+	recptr = BufferGetLSN(buf);
+
+	/* To check if block content changes while flushing. - vadim 01/17/97 */
+	UnlockBufHdrExt(buf, buf_state,
+					0, BM_JUST_DIRTIED,
+					0);
+
+	/*
+	 * Force XLOG flush up to buffer's LSN.  This implements the basic WAL
+	 * rule that log updates must hit disk before any of the data-file changes
+	 * they describe do.
+	 *
+	 * However, this rule does not apply to unlogged relations, which will be
+	 * lost after a crash anyway.  Most unlogged relation pages do not bear
+	 * LSNs since we never emit WAL records for them, and therefore flushing
+	 * up through the buffer LSN would be useless, but harmless.  However,
+	 * GiST indexes use LSNs internally to track page-splits, and therefore
+	 * unlogged GiST pages bear "fake" LSNs generated by
+	 * GetFakeLSNForUnloggedRel.  It is unlikely but possible that the fake
+	 * LSN counter could advance past the WAL insertion point; and if it did
+	 * happen, attempting to flush WAL through that location would fail, with
+	 * disastrous system-wide consequences.  To make sure that can't happen,
+	 * skip the flush if the buffer isn't permanent.
+	 */
+	if (buf_state & BM_PERMANENT)
+		XLogFlush(recptr);
+
+	/*
+	 * Now it's safe to write the buffer to disk. Note that no one else should
+	 * have been able to write it, while we were busy with log flushing,
+	 * because we got the exclusive right to perform I/O by setting the
+	 * BM_IO_IN_PROGRESS bit.
+	 */
+	bufBlock = BufHdrGetBlock(buf);
+
+	/*
+	 * Update page checksum if desired.  Since we have only shared lock on the
+	 * buffer, other processes might be updating hint bits in it, so we must
+	 * copy the page to private storage if we do checksumming.
+	 */
+	bufToWrite = PageSetChecksumCopy((Page) bufBlock, buf->tag.blockNum);
+
+	io_start = pgstat_prepare_io_time(track_io_timing);
+
+	/*
+	 * bufToWrite is either the shared buffer or a copy, as appropriate.
+	 */
+	smgrwrite(reln,
+			  BufTagGetForkNum(&buf->tag),
+			  buf->tag.blockNum,
+			  bufToWrite,
+			  false);
+
+	/*
+	 * When a strategy is in use, only flushes of dirty buffers already in the
+	 * strategy ring are counted as strategy writes (IOCONTEXT
+	 * [BULKREAD|BULKWRITE|VACUUM] IOOP_WRITE) for the purpose of IO
+	 * statistics tracking.
+	 *
+	 * If a shared buffer initially added to the ring must be flushed before
+	 * being used, this is counted as an IOCONTEXT_NORMAL IOOP_WRITE.
+	 *
+	 * If a shared buffer which was added to the ring later because the
+	 * current strategy buffer is pinned or in use or because all strategy
+	 * buffers were dirty and rejected (for BAS_BULKREAD operations only)
+	 * requires flushing, this is counted as an IOCONTEXT_NORMAL IOOP_WRITE
+	 * (from_ring will be false).
+	 *
+	 * When a strategy is not in use, the write can only be a "regular" write
+	 * of a dirty shared buffer (IOCONTEXT_NORMAL IOOP_WRITE).
+	 */
+	pgstat_count_io_op_time(IOOBJECT_RELATION, io_context,
+							IOOP_WRITE, io_start, 1, BLCKSZ);
+
+	pgBufferUsage.shared_blks_written++;
+
+	/*
+	 * Mark the buffer as clean (unless BM_JUST_DIRTIED has become set) and
+	 * end the BM_IO_IN_PROGRESS state.
+	 */
+	TerminateBufferIO(buf, true, 0, true, false);
+
+	TRACE_POSTGRESQL_BUFFER_FLUSH_DONE(BufTagGetForkNum(&buf->tag),
+									   buf->tag.blockNum,
+									   reln->smgr_rlocator.locator.spcOid,
+									   reln->smgr_rlocator.locator.dbOid,
+									   reln->smgr_rlocator.locator.relNumber);
 
-	if (PrepareFlushBuffer(buf, &lsn))
-		DoFlushBuffer(buf, reln, io_object, io_context, lsn);
+	/* Pop the error context stack */
+	error_context_stack = errcallback.previous;
 }
 
 /*
@@ -4776,12 +4920,17 @@ WriteBatchInit(BufferDesc *batch_start, uint32 max_batch_size,
  * accept it, they will provide the required block number and its
  * RelFileLocator and fork.
  *
- * If returning a buffer, also return its LSN.
+ * If the caller passes buftable_lock, it is released once the buffer is
+ * pinned, or before returning if the buffer is rejected.
+ *
+ * If a buffer is returned, *lsn is set to its LSN so the caller can track
+ * the maximum LSN of the batch.
  */
 static BufferDesc *
 PrepareOrRejectEagerFlushBuffer(BufferAccessStrategy strategy,
 								Buffer bufnum,
 								BufferTag *require,
+								LWLock *buftable_lock,
 								XLogRecPtr *lsn)
 {
 	BufferDesc *bufdesc;
@@ -4843,6 +4992,12 @@ PrepareOrRejectEagerFlushBuffer(BufferAccessStrategy strategy,
 	if (!PinBuffer(bufdesc, strategy, /* skip_if_not_valid */ true))
 		goto reject_buffer;
 
+	if (buftable_lock)
+	{
+		LWLockRelease(buftable_lock);
+		buftable_lock = NULL;
+	}
+
 	CheckBufferIsPinnedOnce(bufnum);
 
 	/* Now that we have the buffer pinned, recheck it's got the right block */
@@ -4875,6 +5030,8 @@ reject_buffer_unpin:
 	UnpinBuffer(bufdesc);
 
 reject_buffer:
+	if (buftable_lock)
+		LWLockRelease(buftable_lock);
 	return NULL;
 }
 
@@ -4883,15 +5040,17 @@ reject_buffer:
  * flushing, find additional buffers from the ring that can be combined into a
  * single write batch with the starting buffer.
  *
- * This function will pin and content lock all of the buffers that it
- * assembles for the IO batch. The caller is responsible for issuing the IO.
- *
- * batch_limit is the largest batch we are allowed to construct given the
- * remaining blocks in the table, the number of available pins, and the
- * current configuration.
+ * max_batch_size is the maximum number of blocks that can be combined into a
+ * single write in general. This function, based on the block number of start,
+ * will determine the maximum IO size for this particular write given how much
+ * of the file remains. max_batch_size is provided by the caller so it doesn't
+ * have to be recalculated for each write.
  *
  * batch is an output parameter that this function will fill with the needed
  * information to issue this IO.
+ *
+ * This function will pin and content lock all of the buffers that it
+ * assembles for the IO batch. The caller is responsible for issuing the IO.
  */
 static void
 FindStrategyFlushAdjacents(BufferAccessStrategy strategy,
@@ -4930,6 +5089,7 @@ FindStrategyFlushAdjacents(BufferAccessStrategy strategy,
 		batch->bufdescs[batch->n] =
 			PrepareOrRejectEagerFlushBuffer(strategy, bufnum,
 											&require,
+											NULL,
 											&lsn);
 
 		/*
@@ -4945,7 +5105,63 @@ FindStrategyFlushAdjacents(BufferAccessStrategy strategy,
 		/* Stop when we encounter a buffer that will break the run */
 		if (batch->bufdescs[batch->n] == NULL)
 			break;
+	}
+}
+
+
+/*
+ * Check whether the blocks following batch_start's block are in shared
+ * buffers and dirty; if so, add them to the batch to be written out too.
+ */
+static void
+FindFlushAdjacents(BufferAccessStrategy strategy, BufferDesc *batch_start,
+				   uint32 batch_limit,
+				   BufferWriteBatch *batch)
+{
+	BufferTag	require;		/* identity of the requested block */
+	uint32		newHash;		/* hash value for require */
+	LWLock	   *newPartitionLock;	/* buffer partition lock for it */
+	int			buf_id;
+	XLogRecPtr	lsn;
+
+	/* create a tag so we can lookup the buffers */
+	InitBufferTag(&require, &batch->reln->smgr_rlocator.locator,
+				  batch->forkno, InvalidBlockNumber);
+
+	for (; batch->n < batch_limit; batch->n++)
+	{
+		lsn = InvalidXLogRecPtr;
+
+		require.blockNum = batch->start + batch->n;
+
+		Assert(BlockNumberIsValid(require.blockNum));
+
+		/* determine its hash code and partition lock ID */
+		newHash = BufTableHashCode(&require);
+		newPartitionLock = BufMappingPartitionLock(newHash);
+
+		/* see if the block is in the buffer pool already */
+		LWLockAcquire(newPartitionLock, LW_SHARED);
+		buf_id = BufTableLookup(&require, newHash);
+
+		/* The block may not even be in shared buffers. */
+		if (buf_id < 0)
+		{
+			LWLockRelease(newPartitionLock);
+			break;
+		}
 
+		batch->bufdescs[batch->n] =
+			PrepareOrRejectEagerFlushBuffer(strategy,
+											buf_id + 1,
+											&require,
+											newPartitionLock,
+											&lsn);
+		if (lsn > batch->max_lsn)
+			batch->max_lsn = lsn;
+
+		if (batch->bufdescs[batch->n] == NULL)
+			break;
 	}
 }
 
@@ -5009,158 +5225,6 @@ FlushBufferBatch(BufferWriteBatch *batch,
 	error_context_stack = errcallback.previous;
 }
 
-/*
- * Prepare the buffer with bufdesc for writing. Returns true if the buffer
- * actually needs writing and false otherwise. lsn returns the buffer's LSN if
- * the table is logged and still needs flushing.
- */
-static bool
-PrepareFlushBuffer(BufferDesc *bufdesc, XLogRecPtr *lsn)
-{
-	uint32		buf_state;
-
-	*lsn = InvalidXLogRecPtr;
-
-	/*
-	 * Try to start an I/O operation.  If StartBufferIO returns false, then
-	 * someone else flushed the buffer before we could, so we need not do
-	 * anything.
-	 */
-	if (!StartBufferIO(bufdesc, false, false))
-		return false;
-
-	buf_state = LockBufHdr(bufdesc);
-
-	/*
-	 * Record the buffer's LSN. We will force XLOG flush up to buffer's LSN.
-	 * This implements the basic WAL rule that log updates must hit disk
-	 * before any of the data-file changes they describe do.
-	 *
-	 * However, this rule does not apply to unlogged relations, which will be
-	 * lost after a crash anyway.  Most unlogged relation pages do not bear
-	 * LSNs since we never emit WAL records for them, and therefore flushing
-	 * up through the buffer LSN would be useless, but harmless.  However,
-	 * GiST indexes use LSNs internally to track page-splits, and therefore
-	 * unlogged GiST pages bear "fake" LSNs generated by
-	 * GetFakeLSNForUnloggedRel.  It is unlikely but possible that the fake
-	 * LSN counter could advance past the WAL insertion point; and if it did
-	 * happen, attempting to flush WAL through that location would fail, with
-	 * disastrous system-wide consequences.  To make sure that can't happen,
-	 * skip the flush if the buffer isn't permanent.
-	 *
-	 * We must hold the buffer header lock when examining the page LSN since
-	 * we don't have buffer exclusively locked in all cases.
-	 */
-	if (buf_state & BM_PERMANENT)
-		*lsn = BufferGetLSN(bufdesc);
-
-	/* To check if block content changes while flushing. - vadim 01/17/97 */
-	UnlockBufHdrExt(bufdesc, buf_state,
-					0, BM_JUST_DIRTIED,
-					0);
-	return true;
-}
-
-/*
- * Actually do the write I/O to clean a buffer. buf and reln may be modified.
- */
-static void
-DoFlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
-			  IOContext io_context, XLogRecPtr buffer_lsn)
-{
-	ErrorContextCallback errcallback;
-	instr_time	io_start;
-	Block		bufBlock;
-	char	   *bufToWrite;
-
-	/* Setup error traceback support for ereport() */
-	errcallback.callback = shared_buffer_write_error_callback;
-	errcallback.arg = buf;
-	errcallback.previous = error_context_stack;
-	error_context_stack = &errcallback;
-
-	/* Find smgr relation for buffer */
-	if (reln == NULL)
-		reln = smgropen(BufTagGetRelFileLocator(&buf->tag), INVALID_PROC_NUMBER);
-
-	TRACE_POSTGRESQL_BUFFER_FLUSH_START(BufTagGetForkNum(&buf->tag),
-										buf->tag.blockNum,
-										reln->smgr_rlocator.locator.spcOid,
-										reln->smgr_rlocator.locator.dbOid,
-										reln->smgr_rlocator.locator.relNumber);
-
-	/* Force XLOG flush up to buffer's LSN */
-	if (XLogRecPtrIsValid(buffer_lsn))
-	{
-		Assert(pg_atomic_read_u32(&buf->state) & BM_PERMANENT);
-		XLogFlush(buffer_lsn);
-	}
-
-	/*
-	 * Now it's safe to write the buffer to disk. Note that no one else should
-	 * have been able to write it, while we were busy with log flushing,
-	 * because we got the exclusive right to perform I/O by setting the
-	 * BM_IO_IN_PROGRESS bit.
-	 */
-	bufBlock = BufHdrGetBlock(buf);
-
-	/*
-	 * Update page checksum if desired.  Since we have only shared lock on the
-	 * buffer, other processes might be updating hint bits in it, so we must
-	 * copy the page to private storage if we do checksumming.
-	 */
-	bufToWrite = PageSetChecksumCopy((Page) bufBlock, buf->tag.blockNum);
-
-	io_start = pgstat_prepare_io_time(track_io_timing);
-
-	/*
-	 * bufToWrite is either the shared buffer or a copy, as appropriate.
-	 */
-	smgrwrite(reln,
-			  BufTagGetForkNum(&buf->tag),
-			  buf->tag.blockNum,
-			  bufToWrite,
-			  false);
-
-	/*
-	 * When a strategy is in use, only flushes of dirty buffers already in the
-	 * strategy ring are counted as strategy writes (IOCONTEXT
-	 * [BULKREAD|BULKWRITE|VACUUM] IOOP_WRITE) for the purpose of IO
-	 * statistics tracking.
-	 *
-	 * If a shared buffer initially added to the ring must be flushed before
-	 * being used, this is counted as an IOCONTEXT_NORMAL IOOP_WRITE.
-	 *
-	 * If a shared buffer which was added to the ring later because the
-	 * current strategy buffer is pinned or in use or because all strategy
-	 * buffers were dirty and rejected (for BAS_BULKREAD operations only)
-	 * requires flushing, this is counted as an IOCONTEXT_NORMAL IOOP_WRITE
-	 * (from_ring will be false).
-	 *
-	 * When a strategy is not in use, the write can only be a "regular" write
-	 * of a dirty shared buffer (IOCONTEXT_NORMAL IOOP_WRITE).
-	 */
-	pgstat_count_io_op_time(IOOBJECT_RELATION, io_context,
-							IOOP_WRITE, io_start, 1, BLCKSZ);
-
-	pgBufferUsage.shared_blks_written++;
-
-	/*
-	 * Mark the buffer as clean (unless BM_JUST_DIRTIED has become set) and
-	 * end the BM_IO_IN_PROGRESS state.
-	 */
-	TerminateBufferIO(buf, true, 0, true, false);
-
-	TRACE_POSTGRESQL_BUFFER_FLUSH_DONE(BufTagGetForkNum(&buf->tag),
-									   buf->tag.blockNum,
-									   reln->smgr_rlocator.locator.spcOid,
-									   reln->smgr_rlocator.locator.dbOid,
-									   reln->smgr_rlocator.locator.relNumber);
-
-	/* Pop the error context stack */
-	error_context_stack = errcallback.previous;
-}
-
 /*
  * Convenience wrapper around FlushBuffer() that locks/unlocks the buffer
  * before/after calling FlushBuffer().
-- 
2.43.0

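One more note on the ordering both batch paths rely on (the checkpointer
loop tracks batch.max_lsn, and the backend path does the same through
FlushBufferBatch()): flush WAL once, up to the highest page LSN among the
logged buffers in the batch, before any of the data blocks are written.
A standalone sketch of that rule -- the types and the wal_flush() /
write_block() helpers are hypothetical stand-ins, not the patch's API:

#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr 0

typedef struct FakeBuffer
{
	unsigned int blockNum;
	XLogRecPtr	page_lsn;
	int			permanent;		/* 0 for unlogged relations */
} FakeBuffer;

static void
wal_flush(XLogRecPtr upto)
{
	printf("flush WAL through %llu\n", (unsigned long long) upto);
}

static void
write_block(const FakeBuffer *buf)
{
	printf("write block %u\n", buf->blockNum);
}

int
main(void)
{
	FakeBuffer	batch[] = {
		{10, 500, 1},
		{11, 900, 1},
		{12, 0, 0},				/* unlogged: its LSN is ignored */
	};
	int			n = sizeof(batch) / sizeof(batch[0]);
	XLogRecPtr	max_lsn = InvalidXLogRecPtr;
	int			i;

	/* only permanent (logged) buffers contribute to the WAL flush point */
	for (i = 0; i < n; i++)
		if (batch[i].permanent && batch[i].page_lsn > max_lsn)
			max_lsn = batch[i].page_lsn;

	if (max_lsn != InvalidXLogRecPtr)
		wal_flush(max_lsn);

	/* now it is safe to write the data blocks */
	for (i = 0; i < n; i++)
		write_block(&batch[i]);

	return 0;
}

Skipping unlogged buffers when computing the flush point is the same
reasoning as the existing BM_PERMANENT check in FlushBuffer(): fake LSNs on
unlogged GiST pages must never be used as an XLogFlush() target.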