Hi all,

Thank you all for the patches.
I am keeping this as a single patch because the refactoring, the
batching behavior, and the instrumentation are tightly coupled and
serve a single purpose: reducing checkpoint writeback overhead while
making the effect observable. Due to version and context differences,
the patches did not apply cleanly in my development environment.
Instead, I studied them, went through the logic in detail, and
implemented the same ideas directly in my current tree, adapting them
where needed. The implementation was then validated with
instrumentation and measurements.

Before batching:
2026-01-22 17:27:26.969 IST [148738] LOG:  checkpoint complete: wrote
15419 buffers (94.1%), wrote 1 SLRU buffers; 0 WAL file(s) added, 0
removed, 25 recycled; write=0.325 s, sync=0.284 s, total=0.754 s; sync
files=30, longest=0.227 s, average=0.010 s; distance=407573 kB,
estimate=407573 kB; lsn=0/1A5B8E30, redo lsn=0/1A5B8DD8

After batching:
2026-01-22 17:31:36.165 IST [148738] LOG:  checkpoint complete: wrote
13537 buffers (82.6%), wrote 1 SLRU buffers; 0 WAL file(s) added, 0
removed, 25 recycled; write=0.260 s, sync=0.211 s, total=0.625 s; sync
files=3, longest=0.205 s, average=0.070 s; distance=404310 kB,
estimate=407247 kB; lsn=0/3308E738, redo lsn=0/3308E6E0

Debug instrumentation (batch size = 16) confirms the batching
behavior:
buffers_written = 6196
writeback_calls = 389
On average that is 15.9, i.e. approximately 16 buffers per writeback
call. This shows that writebacks are issued per batch rather than per
buffer, while WAL ordering and durability semantics remain unchanged.
The change is localized to BufferSync() and is intended as a
conservative, measurable improvement to checkpoint I/O behavior. I am
attaching the patch for review.
I am happy to adjust the approach if there are concerns or
suggestions. Looking forward to more feedback.

Regards,
Soumya
From 99354adda53d07d28940810b429648a855eeaf12 Mon Sep 17 00:00:00 2001
From: Soumya <[email protected]>
Date: Fri, 23 Jan 2026 12:24:11 +0530
Subject: [PATCH] Batch buffer writebacks during checkpoints

Signed-off-by: Soumya <[email protected]>
---
 src/backend/storage/buffer/bufmgr.c | 77 ++++++++++++++++++++++++++---
 1 file changed, 71 insertions(+), 6 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 6f935648ae9..c92d638f804 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3467,6 +3467,15 @@ BufferSync(int flags)
 	int			i;
 	uint64		mask = BM_DIRTY;
 	WritebackContext wb_context;
+	uint64		ckpt_write_count = 0;
+	uint64		ckpt_issue_writeback_calls = 0;
+
+	/* --- checkpoint write batching --- */
+#define CHECKPOINT_WRITE_BATCH 16
+
+	int			batch_bufs[CHECKPOINT_WRITE_BATCH];
+	int			batch_count = 0;
+	/* --- end checkpoint write batching --- */
 
 	/*
 	 * Unless this is a shutdown checkpoint or we have been explicitly told,
@@ -3666,12 +3675,37 @@ BufferSync(int flags)
 		 */
 		if (pg_atomic_read_u64(&bufHdr->state) & BM_CHECKPOINT_NEEDED)
 		{
-			if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
-			{
-				TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
-				PendingCheckpointerStats.buffers_written++;
-				num_written++;
-			}
+			/*
+			 * Collect the buffer into a small local batch; the actual
+			 * writes are issued once the batch fills up.
+			 */
+			batch_bufs[batch_count++] = buf_id;
+
+			/*
+			 * Flush the batch once it is full.
+			 */
+			if (batch_count == CHECKPOINT_WRITE_BATCH)
+			{
+				int			j;
+
+				for (j = 0; j < batch_count; j++)
+				{
+					if (SyncOneBuffer(batch_bufs[j], false, &wb_context) & BUF_WRITTEN)
+					{
+						TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(batch_bufs[j]);
+						PendingCheckpointerStats.buffers_written++;
+						num_written++;
+						ckpt_write_count++;
+					}
+				}
+
+				/*
+				 * Issue writeback for this batch to amortize syscall cost.
+				 * This does NOT change durability semantics.
+				 */
+				ckpt_issue_writeback_calls++;
+				IssuePendingWritebacks(&wb_context, IOCONTEXT_NORMAL);
+				batch_count = 0;
+			}
 		}
 
 		/*
@@ -3701,6 +3735,31 @@ BufferSync(int flags)
 		CheckpointWriteDelay(flags, (double) num_processed / num_to_scan);
 	}
 
+	/*
+	 * Flush any remaining buffers in the batch.
+	 */
+	if (batch_count > 0)
+	{
+		int			j;
+
+		for (j = 0; j < batch_count; j++)
+		{
+			if (SyncOneBuffer(batch_bufs[j], false, &wb_context) & BUF_WRITTEN)
+			{
+				TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(batch_bufs[j]);
+				PendingCheckpointerStats.buffers_written++;
+				num_written++;
+				ckpt_write_count++;
+			}
+		}
+		ckpt_issue_writeback_calls++;
+		IssuePendingWritebacks(&wb_context, IOCONTEXT_NORMAL);
+		batch_count = 0;
+	}
+
+	/* Count the pre-existing final IssuePendingWritebacks() call below. */
+	ckpt_issue_writeback_calls++;
+
 	/*
 	 * Issue all pending flushes. Only checkpointer calls BufferSync(), so
 	 * IOContext will always be IOCONTEXT_NORMAL.
@@ -3717,6 +3776,12 @@ BufferSync(int flags)
 	 */
 	CheckpointStats.ckpt_bufs_written += num_written;
 
+	ereport(DEBUG1,
+			(errmsg("checkpoint BufferSync stats: buffers_written=" UINT64_FORMAT
+					", writeback_calls=" UINT64_FORMAT,
+					ckpt_write_count,
+					ckpt_issue_writeback_calls)));
+
 	TRACE_POSTGRESQL_BUFFER_SYNC_DONE(NBuffers, num_written, num_to_scan);
 }
 
-- 
2.34.1
