Hi all,

Thank you all for the patches.

I am keeping this as a single patch because the refactoring, the batching behavior, and the instrumentation are tightly coupled and serve one purpose: reducing checkpoint writeback overhead while keeping the effect observable.

Due to version and context differences, the patches did not apply cleanly in my development environment. Instead, I studied them, went through the logic in detail, and implemented the same ideas directly in my current tree, adapting them where needed. The implementation was then validated with instrumentation and measurements.
Before batching:

2026-01-22 17:27:26.969 IST [148738] LOG: checkpoint complete: wrote 15419 buffers (94.1%), wrote 1 SLRU buffers; 0 WAL file(s) added, 0 removed, 25 recycled; write=0.325 s, sync=0.284 s, total=0.754 s; sync files=30, longest=0.227 s, average=0.010 s; distance=407573 kB, estimate=407573 kB; lsn=0/1A5B8E30, redo lsn=0/1A5B8DD8

After batching:

2026-01-22 17:31:36.165 IST [148738] LOG: checkpoint complete: wrote 13537 buffers (82.6%), wrote 1 SLRU buffers; 0 WAL file(s) added, 0 removed, 25 recycled; write=0.260 s, sync=0.211 s, total=0.625 s; sync files=3, longest=0.205 s, average=0.070 s; distance=404310 kB, estimate=407247 kB; lsn=0/3308E738, redo lsn=0/3308E6E0

Debug instrumentation (with batch size = 16) confirms the batching behavior itself:

    buffers_written = 6196
    writeback_calls = 389

That is about 15.9, i.e. roughly 16 buffers per writeback call on average. It shows that writebacks are issued per batch rather than per buffer, while WAL ordering and durability semantics remain unchanged. The change stays localized to BufferSync() and is intended as a conservative, measurable improvement to checkpoint I/O behavior.

I am attaching the patch for review, and I am happy to adjust the approach if there are concerns or suggestions. Looking forward to more feedback.

Regards,
Soumya
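P.S. As a quick cross-check of the buffers-per-writeback figure above, here is a small standalone model. It is an illustration only: it is not part of the patch, uses no PostgreSQL APIs, and assumes that every collected buffer actually gets written and that the counters above come from a single checkpoint. It mimics the same collect/flush shape as the patch (one writeback per full batch, one for the partial tail, plus the one final flush that BufferSync() already issues) and, with 6196 buffers and a batch size of 16, arrives at 389 writeback calls, i.e. about 15.9 buffers per call.

/*
 * Standalone illustration only -- NOT part of the patch and not using any
 * PostgreSQL APIs.  Buffers are "written" per full batch, the partial tail
 * is flushed once, and one extra call is counted for the final flush that
 * BufferSync() always issues after the scan.
 */
#include <assert.h>
#include <stdio.h>

#define BATCH_SIZE 16
#define NBUFFERS   6196				/* buffers_written reported above */

int
main(void)
{
	static int	written[NBUFFERS];	/* how often each buffer got "written" */
	int			batch[BATCH_SIZE];
	int			batch_count = 0;
	int			writeback_calls = 0;

	for (int buf_id = 0; buf_id < NBUFFERS; buf_id++)
	{
		batch[batch_count++] = buf_id;	/* collect, as in the patch */

		if (batch_count == BATCH_SIZE)	/* flush a full batch */
		{
			for (int j = 0; j < batch_count; j++)
				written[batch[j]]++;	/* stands in for SyncOneBuffer() */
			writeback_calls++;			/* stands in for IssuePendingWritebacks() */
			batch_count = 0;
		}
	}

	if (batch_count > 0)				/* tail flush for the partial batch */
	{
		for (int j = 0; j < batch_count; j++)
			written[batch[j]]++;
		writeback_calls++;
		batch_count = 0;
	}

	writeback_calls++;					/* final flush BufferSync() always issues */

	/* Every buffer must have been written exactly once. */
	for (int buf_id = 0; buf_id < NBUFFERS; buf_id++)
		assert(written[buf_id] == 1);

	printf("%d buffers, %d writeback calls, %.1f buffers/call\n",
		   NBUFFERS, writeback_calls, (double) NBUFFERS / writeback_calls);
	return 0;
}

The assertion also checks the property the tail flush exists for: every collected buffer is written exactly once, whether or not the total is a multiple of the batch size.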
From 99354adda53d07d28940810b429648a855eeaf12 Mon Sep 17 00:00:00 2001
From: Soumya <[email protected]>
Date: Fri, 23 Jan 2026 12:24:11 +0530
Subject: [PATCH] Batch buffer writebacks during checkpoints

Signed-off-by: Soumya <[email protected]>
---
 src/backend/storage/buffer/bufmgr.c | 77 ++++++++++++++++++++++++++---
 1 file changed, 71 insertions(+), 6 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 6f935648ae9..c92d638f804 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -3467,6 +3467,15 @@ BufferSync(int flags)
 	int			i;
 	uint64		mask = BM_DIRTY;
 	WritebackContext wb_context;
+	uint64		ckpt_write_count = 0;
+	uint64		ckpt_issue_writeback_calls = 0;
+
+	/* --- checkpoint write batching --- */
+#define CHECKPOINT_WRITE_BATCH 16
+
+	int			batch_bufs[CHECKPOINT_WRITE_BATCH];
+	int			batch_count = 0;
+	/* --- checkpoint write batching --- */
 
 	/*
 	 * Unless this is a shutdown checkpoint or we have been explicitly told,
@@ -3666,12 +3675,37 @@
 		 */
 		if (pg_atomic_read_u64(&bufHdr->state) & BM_CHECKPOINT_NEEDED)
 		{
-			if (SyncOneBuffer(buf_id, false, &wb_context) & BUF_WRITTEN)
-			{
-				TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(buf_id);
-				PendingCheckpointerStats.buffers_written++;
-				num_written++;
-			}
+			/*
+			 * Collect the buffer into a small local batch; the batch is
+			 * written and its writeback issued once it fills up.
+			 */
+			batch_bufs[batch_count++] = buf_id;
+
+			/*
+			 * Flush the batch once it is full.
+			 */
+			if (batch_count == CHECKPOINT_WRITE_BATCH)
+			{
+				int			j;
+
+				for (j = 0; j < batch_count; j++)
+				{
+					if (SyncOneBuffer(batch_bufs[j], false, &wb_context) & BUF_WRITTEN)
+					{
+						TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(batch_bufs[j]);
+						PendingCheckpointerStats.buffers_written++;
+						num_written++;
+						ckpt_write_count++;
+					}
+				}
+				/*
+				 * Issue writeback for this batch to amortize syscall cost.
+				 * This does NOT change durability semantics.
+				 */
+				ckpt_issue_writeback_calls++;
+				IssuePendingWritebacks(&wb_context, IOCONTEXT_NORMAL);
+				batch_count = 0;
+			}
 		}
 
 		/*
@@ -3701,6 +3735,31 @@
 			CheckpointWriteDelay(flags, (double) num_processed / num_to_scan);
 	}
+	/*
+	 * Flush any remaining buffers in the batch.
+	 */
+	if (batch_count > 0)
+	{
+		int			j;
+
+		for (j = 0; j < batch_count; j++)
+		{
+			if (SyncOneBuffer(batch_bufs[j], false, &wb_context) & BUF_WRITTEN)
+			{
+				TRACE_POSTGRESQL_BUFFER_SYNC_WRITTEN(batch_bufs[j]);
+				PendingCheckpointerStats.buffers_written++;
+				num_written++;
+				ckpt_write_count++;
+			}
+		}
+		ckpt_issue_writeback_calls++;
+		IssuePendingWritebacks(&wb_context, IOCONTEXT_NORMAL);
+		batch_count = 0;
+	}
+
+	/* --- checkpoint instrumentation --- */
+	ckpt_issue_writeback_calls++;
+
 	/*
 	 * Issue all pending flushes. Only checkpointer calls BufferSync(), so
 	 * IOContext will always be IOCONTEXT_NORMAL.
 	 */
 	IssuePendingWritebacks(&wb_context, IOCONTEXT_NORMAL);
@@ -3717,6 +3776,12 @@
 	 */
 	CheckpointStats.ckpt_bufs_written += num_written;
 
+	ereport(DEBUG1,
+			(errmsg("checkpoint BufferSync stats: buffers_written=%llu, "
+					"writeback_calls=%llu",
+					(unsigned long long) ckpt_write_count,
+					(unsigned long long) ckpt_issue_writeback_calls)));
+
 	TRACE_POSTGRESQL_BUFFER_SYNC_DONE(NBuffers, num_written, num_to_scan);
 }
-- 
2.34.1
