> On Jan 14, 2026, at 06:24, Melanie Plageman <[email protected]> wrote:
> 
> Attached v12 fixes a variety of buglets I found throughout the patch set.
> 
> I've also done quite a bit of refactoring. The scope of the
> refactoring varies from inlining some helpers to introducing new input
> argument structs.
> 
> 0001 is independently valuable as it optimizes StrategyRejectBuffer()
> a bit and makes GetVictimBuffer() cleaner
> 
> 0002-0007 were largely present in older versions of the patch set
> 
> 0008 is new -- it is an early version of batching for normal backends
> flushing a buffer to obtain a clean one. Right now, it checks if the
> two blocks succeeding the target block are in shared buffers and
> dirty, and, if so, writes them out with the target buffer. I haven't
> started testing or benchmarking it because I need to convert bgwriter
> to use write combining to be able to benchmark it effectively. But I
> thought I would get the code out there sooner rather than later.
> 
> It's a lot harder with my current code structure to add the target
> block's predecessor if it is dirty and ready to write out. I wonder how
> important this is vs just the two succeeding blocks.
> 
> - Melanie
> <v12-0001-Streamline-buffer-rejection-for-bulkreads-of-unl.patch><v12-0002-Split-FlushBuffer-into-two-parts.patch><v12-0003-Eagerly-flush-bulkwrite-strategy-ring.patch><v12-0004-Write-combining-for-BAS_BULKWRITE.patch><v12-0005-Add-database-Oid-to-CkptSortItem.patch><v12-0006-Implement-checkpointer-data-write-combining.patch><v12-0007-Refactor-SyncOneBuffer-for-bgwriter-use-only.patch><v12-0008-Eagerly-flush-buffer-successors.patch>

Hi Melanie,

I went through the patch set again today, paying special attention to 0001 
and 0008, which I don't seem to have reviewed before. Here are my comments:

1 - 0001
```
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -15,6 +15,7 @@
 #ifndef BUFMGR_INTERNALS_H
 #define BUFMGR_INTERNALS_H
 
+#include "access/xlogdefs.h"
```

I tried to build without adding this include, and the build still passed. I 
think that's because there is already an include path: storage/bufmgr.h -> 
storage/bufpage.h -> access/xlogdefs.h.

So, maybe we can remove this include.
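(Presumably the include is only there for XLogRecPtr, which already comes in transitively:)
```
/* in storage/buf_internals.h */
#include "storage/bufmgr.h"     /* -> storage/bufpage.h -> access/xlogdefs.h,
                                 * which provides XLogRecPtr */
```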

2 - 0001
```
+ * The buffer must be pinned and content locked and the buffer header spinlock
+ * must not be held.
+ *
  * Returns true if buffer manager should ask for a new victim, and false
  * if this buffer should be written and re-used.
  */
 bool
 StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_ring)
 {
+       XLogRecPtr      lsn;
+
+       if (!strategy)
+               return false;
```

As the new comment states, the buffer must be pinned, so maybe we can add an 
assert to ensure that. BufferIsPinned() is private to bufmgr.c, so in 
freelist.c we would have to check the shared refcount directly:
```
Assert(BUF_STATE_GET_REFCOUNT(pg_atomic_read_u32(&buf->state)) > 0);
```

Similarly, maybe we can also assert that the content lock is held and that the 
buffer header spinlock is not:
```
Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(buf)));
Assert(!(pg_atomic_read_u32(&buf->state) & BM_LOCKED));
```

3 - 0001 - bufmgr.c - BufferNeedsWALFlush()
```
+       buffer = BufferDescriptorGetBuffer(bufdesc);
+       page = BufferGetPage(buffer);
+
+       Assert(BufferIsValid(buffer));
```

I think the Assert should be moved up, before "page = BufferGetPage(buffer);".
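That is:
```
+       buffer = BufferDescriptorGetBuffer(bufdesc);
+       Assert(BufferIsValid(buffer));
+       page = BufferGetPage(buffer);
```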

4 - 0001 - bufmgr.c
```
+/*
+ * Returns true if the buffer needs WAL flushed before it can be written out.
+ * *lsn is set to the current page LSN.
+ *
+ * If the result is required to be correct, the caller must hold a buffer
+ * content lock. If they only hold a shared content lock, we'll need to
+ * acquire the buffer header spinlock, so they must not already hold it.
+ *
+ * If the buffer is unlogged, *lsn shouldn't be used by the caller and is set
+ * to InvalidXLogRecPtr.
+ */
+bool
+BufferNeedsWALFlush(BufferDesc *bufdesc, bool exclusive, XLogRecPtr *lsn)
```

I think the “exclusive” parameter is a bit subtle. The parameter is not 
explicitly explained in the header comment, though there is a paragraph that 
explains the different behaviors depending on whether the caller holds an 
exclusive or only a shared content lock. Maybe we can rename it to a more 
direct name: hold_exclusive_content_lock, or a shorter one like 
holds_exclusive_lock.
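Alternatively, the header comment could call the parameter out directly, e.g. 
(just a sketch):
```
+ * "exclusive" indicates whether the caller holds the buffer's content lock
+ * in exclusive mode; with only a shared lock, we must take the buffer header
+ * spinlock to read the page LSN.
```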

5 - 0002 - commit message
```
actual buffer flushing step. This separation procides symmetry with
```

Typo: procides -> provides

6 - 0002
```
+        * We must hold the buffer header lock when examining the page LSN since
+        * don't have buffer exclusively locked in all cases.
```

Looks like it’s better to add “we” between “since” and “don’t” => since we 
don’t have ...

7 - 0002
```
+/*
+ * Prepare the buffer with bufdesc for writing. Returns true if the buffer
+ * actually needs writing and false otherwise. lsn returns the buffer's LSN if
+ * the table is logged.
+ */
+static bool
+PrepareFlushBuffer(BufferDesc *bufdesc, XLogRecPtr *lsn)
+{
        uint32          buf_state;
 
        /*
@@ -4425,42 +4445,16 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
         * someone else flushed the buffer before we could, so we need not do
         * anything.
         */
-       if (!StartBufferIO(buf, false, false))
-               return;
-
-       /* Setup error traceback support for ereport() */
-       errcallback.callback = shared_buffer_write_error_callback;
-       errcallback.arg = buf;
-       errcallback.previous = error_context_stack;
-       error_context_stack = &errcallback;
-
-       /* Find smgr relation for buffer */
-       if (reln == NULL)
-               reln = smgropen(BufTagGetRelFileLocator(&buf->tag), INVALID_PROC_NUMBER);
-
-       TRACE_POSTGRESQL_BUFFER_FLUSH_START(BufTagGetForkNum(&buf->tag),
-                                           buf->tag.blockNum,
-                                           reln->smgr_rlocator.locator.spcOid,
-                                           reln->smgr_rlocator.locator.dbOid,
-                                           reln->smgr_rlocator.locator.relNumber);
-
-       buf_state = LockBufHdr(buf);
-
-       /*
-        * Run PageGetLSN while holding header lock, since we don't have the
-        * buffer locked exclusively in all cases.
-        */
-       recptr = BufferGetLSN(buf);
+       if (!StartBufferIO(bufdesc, false, false))
+               return false;
 
-       /* To check if block content changes while flushing. - vadim 01/17/97 */
-       UnlockBufHdrExt(buf, buf_state,
-                                       0, BM_JUST_DIRTIED,
-                                       0);
+       *lsn = InvalidXLogRecPtr;
+       buf_state = LockBufHdr(bufdesc);
```

The header comment says “lsn returns the buffer's LSN if the table is logged”, 
which looks inaccurate, because if StartBufferIO() returns false, the function 
returns early without setting *lsn.
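One possible fix is to move the initialization above the early return, e.g.:
```
+       *lsn = InvalidXLogRecPtr;
+
+       if (!StartBufferIO(bufdesc, false, false))
+               return false;
```
Alternatively, the header comment could say the value is only meaningful when 
the function returns true.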

8 - 0008 - commit message
```
When flushing a dirty buffer, check if it the two blocks following it
``` 

Nit: “if it the” -> “if the”

9 - 0008
```
+ * Check if the blocks after my block are in shared buffers and dirty and if
+ * it is, write them out too
```

Nit: “if it is” -> “if they are”

10 - 0008
```
+       BlockNumber max_batch_size = 3; /* we only look for two successors */
```

Using the type BlockNumber for a batch size seems odd. I would suggest 
changing it to uint32.
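i.e.:
```
+       uint32          max_batch_size = 3; /* we only look for two successors */
```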

11 - 0008 - WriteBatchInit()
```
+       LockBufHdr(batch_start);
+       batch->max_lsn = BufferGetLSN(batch_start);
+       UnlockBufHdr(batch_start);
```

Should we check for an unlogged buffer before assigning max_lsn? In previous 
commits, we have done that in many places.
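For example, mirroring the BM_PERMANENT checks done elsewhere (a sketch; this 
assumes max_lsn is initialized to InvalidXLogRecPtr):
```
+       buf_state = LockBufHdr(batch_start);
+       if (buf_state & BM_PERMANENT)
+               batch->max_lsn = BufferGetLSN(batch_start);
+       UnlockBufHdr(batch_start);
```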

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/