> On Nov 20, 2025, at 06:12, Melanie Plageman <[email protected]> wrote:
>
> Are you sure you applied 0005? It is the one that adds dbId to CkptSortItem.
>
My bad. Yes, I missed downloading and applying 0005.
> - Melanie
> <v10-0001-Refactor-goto-into-for-loop-in-GetVictimBuffer.patch><v10-0002-Split-FlushBuffer-into-two-parts.patch><v10-0003-Eagerly-flush-bulkwrite-strategy-ring.patch><v10-0004-Write-combining-for-BAS_BULKWRITE.patch><v10-0005-Add-database-Oid-to-CkptSortItem.patch><v10-0006-Implement-checkpointer-data-write-combining.patch><v10-0007-Refactor-SyncOneBuffer-for-bgwriter-use-only.patch>
I went through the whole patch set again, and got a few more comments:
1 - 0002
```
+/*
+ * Prepare and write out a dirty victim buffer.
+ *
+ * Buffer must be pinned, the content lock must be held exclusively, and the
+ * buffer header spinlock must not be held. The exclusive lock is released and
+ * the buffer is returned pinned but not locked.
+ *
+ * bufdesc may be modified.
+ */
+static void
+CleanVictimBuffer(BufferDesc *bufdesc,
+				  bool from_ring, IOContext io_context)
+{
+	XLogRecPtr	max_lsn = InvalidXLogRecPtr;
+	LWLock	   *content_lock;

-	/* Find smgr relation for buffer */
-	if (reln == NULL)
-		reln = smgropen(BufTagGetRelFileLocator(&buf->tag),
-						INVALID_PROC_NUMBER);

+	/* Set up this victim buffer to be flushed */
+	if (!PrepareFlushBuffer(bufdesc, &max_lsn))
+		return;
```
I believe that when PrepareFlushBuffer(bufdesc, &max_lsn) returns false, we
should release content_lock before the "return", because the function comment
clearly says "The exclusive lock is released and the buffer is returned pinned
but not locked". Also, looking at the code where CleanVictimBuffer() is called:
```
		if (StrategyRejectBuffer(strategy, buf_hdr, from_ring))
		{
			LWLockRelease(content_lock);
			UnpinBuffer(buf_hdr);
			continue;
		}

		/* Content lock is released inside CleanVictimBuffer */
		CleanVictimBuffer(buf_hdr, from_ring, io_context);
```
In the preceding "if" branch, content_lock is released before "continue", so
the early-return path in CleanVictimBuffer() should do the same.
2 - 0002
```
+ * Buffer must be pinned, the content lock must be held exclusively, and the
+ * buffer header spinlock must not be held. The exclusive lock is released and
+ * the buffer is returned pinned but not locked.
+ *
+ * bufdesc may be modified.
+ */
+static void
+CleanVictimBuffer(BufferDesc *bufdesc,
+				  bool from_ring, IOContext io_context)
```
The function comment says "the content lock must be held exclusively", but in
the calling code:
```
		content_lock = BufferDescriptorGetContentLock(buf_hdr);
		if (!LWLockConditionalAcquire(content_lock, LW_SHARED))
		{
			/*
			 * Someone else has locked the buffer, so give it up and loop
			 * back to get another one.
			 */
			UnpinBuffer(buf_hdr);
			continue;
		}

		/*
		 * If using a nondefault strategy, and writing the buffer would
		 * require a WAL flush, let the strategy decide whether to go
		 * ahead and write/reuse the buffer or to choose another victim.
		 * We need the content lock to inspect the page LSN, so this can't
		 * be done inside StrategyGetBuffer.
		 */
		if (StrategyRejectBuffer(strategy, buf_hdr, from_ring))
		{
			LWLockRelease(content_lock);
			UnpinBuffer(buf_hdr);
			continue;
		}

		/* Content lock is released inside CleanVictimBuffer */
		CleanVictimBuffer(buf_hdr, from_ring, io_context);
```
content_lock is acquired with LW_SHARED, not LW_EXCLUSIVE, so the comment and
the code disagree.
3 - 0003
In CleanVictimBuffer(), more logic is added, but the content_lock leak problem
from comment 1 is still there.
4 - 0003
```
+/*
+ * Returns true if the buffer needs WAL flushed before it can be written out.
+ * Caller must not already hold the buffer header spinlock.
+ */
+static bool
+BufferNeedsWALFlush(BufferDesc *bufdesc, XLogRecPtr *lsn)
+{
+	uint32		buf_state = LockBufHdr(bufdesc);
+
+	*lsn = BufferGetLSN(bufdesc);
+
+	UnlockBufHdr(bufdesc);
+
+	/*
+	 * See buffer flushing code for more details on why we condition this on
+	 * the relation being logged.
+	 */
+	return buf_state & BM_PERMANENT && XLogNeedsFlush(*lsn);
+}
```
This is a new function. I am wondering whether we should skip updating "lsn"
when BM_PERMANENT is not set? Because from the existing code:
```
	if (buf_state & BM_PERMANENT)
		XLogFlush(recptr);
```
XLogFlush() should only happen when BM_PERMANENT is set.
5 - 0004 - I am wondering whether this could be a race condition.
PageSetBatchChecksumInplace() is called once after all the pages were pinned
earlier, but other backends may modify a page's contents while the batch is
being assembled, because batching only holds a page's content_lock
temporarily, not across the entire run. So:
• Page A pinned + content lock acquired + LSN read → content lock released
• Another backend modifies Page A and sets new LSN, dirties buffer
• Page A is written by this batch using an outdated checksum / outdated
page contents
6 - 0006 - Ah, 0006 seems to have resolved comment 4 about BM_PERMANENT.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/