> On Feb 25, 2026, at 14:52, Chao Li <[email protected]> wrote:
> 
> 
> 
>> On Feb 20, 2026, at 07:41, Melanie Plageman <[email protected]> 
>> wrote:
>> 
>> On Wed, Jan 14, 2026 at 6:35 PM Melanie Plageman
>> <[email protected]> wrote:
>>> 
>>> Thanks! v13 attached.
>> 
>> I've added background writer write combining and then spent some time
>> refactoring and benchmarking this over the last few weeks, and I've
>> realized I won't be able to get it in shape for Postgres 19. I've
>> attached version 14 here with my WIP code so I can come back to it in
>> the next release.
>> 
>> 0012 is an invasive refactor of GetVictimBuffer() and all the
>> functions it calls that needs to be split apart into smaller commits
>> and redistributed throughout the other patches in the set. Other than
>> that, there are a number of todos both in the code and from a
>> benchmarking perspective:
>> 
>> - Probe for preceding as well as following blocks when looking for
>> blocks to combine
>> - Experiment with different batch sizes
>> - Enable and test bulkread and vacuum write combining
>> 
>> I also periodically get this troubling warning: "WARNING:  invalid
>> page in block 2 of relation "base/16384/2608_fsm"; zeroing out page".
>> So clearly there's at least one bug somewhere.
>> 
>> FWIW benchmarks showed a large improvement from backend and bgwriter
>> write combining, but I'll provide more detailed output from benchmarks
>> when I post a more reviewable patch set.
>> 
>> 0001 and 0002 are completely independent cleanup that should be pushed
>> anyway, so I might do that, but I didn't want to do so without having
>> posted on the list at least once.
>> 
>> - Melanie
>> <v14-0001-Make-FlushUnlockedBuffer-use-its-parameters.patch><v14-0002-Make-ScheduleBufferTagForWriteback-static.patch><v14-0003-Streamline-buffer-rejection-for-bulkreads-of-unl.patch><v14-0004-Allow-PinBuffer-to-skip-increasing-usage-count.patch><v14-0005-Eagerly-flush-bulkwrite-strategy-ring.patch><v14-0006-Write-combining-for-BAS_BULKWRITE.patch><v14-0007-Add-database-Oid-to-CkptSortItem.patch><v14-0008-Implement-checkpointer-data-write-combining.patch><v14-0009-Remove-SyncOneBuffer-and-refactor-BgBufferSync.patch><v14-0010-Normal-backend-write-combining.patch><v14-0011-Add-write-combining-for-background-writer.patch><v14-0012-WIP-refactor.patch>
> 
> A few comments for v14:
> 
> 1 - 0001
> ```
> - FlushBuffer(buf, reln, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
> + FlushBuffer(buf, reln, io_object, io_context);
> ```
> 
> This changes the hardcoded IOOBJECT_RELATION to io_object when calling
> FlushBuffer(). But FlushBuffer() itself still uses IOOBJECT_RELATION instead
> of io_object, so the change doesn't actually take effect:
> ```
>    pgstat_count_io_op_time(IOOBJECT_RELATION, io_context,
>                                              IOOP_WRITE, io_start, 1, BLCKSZ);
> ```
> 
> So, an update in FlushBuffer() is also needed.
> 
> 2 - 0003 - freelist.c
> ```
> bool
> StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_ring)
> {
> +	Assert(strategy);
> +
> 	/* We only do this in bulkread mode */
> 	if (strategy->btype != BAS_BULKREAD)
> 		return false;
> @@ -795,8 +800,14 @@ StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool from_r
> 		strategy->buffers[strategy->current] != BufferDescriptorGetBuffer(buf))
> 		return false;
> 
> +	Assert(BufferIsLockedByMe(BufferDescriptorGetBuffer(buf)));
> +	Assert(!(pg_atomic_read_u64(&buf->state) & BM_LOCKED));
> ```
> 
> I don’t quite understand Assert(!(pg_atomic_read_u64(&buf->state) & 
> BM_LOCKED));
> 
> The comment says “the buffer header spinlock must not be held,” but I doubt 
> this Assert actually ensures that. Since BM_LOCKED is just a bit in a shared 
> state, isn't it possible for a concurrent backend to grab the lock right when 
> we’re checking it?
> 
> If a concurrent process locks the header a split-second before the Assert, 
> we’ll get a false-positive crash in a dev environment even though the code is 
> fine. If they lock it a split-second after, then the Assert didn't really 
> catch anything. It feels like a race condition either way.
> 
> I think the only way to truly ensure a spinlock is "not held" is to actually
> acquire it. Checking the bit like this only gives a snapshot that may already
> be stale by the time the next instruction executes. What was the specific
> worry here - a self-deadlock, or something else?
> 
> 3 - 0004
> ```
> +typedef enum BufferUsageCountChange
> +{
> + BUC_ZERO,
> + BUC_MAX_ONE,
> + BUC_ONE,
> +} BufferUsageCountChange;
> ```
> 
> Can we add brief comments explaining each enum item? I find it hard to guess
> the meanings from the names without reading the code.
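
Purely as a sketch of the kind of comments I mean; the meanings below are my guesses from the names, not confirmed against the code:

```c
typedef enum BufferUsageCountChange
{
	BUC_ZERO,			/* leave the usage count unchanged? (guess) */
	BUC_MAX_ONE,		/* bump it, but never above one? (guess) */
	BUC_ONE,			/* bump it by one, up to the normal max? (guess) */
} BufferUsageCountChange;
```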
> 
> 4 - 0005 - bufmgr.c
> ```
> +/*
> + * Prepare bufdesc for eager flushing.
> + *
> + * Given bufnum, return the buffer descriptor of the buffer to eagerly flush,
> + * pinned and locked and with BM_IO_IN_PROGRESS set, or NULL if this buffer
> + * does not contain a block that should be flushed.
> + *
> + * If returning a buffer, also return its LSN.
> + */
> +static BufferDesc *
> +PrepareOrRejectEagerFlushBuffer(Buffer bufnum)
> ```
> 
> This header comment looks stale.
> 
> * It says “and with BM_IO_IN_PROGRESS set”, but I don’t see where 
> BM_IO_IN_PROGRESS is set in this function.
> * It says “also return its LSN”, but I don’t see LSN is returned.
> 
> 5 - 0006
> ```
> + /* Start IO on the first buffer */
> + if (!StartBufferIO(buf_hdr, false, false))
> + goto clean;
> ```
> 
> This failure branch looks like it leaks the content lock. The buffer was
> already share-locked via BufferLockConditional() earlier, and I don't see the
> "clean" path unlock that content lock before returning.
> 
> 6 - 0009
> ```
> + FlushUnlockedBuffer(bufHdr, NULL, IOOBJECT_RELATION, IOCONTEXT_NORMAL);
> + UnpinBuffer(bufHdr);
> +
> + ScheduleBufferTagForWriteback(wb_context, IOCONTEXT_NORMAL, &bufHdr->tag);
> ```
> 
> Once bufHdr is unlocked and unpinned, it may be unsafe to keep using
> bufHdr->tag. I think we should copy bufHdr->tag into a local variable before
> unlocking the buffer.
> 
> ~~ My brain is stuck; I will continue with 0010 tomorrow ~~
> 

7 - 0010
```
- * batch_limit is the largest batch we are allowed to construct given the
- * remaining blocks in the table, the number of available pins, and the
- * current configuration.
+ * max_batch_size is the maximum number of blocks that can be combined into a
+ * single write in general. This function, based on the block number of start,
+ * will determine the maximum IO size for this particular write given how much
+ * of the file remains. max_batch_size is provided by the caller so it doesn't
+ * have to be recalculated for each write.
```

I don’t see that FindStrategyFlushAdjacents() has a parameter named
max_batch_size; batch_limit is still there.

8 - 0010
```
 static void
+FindFlushAdjacents(BufferDesc *batch_start,
+				   uint32 batch_limit,
+				   BufferWriteBatch *batch)
+{
+	BufferTag	require;		/* requested block's buffer tag */
+	uint32		newHash;		/* hash value for require */
+	LWLock	   *newPartitionLock;	/* buffer partition lock for it */
+	int			buf_id;
+
+	/* create a tag so we can lookup the buffers */
+	InitBufferTag(&require, &batch->reln->smgr_rlocator.locator,
+				  batch->forkno, InvalidBlockNumber);
+
+	for (; batch->n < batch_limit; batch->n++)
+	{
+		XLogRecPtr	lsn;
+
+		require.blockNum = batch->start + batch->n;
+
+		Assert(BlockNumberIsValid(require.blockNum));
```

When I first read this Assert, I was confused about how it can assume the
block number is valid; then I realized that WriteBatchInit() ensures a safe
“limit”. Maybe add a brief comment above the Assert.

9 - 0010
```
+		batch->bufdescs[batch->n] =
+			PrepareOrRejectEagerFlushBuffer(buf_id + 1,
+											&require,
+											newPartitionLock,
+											&lsn);
+		if (lsn > batch->max_lsn)
+			batch->max_lsn = lsn;
+
+		if (batch->bufdescs[batch->n] == NULL)
+			break;
```

I think we can move the if (batch->bufdescs[batch->n] == NULL) check to
before if (lsn > batch->max_lsn).

This is not a correctness issue, because PrepareOrRejectEagerFlushBuffer()
sets lsn to InvalidXLogRecPtr when it returns NULL, but doing the NULL check
first just feels more reasonable.

10 - 0010
```
+ * max_lsn may be updated if the provided buffer LSN exceeds the current max
+ * LSN.
  */
 static BufferDesc *
 PrepareOrRejectEagerFlushBuffer(Buffer bufnum,
								BufferTag *require,
+								LWLock *buftable_lock,
								XLogRecPtr *lsn)
```

The function doesn’t have a parameter named max_lsn; should this say “lsn”?

11 - 0010
```
+FindFlushAdjacents(BufferDesc *batch_start,
+                                  uint32 batch_limit,
+                                  BufferWriteBatch *batch)
```

Looks like batch_start is not used at all in this function.

12 - 0011
```
+ * Callers specify if and by how much they want to bump the buffer's usage
+ * count.
```

I don’t get what this comment means.

~~ As 0012 is WIP, I didn't review it. ~~

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
