Re: Optimize CPU usage of dropping buffers during recovery

Jingtang Zhang Thu, 04 Jun 2026 00:19:56 -0700

Hi hackers,

Reposting this patch to bump the thread -- the v2 I sent a
while ago didn't get any reviews.


To recap the problem: during crash recovery, when a transaction
involves creating and then dropping an empty relation (or a
relation that happens to be empty at the time of the crash),
DropRelationsAllBuffers() falls back to scanning the entire
buffer pool. This is because no WAL record ever references a
block of that relation, so smgr_cached_nblocks is never
initialized - it stays as InvalidBlockNumber, and the optimized
BufMapping lookup path introduced by commit d6ad34f3410 cannot
be used.

This can easily happen in practice. For example, refreshing a
materialized view whose query returns no rows, or truncating a
table and reloading it but the source query returns nothing.
With a large shared_buffers setting, the recovery time becomes
dominated by repeated full scans of the buffer pool.

The attached patch addresses this with a small change in
smgr_redo so that the optimized BufMapping path can be taken
in this scenario. The impact, measured on an earlier run with
16GB shared_buffers and 10 clients * 500 CREATE/DROP
transactions during crash recovery:

  w/o patch: CPU user: 77.58 s, system: 0.27 s
  patched:   CPU user: 0.14 s,  system: 0.09 s

In more detail, the fix has two parts that work together:

1. In smgr_redo(XLOG_SMGR_CREATE), set smgr_cached_nblocks to
   0 for the MAIN, FSM, and VM forks. A CREATE record means the
   relation was just created with zero blocks, so 0 is the
   correct cached value.

2. In XLogReadBufferExtended(), if the cached value is 0,
   invalidate it before calling smgrnblocks(), so that it does a
   fresh lseek to get the real file size. This handles the case
   where the relation was extended before the crash. The extra
   lseek is acceptable here because we are about to do I/O to
   read the block anyway.

For the common case where empty relations are created and
dropped without any data written, part (1) alone is sufficient:
the cached value stays 0, DropRelationsAllBuffers() sees 0
blocks to invalidate, takes the fast path, and returns
immediately. Part (2) is only needed as a safety net for the
less common case.

Regarding the INIT fork question I raised in v1: I now believe
we don't need to handle it. The CREATE record for an INIT fork
has forkNum == INIT_FORKNUM, not MAIN_FORKNUM, so the new
condition simply does not apply to it.

I've also verified that relfilenumber reuse within a single
recovery is not a concern. As documented in mdunlink(), the
first segment file is kept until the next checkpoint to prevent
relfilenumber reuse, and during redo there is no possibility of
allocating new relfilenumbers.

v3 attached, rebased on current master. Compared to v2, the
comments have been improved to better explain the relationship
between the two code changes and why the approach is safe.
Note that the Perl test case included in the patch is
only meant as a benchmark script for conveniently reproducing
the problem and measuring the improvement - it is not intended
as a regression test.

Any thoughts or reviews would be greatly appreciated.

--
Regards, Jingtang

v3-0001-Optimize-CPU-usage-of-dropping-buffers-during-recove.patch
Description: Binary data

Re: Optimize CPU usage of dropping buffers during recovery

Reply via email to