Hi hackers, Reposting this patch to bump the thread -- the v2 I sent a while ago didn't get any reviews.
To recap the problem: during crash recovery, when a transaction involves creating and then dropping an empty relation (or a relation that happens to be empty at the time of the crash), DropRelationsAllBuffers() falls back to scanning the entire buffer pool. This is because no WAL record ever references a block of that relation, so smgr_cached_nblocks is never initialized - it stays as InvalidBlockNumber, and the optimized BufMapping lookup path introduced by commit d6ad34f3410 cannot be used. This can easily happen in practice. For example, refreshing a materialized view whose query returns no rows, or truncating a table and reloading it but the source query returns nothing. With a large shared_buffers setting, the recovery time becomes dominated by repeated full scans of the buffer pool. The attached patch addresses this with a small change in smgr_redo so that the optimized BufMapping path can be taken in this scenario. The impact, measured on an earlier run with 16GB shared_buffers and 10 clients * 500 CREATE/DROP transactions during crash recovery: w/o patch: CPU user: 77.58 s, system: 0.27 s patched: CPU user: 0.14 s, system: 0.09 s In more detail, the fix has two parts that work together: 1. In smgr_redo(XLOG_SMGR_CREATE), set smgr_cached_nblocks to 0 for the MAIN, FSM, and VM forks. A CREATE record means the relation was just created with zero blocks, so 0 is the correct cached value. 2. In XLogReadBufferExtended(), if the cached value is 0, invalidate it before calling smgrnblocks(), so that it does a fresh lseek to get the real file size. This handles the case where the relation was extended before the crash. The extra lseek is acceptable here because we are about to do I/O to read the block anyway. For the common case where empty relations are created and dropped without any data written, part (1) alone is sufficient: the cached value stays 0, DropRelationsAllBuffers() sees 0 blocks to invalidate, takes the fast path, and returns immediately. Part (2) is only needed as a safety net for the less common case. Regarding the INIT fork question I raised in v1: I now believe we don't need to handle it. The CREATE record for an INIT fork has forkNum == INIT_FORKNUM, not MAIN_FORKNUM, so the new condition simply does not apply to it. I've also verified that relfilenumber reuse within a single recovery is not a concern. As documented in mdunlink(), the first segment file is kept until the next checkpoint to prevent relfilenumber reuse, and during redo there is no possibility of allocating new relfilenumbers. v3 attached, rebased on current master. Compared to v2, the comments have been improved to better explain the relationship between the two code changes and why the approach is safe. Note that the Perl test case included in the patch is only meant as a benchmark script for conveniently reproducing the problem and measuring the improvement - it is not intended as a regression test. Any thoughts or reviews would be greatly appreciated. -- Regards, Jingtang
v3-0001-Optimize-CPU-usage-of-dropping-buffers-during-recove.patch
Description: Binary data
