Hi Hackers,

While reviewing the patch in the thread [1], I noticed the following:

When the WAL prefetcher encounters a block reference that carries a full-page
image (FPW) or has BKPBLOCK_WILL_INIT set, it correctly skips issuing a
prefetch for that block: the old on-disk content is irrelevant, since replay
will overwrite or zero the page entirely. However, if a later WAL record
within the look-ahead window references the same block without an FPW, the
prefetcher still issues an fadvise64 syscall for it, because the block was
never recorded in the duplicate-detection window.

Fixed this by marking these blocks as recently seen in the FPW and WILL_INIT
skip paths. The existing duplicate-check loop then naturally suppresses
prefetch attempts for subsequent references to the same block, counting
them under the skip_rep stat. This is particularly effective for workloads
that produce many sequential writes to the same page (e.g., bulk inserts
into heap-only tables), where each page's first post-checkpoint touch
generates an FPW and subsequent inserts to the same page follow shortly
after in WAL.
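To illustrate the idea, here is a minimal C sketch of a fixed-size
recently-seen window (this is not the actual xlogprefetcher.c code; the type
and function names are invented for the example, and the window size of 4 is
an assumption standing in for XLOGPREFETCHER_SEQ_WINDOW_SIZE). The only
behavioral change the patch makes corresponds to the remember() call in the
FPW/WILL_INIT branch:

```c
#include <stdbool.h>

/* Assumed window size for illustration (cf. XLOGPREFETCHER_SEQ_WINDOW_SIZE). */
#define SEQ_WINDOW_SIZE 4

typedef struct
{
    unsigned rel;    /* stand-in for the relation/fork identity */
    unsigned block;
} BlockRef;

typedef struct
{
    BlockRef recent[SEQ_WINDOW_SIZE]; /* circular duplicate-detection window */
    int      next;                    /* next slot to overwrite */
    long     prefetches, skip_rep, skip_fpw;
} Prefetcher;

static bool
recently_seen(const Prefetcher *p, BlockRef b)
{
    for (int i = 0; i < SEQ_WINDOW_SIZE; i++)
        if (p->recent[i].rel == b.rel && p->recent[i].block == b.block)
            return true;
    return false;
}

static void
remember(Prefetcher *p, BlockRef b)
{
    p->recent[p->next] = b;
    p->next = (p->next + 1) % SEQ_WINDOW_SIZE;
}

/* Handle one decoded block reference from the look-ahead window. */
static void
process_ref(Prefetcher *p, BlockRef b, bool has_fpw)
{
    if (recently_seen(p, b))
    {
        p->skip_rep++;       /* duplicate: suppress the prefetch */
        return;
    }
    if (has_fpw)
    {
        p->skip_fpw++;
        remember(p, b);      /* the fix: record FPW blocks too, so later
                              * references fall into the skip_rep path */
        return;
    }
    p->prefetches++;         /* here the real code issues the fadvise call */
    remember(p, b);
}
```

With this change, an FPW on a block followed by plain records touching the
same block counts one skip_fpw, counts the repeats under skip_rep, and issues
no fadvise at all; without the remember() call in the FPW branch, the first
post-FPW reference would have issued a wasted prefetch.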

To further reduce wasted prefetch calls, we could increase the window size by
changing XLOGPREFETCHER_SEQ_WINDOW_SIZE according to the maximum number of
blocks that can be prefetched, or maintain a hash table. I did not attempt
that in this patch because it could impact redo performance (more CPU
cycles). In the worst case, the current fix may still miss repeats, for
example when the table has more than four indexes. However, I still believe
it is an improvement over the baseline. If we decide to spend more cycles on
optimizing the window size, that can be a separate patch.

Benchmarked recovery with 10 GB of WAL from an insert-only workload into a
no-index table, replayed from an identical crash snapshot:

Fast disk (NVMe)
Baseline: redo 37.30s, system CPU 9.38s, 1,204,992 fadvise calls
Patched: redo 25.78s, system CPU 3.39s, 122,753 fadvise calls

This is nearly 31% faster redo with 90% fewer fadvise syscalls.

*Prefetch Counters*
Counter                      Baseline      Patched       Delta
prefetch (fadvise issued)    1,204,992     122,753       −89.8%
hit                          924,457       911,785       −1.4%
skip_init                    1,097,536     1,097,536     0
skip_fpw                     28            28            0
skip_rep                     80,020,209    81,115,120    +1,094,911

Slower disk (with ~2ms latency)
Baseline: redo 188.04s, system CPU 6.87s, 1,204,992 fadvise calls
Patched: redo 60.02s, system CPU 3.39s, 122,753 fadvise calls

This is nearly 68% faster redo, a 3.1× overall speedup.


*Configuration:*

shared_buffers = '124GB'
huge_pages = on
wal_buffers = '512MB'
max_wal_size = '100GB'
checkpoint_timeout = '30min'
full_page_writes = on
maintenance_io_concurrency = 50
recovery_prefetch = on

*Workload:*
CREATE TABLE test_noindex(id bigint, val1 int, val2 int, payload text);
-- No indexes, no primary key.


-- Then insert in batches of 1M rows until WAL reaches 10 GB:
INSERT INTO test_noindex
SELECT g, (g*7+13)%100000, (g*31+17)%100000, repeat(chr(65+(g%26)),60)
FROM generate_series(1, 1000000) g;


Thanks,
Satya

[1]
https://www.postgresql.org/message-id/flat/CA%2B3i_M8C%2BrK9vhwBm8U%2Bys2hbDifoBb4Xnws5Wmn2f4u7iqOpA%40mail.gmail.com#8eac90e696baf6e4f58f91482af28e07

Attachment: 0001-xlogprefetcher-record-recent-fpw.patch
