On Tue, Mar 17, 2026 at 12:26 AM Michael Paquier <[email protected]> wrote:
> This stuff seems sensible enough that I think we should at least have
> a test, no? It does not have to be absolutely perfect in terms of
> reproducibility, just good enough to be able to detect it across the
> buildfarm. We already do various things with page boundaries in WAL
> during recovery, and a shutdown could be perhaps timed to increase the
> reproducibility rate of the issues discussed?

I initially thought that there was no easy way to trigger this issue
reliably in a test: the script I've been using won't work as soon as
there are changes in the record sizes. Then I remembered that
pg_logical_emit_message existed and could be used to write a WAL
record of a specific size, without allocating an xid and without
flushing the record.
With this, the test can be simplified to:
SELECT pg_switch_wal();
BEGIN;
SELECT pg_logical_emit_message(false, '', repeat('a', 16265), false);
ROLLBACK;
Any change to the WAL short header, the long header or the
xl_logical_message struct will "break" the test, since the record will
no longer end exactly at the page boundary. This also assumes 8-byte
alignment. On 32-bit machines the WAL record will end at 3FF0, so not
exactly at the end of the page, but that should be fine for testing
different conditions.
A word of caution about this test: while running it on my machine,
I managed to trigger some weird WAL corruption. The new segment
after the switch had 1 or 2 extra bytes at the start of the
segment, just before the xlog page magic, shifting the whole file. The
first time it happened, I thought I'd messed something up and added
the bytes myself while looking at the WAL with imhex. The second time,
I only ran the script, and the new segment had a 1.1MB size shortly
after, so I'm pretty sure I didn't do anything that could have
introduced those extra bytes.
I'm still trying to understand the trigger conditions (some race
condition between the switch and the walwriter?), but if this test is
merged, it may trigger this WAL corruption issue on the buildfarm.
Regards,
Anthonin Bonnefoy
v1-0001-Add-test-shutting-down-walsender-with-unflushed-r.patch
