Hi, On 2026-03-06 11:49:20 -0500, Robert Haas wrote: > On Thu, Mar 5, 2026 at 7:43 PM Andres Freund <[email protected]> wrote: > > Which perhaps also should have emitted an FPI when clearing a bit? But I'm > > unsure that that was required at the time. OTOH, it did seem to generate an > > FPI for setting a VM bit, so ... > > After booting up the part of my brain that worked on this back in > 2011, I think I reasoned that clearing a bit was idempotent and didn't > depend on the prior page state, so that we would be adequate protected > without an FPI. I had to add a WAL record to *set* the VM bit, because > that wasn't logged at all, but the actual of clearing the VM bit > piggybacked on the heap change that necessitated doing so, and having > that record also emit an FPI for the VM page didn't seem to me to add > anything. > > So I think technically it's the addition of checksums that turns this > into a real bug, because now torn pages are a checksum-failure hazard. > However, that would have been extremely hard to notice given that I > seem never to have actually documented that reasoning in a comment > anywhere.
And then it got a lot worse with incremental backup support... > I don't think we should mess around with trying to make the behavior > here conditional on wal_level, checksums, etc. We should just do it > all the time to keep the logic simple. Agreed. I've been working on a fix for this (still pretty raw though). What a mess. As part of validating that I added error checks that disallow reads in the startup process for anything other than the FSM fork. Which triggered, due to visibilitymap_prepare_truncate() reading a VM page. Which in turn made me wonder: Is it actually OK for visibilitymap_prepare_truncate() to do separate WAL logging from XLOG_SMGR_TRUNCATE? I suspect not, consider this scenario: 1) primary does XLOG_FPI for the VM page 2) checkpoint 3) primary does XLOG_SMGR_TRUNCATE 4) standby replays XLOG_FPI 5) primary performs a restartpoint 6) primary replays XLOG_SMGR_TRUNCATE, as part of that it again does visibilitymap_prepare_truncate, which dirties the VM page 7) standby crashes while writing out the VM page 8) standby does recovery and now finds a torn VM page, triggering a checksum failure Perhaps that's kinda ok, because vm_readbuf() uses ZERO_ON_ERROR, so we'd just wipe out that region of the VM? Separately, is it actually sufficient that visibilitymap_prepare_truncate() only WAL logs the changes if XLogHintBitIsNeeded()? I'm going back and forth in my head about danger scenarios involving PITR or crashes inbetween prepare_truncate() and the actual truncation. ISTM that we should not do visibilitymap_prepare_truncate() during recovery and always WAL log the prepare_truncate() on the primary. Does that sound sane? Greetings, Andres
