Re: Why clearing the VM doesn't require registering vm buffer in wal record

Andres Freund Thu, 12 Mar 2026 10:05:41 -0700

Hi,

On 2026-03-06 11:49:20 -0500, Robert Haas wrote:
> On Thu, Mar 5, 2026 at 7:43 PM Andres Freund <[email protected]> wrote:
> > Which perhaps also should have emitted an FPI when clearing a bit? But I'm
> > unsure that that was required at the time. OTOH, it did seem to generate an
> > FPI for setting a VM bit, so ...
> 
> After booting up the part of my brain that worked on this back in
> 2011, I think I reasoned that clearing a bit was idempotent and didn't
> depend on the prior page state, so that we would be adequate protected
> without an FPI. I had to add a WAL record to *set* the VM bit, because
> that wasn't logged at all, but the actual of clearing the VM bit
> piggybacked on the heap change that necessitated doing so, and having
> that record also emit an FPI for the VM page didn't seem to me to add
> anything.
> 
> So I think technically it's the addition of checksums that turns this
> into a real bug, because now torn pages are a checksum-failure hazard.
> However, that would have been extremely hard to notice given that I
> seem never to have actually documented that reasoning in a comment
> anywhere.


And then it got a lot worse with incremental backup support...


> I don't think we should mess around with trying to make the behavior
> here conditional on wal_level, checksums, etc. We should just do it
> all the time to keep the logic simple.

Agreed.


I've been working on a fix for this (still pretty raw though).  What a mess.


As part of validating that I added error checks that disallow reads in the
startup process for anything other than the FSM fork.  Which triggered, due to
visibilitymap_prepare_truncate() reading a VM page.

Which in turn made me wonder: Is it actually OK for
visibilitymap_prepare_truncate() to do separate WAL logging from
XLOG_SMGR_TRUNCATE?

I suspect not, consider this scenario:

1) primary does XLOG_FPI for the VM page
2) checkpoint
3) primary does XLOG_SMGR_TRUNCATE
4) standby replays XLOG_FPI
5) primary performs a restartpoint
6) primary replays XLOG_SMGR_TRUNCATE, as part of that it again does
   visibilitymap_prepare_truncate, which dirties the VM page
7) standby crashes while writing out the VM page
8) standby does recovery and now finds a torn VM page, triggering a checksum
   failure

Perhaps that's kinda ok, because vm_readbuf() uses ZERO_ON_ERROR, so we'd just
wipe out that region of the VM?


Separately, is it actually sufficient that visibilitymap_prepare_truncate()
only WAL logs the changes if XLogHintBitIsNeeded()?  I'm going back and forth
in my head about danger scenarios involving PITR or crashes inbetween
prepare_truncate() and the actual truncation.


ISTM that we should not do visibilitymap_prepare_truncate() during recovery
and always WAL log the prepare_truncate() on the primary.  Does that sound
sane?


Greetings,

Andres

Re: Why clearing the VM doesn't require registering vm buffer in wal record

Reply via email to