On Thu, Mar 12, 2026 at 1:05 PM Andres Freund <[email protected]> wrote:
> Which in turn made me wonder: Is it actually OK for
> visibilitymap_prepare_truncate() to do separate WAL logging from
> XLOG_SMGR_TRUNCATE?

I mean, I think this is just a blatant violation of the usual practice
around write-ahead logging. We're hoping that the change we're doing
here will be atomic with a later change in a separate critical
section. I don't see how that's ever going to work out. Functions with
"prepare" in the name should be limited to doing things that can be
done ahead of the operation for which they are preparing. The actual
operation needs to all happen in one critical section after all
preparation is complete.

> Separately, is it actually sufficient that visibilitymap_prepare_truncate()
> only WAL logs the changes if XLogHintBitIsNeeded()?  I'm going back and forth
> in my head about danger scenarios involving PITR or crashes inbetween
> prepare_truncate() and the actual truncation.

Since we don't set the page LSN, the changes made in that critical
section can make it to disk even if we don't manage to replay the
truncate record and fail to make it to disk even if we do. That can
only be OK if those changes are never critical. The only thing we're
doing here is clearing VM bits, so if that happens when it shouldn't,
nothing should break, except maybe synchronization between primaries
and standbys. But if it does happen when it shouldn't, that's
potentially data-corrupting: the relation can be re-extended, and now
you've got pages where the VM bit in the page is clear and the VM bit
in the map is set, which is never OK.

> ISTM that we should not do visibilitymap_prepare_truncate() during recovery
> and always WAL log the prepare_truncate() on the primary.  Does that sound
> sane?

I think the real need here is, um, to just following the regular
WAL-logging rules. All the things that need to happen as part of a
single operation need to be done in one critical section while holding
one set of buffer locks all at the same time, and the WAL record needs
to be written in that same critical section. Or alternatively several
operations that each follow these rules and together add up to the
same thing. For example, it would be fine to treat clearing the
tailing bits of the last partial page as a separate operation that
happens before the actual truncation and that therefore has its own
WAL record, buffer locks, and critical section, provided that you can
keep any of those bits from being set again before the actual
truncation happens, e.g. by hanging onto the buffer lock. During
replay, there can't be any concurrent writers, so replay can be fully
consecutive.

I feel like what's basically happened here is that an attempt was made
to hide too many of the details under the visibilitymap_* layer.
That's nice for abstraction, but it's not so great for correctness.

-- 
Robert Haas
EDB: http://www.enterprisedb.com


Reply via email to