On Thu, Mar 12, 2026 at 1:05 PM Andres Freund <[email protected]> wrote: > Which in turn made me wonder: Is it actually OK for > visibilitymap_prepare_truncate() to do separate WAL logging from > XLOG_SMGR_TRUNCATE?
I mean, I think this is just a blatant violation of the usual practice around write-ahead logging. We're hoping that the change we're doing here will be atomic with a later change in a separate critical section. I don't see how that's ever going to work out. Functions with "prepare" in the name should be limited to doing things that can be done ahead of the operation for which they are preparing. The actual operation needs to all happen in one critical section after all preparation is complete. > Separately, is it actually sufficient that visibilitymap_prepare_truncate() > only WAL logs the changes if XLogHintBitIsNeeded()? I'm going back and forth > in my head about danger scenarios involving PITR or crashes inbetween > prepare_truncate() and the actual truncation. Since we don't set the page LSN, the changes made in that critical section can make it to disk even if we don't manage to replay the truncate record and fail to make it to disk even if we do. That can only be OK if those changes are never critical. The only thing we're doing here is clearing VM bits, so if that happens when it shouldn't, nothing should break, except maybe synchronization between primaries and standbys. But if it does happen when it shouldn't, that's potentially data-corrupting: the relation can be re-extended, and now you've got pages where the VM bit in the page is clear and the VM bit in the map is set, which is never OK. > ISTM that we should not do visibilitymap_prepare_truncate() during recovery > and always WAL log the prepare_truncate() on the primary. Does that sound > sane? I think the real need here is, um, to just following the regular WAL-logging rules. All the things that need to happen as part of a single operation need to be done in one critical section while holding one set of buffer locks all at the same time, and the WAL record needs to be written in that same critical section. Or alternatively several operations that each follow these rules and together add up to the same thing. For example, it would be fine to treat clearing the tailing bits of the last partial page as a separate operation that happens before the actual truncation and that therefore has its own WAL record, buffer locks, and critical section, provided that you can keep any of those bits from being set again before the actual truncation happens, e.g. by hanging onto the buffer lock. During replay, there can't be any concurrent writers, so replay can be fully consecutive. I feel like what's basically happened here is that an attempt was made to hide too many of the details under the visibilitymap_* layer. That's nice for abstraction, but it's not so great for correctness. -- Robert Haas EDB: http://www.enterprisedb.com
