Hi,
On 2026-01-28 18:16:10 -0500, Melanie Plageman wrote:
> > > + */
> > > + if ((old_vmbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
> > > + {
> > > + vacrel->vm_new_visible_pages++;
> > > + if (presult.all_frozen)
> > > + {
> > > + vacrel->vm_new_visible_frozen_pages++;
> > > + *vm_page_frozen = true;
> >
> > Not this patches fault, but I find "vm_new_visible_pages" and
> > "vm_new_visible_frozen_pages" pretty odd names. The concept is all-visible
> > and
> > frozen. The page itself isn't visible or invisible...
>
> I thought having the extra word "all" in there made it too long. And
> since "vm" is there, that isn't set unless the page is
> _all_-visible/all-frozen. But if you think it gives people the wrong
> idea, I am willing to change it. I can omit vm and make it:
> new_all_visible_all_frozen_pages
> new_all_visible_pages
> new_all_frozen_pages
>
> Is that clearer?
Yes, I think so.
> > It's also a bit odd that a function that sounds rather read-only does stuff
> > like clearing VM/all-visible.
>
> I thought about this a lot. Ultimately, I ended up keeping it the way it is.
> I think the other option is changing from this:
>
> do_set_vm = heap_page_will_set_vm(&prstate,
> params->relation,
> blockno, buffer, page,
> vmbuffer,
> params->reason,
> do_prune, do_freeze,
> prstate.lpdead_items,
> &old_vmbits, &new_vmbits);
>
> to this:
>
> heap_page_prepare_vm_set(&prstate,
> params->relation,
> blockno, buffer, page,
> vmbuffer,
> params->reason,
> do_prune, do_freeze,
> prstate.lpdead_items,
> &old_vmbits, &new_vmbits);
>
> do_set_vm = (new_vmbits & VISIBILITYMAP_VALID_BITS) != 0;
>
> or heap_page_plan_vm_set()
> heap_page_will_set_vm() has symmetry with heap_page_will_freeze(), the
> helper that decides whether or not we will freeze tuples. I like that
> symmetry since heap_page_will_set_vm() decides whether or not to set
> the VM.
>
> Now, heap_page_plan/prepare_vm_set() does indirectly hint that
> something like clearing VM/all-visible could happen -- if you
> understand that preparing the VM to have bits set also includes
> clearing any existing corruption. And "prepare" or "plan" has more
> symmetry with prune_freeze_plan() -- though that function does not
> make changes on the page.
>
> Ultimately, clearing the VM/page of corruption is pretty anomalous
> from the rest of the code in heap_page_prune_and_freeze(). All other
> changes to the page are done in a single critical section at the
> bottom of the function.
>
> I could see an argument for moving identify_and_fix_vm_corruption()
> out of the helper and into heap_page_prune_and_freeze() but then we'd
> have to move visibilitymap_get_status() out too. And that takes away a
> lot of the benefit of encapsulating all that logic.
I was wondering about that option. Relatedly, I also was wondering if we ought
to do identify_and_fix_vm_corruption() regardless of ->attempt_update_vm.
> > Why are we not doing fixing up of the page *before* we prune it? It's a bit
> > insane that we do the WAL logging for pruning, which in turn will often
> > include an FPI, before we do the fixups. The fixes aren't WAL logged, so
> > this
> > actually leads to the standby getting further out of sync.
> >
> > I realize this isn't your mess, but brrr.
>
> Well, after this patch set, clearing the VM does happen before we emit
> WAL for pruning.
That I think is a substantial improvement, the current (i.e. before your
series) placement really is pretty insane due to the guaranteed divergence it
causes.
I wonder if we actually should just force an FPI whenever we detect such
corruption, that way it would reliably fixed on the standby as well.
> It wouldn't be hard to move the corruption fixups to the beginning of
> heap_page_prune_and_freeze() in the new code structure.
As identify_and_fix_vm_corruption() needs lpdead_items, I'm not sure that's
true?
I wonder if at least the warning for the "(PageIsAllVisible(heap_page) &&
nlpdead_items > 0)" test should be moved to
heap_prune_record_dead_or_unused(). That way the WARNING could include the
offset number and it'd also work in the mark_unused_now case.
Perhaps it also should trigger for RECENTLY_DEAD, INSERT_IN_PROGRESS,
DELETE_IN_PROGRESS?
At that point the !page_all_visible && vm_all_visible part could indeed be
moved to the start of heap_page_prune_and_freeze()
> But it would split visibility map-related logic into two parts of
> heap_page_prune_and_freeze().
I'm not convinced that that's an issue. new/old_vmbits area already variables at
the level of heap_page_prune_and_freeze(), determining it a bit earlier seems
not a problem. And I think doing this check even if !attempt_update_vm might
very well make sense.
> Would it be worth it? What benefit would we get? Do you just feel that it
> should logically come first?
One insanity is that right now we will process all frozen pages over and over
due to he skip pages threshold, wasting a *lot* of CPU and memory bandwidth.
It'd be quite defensible to just skip processing the page once we determined
it's already all frozen. But for that we'd probably want to do the
"page_all_visible && vm_all_visible" check before returning...
> > Do we actually forsee a case where only one of HEAP_PAGE_PRUNE_FREEZE |
> > HEAP_PAGE_PRUNE_UPDATE_VM would be set?
>
> Yes, when setting the VM on-access, it is too expensive to call
> heap_prepare_freeze_tuple() on each tuple. I could work on trying to
> optimize it, but it isn't currently viable.
Is it too expensive to do so even when we already decided to do some pruning?
I am not surprised it's too expensive when there's not even a dead tuple on
the page. But I am mildly surprised if it's too expensive to do when we'd WAL
log anyway?
> > > From cdf5776fadeae3430c692999b37f8a7ec944bda1 Mon Sep 17 00:00:00 2001
> > > From: Melanie Plageman <[email protected]>
> > > Date: Tue, 2 Dec 2025 16:16:22 -0500
>
> > > +static TransactionId
> > > +get_conflict_xid(bool do_prune, bool do_freeze, bool do_set_vm,
> > > + uint8 old_vmbits, uint8 new_vmbits,
> > > + TransactionId latest_xid_removed,
> > > TransactionId frz_conflict_horizon,
> > > + TransactionId visibility_cutoff_xid)
> > > +{
> >
> > The logic for horizons is now split between this and "Calculate what the
> > snapshot conflict horizon should be for a record" in
> > heap_page_will_freeze().
>
> That is true in master too. We determine frz_conflict_horizon in
> heap_page_will_freeze() and later before emitting the WAL record
> decide which of the latest_xid_removed and frz_conflict_horizon that
> we should use as the snapshot conflict horizon for the combined
> record.
I'm not sure that confusing code in master is a particularly good reason for
anything...
> All I've done is expand that part (the part before emitting the WAL
> record) a bit because now we have to consider what the horizon would
> be if we set the VM.
>
> If I really wanted to calculate it only in a single place, I could
> maintain a new variable, all_frozen_except_dead, and remove the
> frz_conflict_horizon from heap_page_will_freeze(). Then, in
> get_conflict_xid(), I could have the following logic:
>
> if (do_set_vm)
> conflict_xid = visibility_cutoff_xid;
> else if (do_freeze)
> {
> if (all_frozen_except_dead)
> conflict_xid = visibility_cutoff_xid;
> else
> {
> conflict_xid = OldestXmin;
> TransactionIdRetreat(conflict_xid);
> }
> }
> else
> conflict_xid = InvalidTransactionId;
>
> I think using all_frozen_except_dead while maintaining
> visibility_cutoff_xid (in heap_prune_record_unchanged_lp_normal()) has
> the potential to be confusing, though. We'd need to keep updating
> visibility_cutoff_xid when all_visible is false but
> all_frozen_except_dead is true as well as when all_visible is true.
> And because we don't care about all_visible_except_dead, it gets even
> more confusing to make sure we are maintaining the right variables in
> the right situations.
I suspect we should just track all of the horizons/cutoffs all the time. This
whole stuff about optimizing out a few conditional assignments complicates the
code substantially and feels extremely error prone to me.
I probably complained about this before, and it's not this patch's fault, but
PruneState->{all_visible,all_frozen} are imo confusingly named, due to
sounding like they describe the current state, rather than the possible state
after pruning. It's not helped by this comment:
* NOTE: all_visible and all_frozen initially don't include LP_DEAD
items.
* That's convenient for heap_page_prune_and_freeze() to use them to
* decide whether to opportunistically freeze the page or not. The
* all_visible and all_frozen values ultimately used to set the VM are
* adjusted to include LP_DEAD items after we determine whether or not
to
* opportunistically freeze.
"all-visible ... are adjusted to include LP_DEAD" ... - just reading that it's
hard to know what it means.
> > Although I guess I don't understand that code:
> >
> > /*
> > * Calculate what the snapshot conflict horizon should be
> > for a record
> > * freezing tuples. We can use the visibility_cutoff_xid as
> > our cutoff
> > * for conflicts when the whole page is eligible to become
> > all-frozen
> > * in the VM once we're done with it. Otherwise, we
> > generate a
> > * conservative cutoff by stepping back from OldestXmin.
> > */
> > if (prstate->all_frozen)
> > prstate->frz_conflict_horizon =
> > prstate->visibility_cutoff_xid;
> > else
> > {
> > /* Avoids false conflicts when hot_standby_feedback
> > in use */
> > prstate->frz_conflict_horizon =
> > prstate->cutoffs->OldestXmin;
> > TransactionIdRetreat(prstate->frz_conflict_horizon);
> > }
> >
> > Why does it make sense to use OldestXmin? Consider e.g. the case where there
> > is one very old tuple that needs to be frozen and one new live tuple on a
> > page. Because of the new tuple we can't mark the page all-frozen. But
> > there's
> > also no reason to not use much less aggressive horizon than OldestXmin,
> > namely
> > the newer of xmin,xmax of the old frozen tuple?
>
> We don't track the newest frozen xmin right now. Doing so wouldn't be
> free (i.e. more comparisons which may matter in a query without much
> other overhead).
I have an extremely hard time to believe that tracking the Min(xmin) over all
rows is going to be noticeable in comparison to all the overheads in pruning
(including pruning where we do nothing).
There are some checks that would be different, in particular,
heap_pre_freeze_checks() is quite expensive, because it forces an access to
the SLRU, without using hint bits. But that's a check that we wisely
(79d4bf4eff14) do only after we already decided to freeze.
The first thing to improve pruning performance that I would do is to introduce
a fastpath for pages that a) area already frozen b) do not have dead items (if
we're not freezing). Iterating through HOT chains is far from cheap, and if
all rows are live, there's not really a point in doing so. This is
particulary important for VACUUMs where we end up freezing a ton of pages that
are already frozen, due to the silly skip_pages_threshold thing.
> The only purpose it would serve is to make the snapshot conflict
> horizon more accurate/more aggressive when we freeze tuples, which
> would lead to canceling less queries than master -- which is outside
> the purview of this patch.
I think it'd also serve to actually make the choice of the horizon a heck of a
lot easier to understand. Using approximations of the accurate value means
you have to think about why that approximation is correct, whether it's too
approximate etc.
> There's also a set of complications around maintaining this number
> accurately mentioned by Peter in [1].
>
> [1]
> https://www.postgresql.org/message-id/CAH2-WzkB-Pt3zPeTXvMik6jcJn%2BdcpUqO-tt_hc13bD6sGRLPg%40mail.gmail.com
Maybe I'm confused, but what does that email have to do with determining an
accurate horizon? It argues that we can remove / freeze xmax in some
situations, but doesn't seem to talk about an accurate horizon, except in an
aside about not needing to take aborted xmins into account?
> From e591bf061ee673f3750d1180673e1ab48be43bb8 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <[email protected]>
> Date: Tue, 27 Jan 2026 16:53:11 -0500
> Subject: [PATCH v34 03/14] Move VM assert into prune/freeze code and simplify
> returned values
>
> After pruning and freezing, we do an assert-only validatation that the
> page's visibility status matches what we found during the pruning and
> freezing pass over the page.
>
> There's no reason to wait until lazy_scan_prune() to do this validation,
> as all of the VM setting logic has already been moved to
> heap_page_prune_and_freeze().
It's a bit funny to say "wait until lazy_scan_prune()", when lazy_scan_prune()
is what calls heap_page_prune_and_freeze(). I'd just say that moving it to
heap_page_prune_and_freeze() avoids having to pass information around and
will, in the future, allow the check to be performed when freezing during
on-access.
> From a94267babeedec6705fd7f3b43242c6ba0e458c0 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <[email protected]>
> Date: Tue, 2 Dec 2025 16:16:22 -0500
> Subject: [PATCH v34 04/14] Eliminate XLOG_HEAP2_VISIBLE from vacuum phase I
> prune/freeze
> @@ -804,6 +809,62 @@ heap_page_will_freeze(Relation relation, Buffer buffer,
> return do_freeze;
> }
>
> +/*
> + * Calculate the conflict horizon for the whole XLOG_HEAP2_PRUNE_VACUUM_SCAN
> + * or XLOG_HEAP2_PRUNE_ON_ACCESS record.
> + */
> +static TransactionId
> +get_conflict_xid(bool do_prune, bool do_freeze, bool do_set_vm,
> + uint8 old_vmbits, uint8 new_vmbits,
> + TransactionId latest_xid_removed,
> TransactionId frz_conflict_horizon,
> + TransactionId visibility_cutoff_xid)
> +{
> + TransactionId conflict_xid;
> +
> + /*
> + * We can omit the snapshot conflict horizon if we are not pruning or
> + * freezing any tuples and are setting an already all-visible page
> + * all-frozen in the VM.
Maybe mention when this can happen, because it's not immediately obvious.
> In this case, all of the tuples on the page must
> + * already be seen as frozen by all MVCC snapshots on the standby.
Maybe + " (any conflict would have been handled in reaction to the WAL record
for freezing those tuples)" or such?
> + */
> + if (!do_prune &&
> + !do_freeze &&
> + (old_vmbits & VISIBILITYMAP_ALL_VISIBLE) != 0 &&
> + (new_vmbits & VISIBILITYMAP_ALL_FROZEN) != 0)
> + return InvalidTransactionId;
Those != 0 check seem vestigial to me. Just checking for
& VISIBILITYMAP_ALL_VISIBLE is sufficient.
> + /*
> + * The snapshot conflict horizon for the whole record should be the most
> + * conservative of all the horizons calculated for any of the possible
> + * modifications. If this record will prune tuples, any transactions on
> + * the standby older than the youngest xmax of the most recently removed
> + * tuple this record will prune will conflict.
Why just xmax? You can have cases where xmin is a newer xid than xmax.
> If this record will freeze
> + * tuples, any transactions on the standby with xids older than the
> + * youngest tuple this record will freeze will conflict.
Transactions on the standby have no xid. Did you mean xmin?
> + * If we are setting the VM, the conflict horizon is almost always the
> + * visibility cutoff XID, except in the situation described above.
> + *
> + * By picking the newest of all of those, we can ensure that all changes
> + * in the record have been taken into account.
> + */
Comment seems better than before!
> if (do_set_vm)
> conflict_xid = visibility_cutoff_xid;
> else if (do_freeze)
> conflict_xid = frz_conflict_horizon;
> else
> conflict_xid = InvalidTransactionId;
Could it be worth checking that if (do_set_vm && do_freeze) the
frz_conflict_horizon won't "violated" by using visibility_cutoff_xid instead?
> Subject: [PATCH v34 06/14] Remove XLOG_HEAP2_VISIBLE entirely
Did not again look at this, as it seems like it ought to not be
interesting... LMK if I should have.
> From a64707f1f2fa88d7292f7a2f2a760c613eea4950 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <[email protected]>
> Date: Wed, 17 Dec 2025 13:57:16 -0500
> Subject: [PATCH v34 07/14] Simplify heap_page_would_be_all_visible visibility
> check
>
> heap_page_would_be_all_visible() doesn't care about the distinction
> between HEAPTUPLE_RECENTLY_DEAD and HEAPTUPLE_DEAD tuples -- any tuple
> that is not HEAPTUPLE_LIVE means the page is not all-visible and causes
> us to return false.
>
> Therefore, we don't need to call HeapTupleSatisfiesVacuum(), which
> includes an extra step to distinguish between dead and recently dead
> tuples using OldestXmin. Replace it with the more minimal
> HeapTupleSatisfiesVacuumHorizon().
>
> This has the added benefit of making it easier to replace uses of
> OldestXmin in heap_page_would_be_all_visible() in the future.
Seems reasonable. Can probably be pulled forward and just committed?
> From 85ab0d4eb681eaba4668ee23602d425c27f56d07 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <[email protected]>
> Date: Mon, 22 Dec 2025 10:46:45 -0500
> Subject: [PATCH v34 08/14] Remove table_scan_analyze_next_tuple unneeded
> parameter OldestXmin
>
> heapam_scan_analyze_next_tuple() doesn't distinguish between dead and
> recently dead tuples when counting them, so it doesn't need OldestXmin.
This part seems quite obviously the right thing to do, it's quite obviously
just wasted effort right now.
> Looking at other table AMs implementing table_scan_analyze_next_tuple(),
> it appears most do not use OldestXmin either.
Does "most" mean some do? Not sure removing a parameter that's unused by
heapam is worth breakage...
> From 8d350868206456f631883a40a955dff480e408d3 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <[email protected]>
> Date: Wed, 17 Dec 2025 16:51:05 -0500
> Subject: [PATCH v34 09/14] Use GlobalVisState in vacuum to determine page
> level visibility
>
> [...]
>
> Because comparing a transaction ID against GlobalVisState is more
> expensive than comparing against a single XID, we defer this check until
> after scanning all tuples on the page.
Curious, is this a precaution or was this a measurable bottleneck?
> * The visibility cutoff xid is the newest xmin of live, committed
> tuples
> - * older than OldestXmin on the page. This field is only kept up-to-date
> - * if the page is all-visible. As soon as a tuple is encountered that is
> - * not visible to all, this field is unmaintained. As long as it is
> - * maintained, it can be used to calculate the snapshot conflict horizon
> - * when updating the VM and/or freezing all the tuples on the page.
> + * on the page older than the visibility horizon represented in the
> + * GlobalVisState. This field is only kept up-to-date if the page is
> + * all-visible. It is invalid if there are any tuples on the page that
> are
> + * not visible to all. As long as it is maintained, it can be used to
> + * calculate the snapshot conflict horizon when updating the VM and/or
> + * freezing all the tuples on the page.
> */
> prstate->visibility_cutoff_xid = InvalidTransactionId;
> }
> @@ -1077,6 +1078,24 @@ heap_page_prune_and_freeze(PruneFreezeParams *params,
> prune_freeze_plan(RelationGetRelid(params->relation),
> buffer, &prstate, off_loc);
>
> + /*
> + * After processing all the live tuples on the page, if the newest xmin
> + * amongst them may be considered running by any snapshot, the page
> cannot
> + * be all-visible.
> + */
> + if (prstate.all_visible &&
> + TransactionIdIsNormal(prstate.visibility_cutoff_xid) &&
Any reason to test IsNormal rather than just IsValid()? There should never be
a reason it's a valid but not "normal" xid, right?
> @@ -1794,28 +1812,15 @@ heap_prune_record_unchanged_lp_normal(Page page,
> PruneState *prstate, OffsetNumb
> }
>
> /*
> - * The inserter definitely committed. But is
> it old enough
> - * that everyone sees it as committed? A
> FrozenTransactionId
> - * is seen as committed to everyone.
> Otherwise, we check if
> - * there is a snapshot that considers this xid
> to still be
> - * running, and if so, we don't consider the
> page all-visible.
> + * The inserter definitely committed. But we
> don't know if it
> + * is old enough that everyone sees it as
> committed. Later,
> + * after processing all the tuples on the page,
> we'll check if
> + * there is any snapshot that still considers
> the newest xid
> + * on the page to be running. If so, we don't
> consider the
> + * page all-visible.
> */
> xmin = HeapTupleHeaderGetXmin(htup);
>
> - /*
> - * For now always use prstate->cutoffs for this
> test, because
> - * we only update 'all_visible' and
> 'all_frozen' when freezing
> - * is requested. We could use
> GlobalVisTestIsRemovableXid
> - * instead, if a non-freezing caller wanted to
> set the VM bit.
> - */
> - Assert(prstate->cutoffs);
> - if (!TransactionIdPrecedes(xmin,
> prstate->cutoffs->OldestXmin))
> - {
> - prstate->all_visible = false;
> - prstate->all_frozen = false;
> - break;
> - }
> -
> /* Track newest xmin on page. */
> if (TransactionIdFollows(xmin,
> prstate->visibility_cutoff_xid) &&
> TransactionIdIsNormal(xmin))
Kinda wonder if this cod eshould be in something like
heap_prune_record_freezable() or such, rather than be inside
heap_prune_record_unchanged_lp_normal().
> Subject: [PATCH v34 10/14] Unset all_visible sooner if not freezing
>
> In the prune/freeze path, we currently delay clearing all_visible and
> all_frozen in the presence of dead items to allow opportunistic
> freezing.
>
> However, if no freezing will be attempted, there’s no need to delay.
> Clearing the flags earlier avoids extra bookkeeping in
> heap_prune_record_unchanged_lp_normal(). This currently has no runtime
> effect because all callers that consider setting the VM also prepare
> freeze plans, but upcoming changes will allow on-access pruning to set
> the VM without freezing. The extra bookkeeping was noticeable in a
> profile of on-access VM setting.
What workload was that?
Theoretically, even if we don't freeze, the page still may be all-visible or
all frozen after the removal of dead items, no? Practically that won't happen,
because we don't remove dead items in any of the relevant paths, but from the
commit message and comments that's not entirely clear.
> Subject: [PATCH v34 11/14] Track which relations are modified by a query
>
> Save the relids in a bitmap in the estate. A later commit will pass this
> information down to scan nodes to control whether or not the scan allows
> setting the visibility map while on-access pruning. We don't want to set
> the visibility map if the query is just going to modify the page
> immediately after.
>
> Reviewed-by: Chao Li <[email protected]>
> diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
> index f8053d9e572..1e3cd73cf27 100644
> --- a/src/include/nodes/execnodes.h
> +++ b/src/include/nodes/execnodes.h
> @@ -678,6 +678,12 @@ typedef struct EState
> *
> ExecDoInitialPruning() */
> const char *es_sourceText; /* Source text from QueryDesc */
>
> + /*
> + * RT indexes of relations modified by the query through a
> + * UPDATE/DELETE/INSERT/MERGE or targeted by a SELECT FOR UPDATE.
> + */
> + Bitmapset *es_modified_relids;
> +
Other EState fields are initialized in CreateExecutorState, this isn't afaict?
Wonder if it's worth adding a crosscheck somewhere, verifying that if a
relation is modified, it's in es_modified_relids. Otherwise this could very
well silently get out of date.
Also, there's some overlap between the informtion collected this way, and
AcquireExecutorLocks(), ScanQueryForLocks(), which determine the needed lock
modes via rte->rellockmode.
> From 8205b2d7da0c3ad3cbc5cead336ced677996b37d Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <[email protected]>
> Date: Wed, 3 Dec 2025 15:12:18 -0500
> Subject: [PATCH v34 12/14] Pass down information on table modification to scan
> node
>
> Pass down information to sequential scan, index [only] scan, and bitmap
> table scan nodes on whether or not the query modifies the relation being
> scanned. A later commit will use this information to update the VM
> during on-access pruning only if the relation is not modified by the
> query.
Perhaps worth splitting up, so the addition of the 0 flag is separate from the
the read only hint aspect.
Unfortunately ran out of time for the last two patches.
Greetings,
Andres Freund