Hi,
On 2026-03-17 10:48:55 -0400, Melanie Plageman wrote:
> @@ -277,6 +295,16 @@ heap_page_prune_opt(Relation relation, Buffer buffer, Buffer *vmbuffer)
> {
> OffsetNumber dummy_off_loc;
> PruneFreezeResult presult;
> + PruneFreezeParams params;
> +
> + visibilitymap_pin(relation, BufferGetBlockNumber(buffer), vmbuffer);
We also do a BufferGetBlockNumber(buffer) in prune_freeze_setup(). It irks me
a bit to do that twice, but I don't see a non-ugly way to avoid that.
> + params.relation = relation;
> + params.buffer = buffer;
> + params.vmbuffer = *vmbuffer;
> + params.reason = PRUNE_ON_ACCESS;
> + params.vistest = vistest;
> + params.cutoffs = NULL;
>
> * We don't pass the HEAP_PAGE_PRUNE_MARK_UNUSED_NOW option
> @@ -284,14 +312,7 @@ heap_page_prune_opt(Relation relation, Buffer buffer, Buffer *vmbuffer)
> * cannot safely determine that during on-access pruning with the
> * current implementation.
> */
> - PruneFreezeParams params = {
> - .relation = relation,
> - .buffer = buffer,
> - .reason = PRUNE_ON_ACCESS,
> - .options = 0,
> - .vistest = vistest,
> - .cutoffs = NULL,
> - };
> + params.options = 0;
>
> heap_page_prune_and_freeze(&params, &presult,
>                            &dummy_off_loc, NULL, NULL);
Why does this change the way the PruneFreezeParams variable is defined? I
don't really mind, it's just a bit confusing.
> +/*
> + * Helper to fix visibility-related corruption on a heap page and its
> + * corresponding VM page. An all-visible page cannot have dead items nor can
> + * it have tuples that are not visible to all running transactions. It clears
> + * the VM corruption as well as resetting the vmbits used during pruning.
So this is now only called when we already know there's corruption? I think
that could be clearer in the comments.
Seems a bit odd that the function then figures out what it should do from the
page & VM contents, given that the caller already needs to have known that
something is wrong?
> + * This function must be called while holding an exclusive lock on the heap
> + * buffer, and any dead items must have been discovered under that same lock.
> + * Although we do not hold a lock on the VM buffer, it is pinned, and the heap
> + * buffer is exclusively locked, ensuring that no other backend can update the
> + * VM bits corresponding to this heap page.
> + *
> + * This function makes changes to the VM and, potentially, the heap page, but
> + * it does not need to be done in a critical section.
> + */
> +static void
> +heap_fix_vm_corruption(PruneState *prstate, OffsetNumber offnum)
> +{
> + const char *relname = RelationGetRelationName(prstate->relation);
> +
> + Assert(BufferIsLockedByMeInMode(prstate->buffer, BUFFER_LOCK_EXCLUSIVE));
> +
> + if (PageIsAllVisible(prstate->page))
> + {
> + /*
> + * It's possible for the value returned by
> + * GetOldestNonRemovableTransactionId() to move backwards, so it's not
> + * wrong for us to see tuples that appear to not be visible to
> + * everyone yet, while PD_ALL_VISIBLE is already set. The real safe
> + * xmin value never moves backwards, but
> + * GetOldestNonRemovableTransactionId() is conservative and sometimes
> + * returns a value that's unnecessarily small, so if we see that
> + * contradiction it just means that the tuples that we think are not
> + * visible to everyone yet actually are, and the PD_ALL_VISIBLE flag
> + * is correct.
> + *
> + * However, there should never be LP_DEAD items, dead tuple versions,
> + * or tuples inserted by an in-progress transaction on a page with
> + * PD_ALL_VISIBLE set.
> + */
> + if (prstate->lpdead_items > 0)
> + {
> +     ereport(WARNING,
> +             (errcode(ERRCODE_DATA_CORRUPTED),
> +              errmsg("dead line pointer found on page marked all-visible"),
> +              errcontext("relation \"%s\", page %u, tuple %u",
> +                         relname, prstate->block, offnum)));
> + }
> + else
> + {
> +     ereport(WARNING,
> +             (errcode(ERRCODE_DATA_CORRUPTED),
> +              errmsg("tuple not visible to all transactions found on page marked all-visible"),
> +              errcontext("relation \"%s\", page %u, tuple %u",
> +                         relname, prstate->block, offnum)));
> + }
Wait, why are we now WARNING about the PageIsAllVisible() &&
prstate->lpdead_items == 0 case? Seems to run flatly counter to the comment
above about GetOldestNonRemovableTransactionId() going backward?
> + /*
> + * Mark the buffer dirty now in case we make no further changes and
> + * therefore would not mark it dirty later.
> + */
> + PageClearAllVisible(prstate->page);
> + MarkBufferDirtyHint(prstate->buffer, true);
> + }
> + else if (prstate->vmbits & VISIBILITYMAP_VALID_BITS)
> + {
> +     /*
> +      * As of PostgreSQL 9.2, the visibility map bit should never be set if
> +      * the page-level bit is clear. However, for vacuum, it's possible
> +      * that the bit got cleared after heap_vac_scan_next_block() was
> +      * called, so we must recheck now that we have the buffer lock before
> +      * concluding that the VM is corrupt.
> +      */
> +     ereport(WARNING,
> +             (errcode(ERRCODE_DATA_CORRUPTED),
> +              errmsg("page is not marked all-visible but visibility map bit is set"),
> +              errcontext("relation \"%s\", page %u",
> +                         relname, prstate->block)));
> + }
> +
> + visibilitymap_clear(prstate->relation, prstate->block, prstate->vmbuffer,
> +                     VISIBILITYMAP_VALID_BITS);
> + prstate->vmbits = 0;
So we can end up clearing the VM without emitting any warning?
> /*
> * Prune and repair fragmentation and potentially freeze tuples on the
> @@ -830,6 +941,10 @@ heap_page_prune_and_freeze(PruneFreezeParams *params,
> new_relfrozen_xid, new_relmin_mxid,
> presult, &prstate);
>
> + if ((prstate.vmbits & VISIBILITYMAP_VALID_BITS) &&
> + !PageIsAllVisible(prstate.page))
> + heap_fix_vm_corruption(&prstate, InvalidOffsetNumber);
> +
> /*
> * Examine all line pointers and tuple visibility information to determine
> * which line pointers should change state and which tuples may be frozen.
Feels like there should be an explanation here for why we are clearing the VM?
> From a503285e012de12539df384d615675c1e48e5cfd Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <[email protected]>
> Date: Wed, 25 Feb 2026 16:48:19 -0500
> Subject: [PATCH v40 02/12] Add pruning fast path for all-visible and
> all-frozen pages
>
> Because of the SKIP_PAGES_THRESHOLD optimization or a stale prune XID,
> heap_page_prune_and_freeze() can be invoked for pages with no pruning or
> freezing work. To avoid this, if a page is already all-frozen or it is
> all-visible and no freezing will be attempted, we exit early. We can't
> exit early if vacuum passed DISABLE_PAGE_SKIPPING, though.
>
> +static void
> +heap_page_bypass_prune_freeze(PruneState *prstate, PruneFreezeResult *presult)
> +{
> + OffsetNumber maxoff = PageGetMaxOffsetNumber(prstate->page);
> + Page page = prstate->page;
> +
> + Assert(prstate->vmbits & VISIBILITYMAP_ALL_FROZEN ||
> + (prstate->vmbits & VISIBILITYMAP_ALL_VISIBLE &&
> + !prstate->attempt_freeze));
> +
> + /* We'll fill in presult for the caller */
> + memset(presult, 0, sizeof(PruneFreezeResult));
> +
> + presult->vmbits = prstate->vmbits;
> +
> + /* Clear any stale prune hint */
> + if (TransactionIdIsValid(PageGetPruneXid(page)))
> + {
> + PageClearPrunable(page);
> + MarkBufferDirtyHint(prstate->buffer, true);
> + }
> +
> + if (PageIsEmpty(page))
> + return;
> +
> + presult->hastup = true;
Is that actually a given? Couldn't the page consist solely of unused
items? That'd make PageIsEmpty() return false, but should still allow
truncation.
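Something like scanning the line pointers would be needed to make hastup
accurate, I think (untested sketch, and I'm not sure the extra loop is worth
it in a bypass path):

    /* not PageIsEmpty(), but the page may still contain only unused items */
    presult->hastup = false;
    for (OffsetNumber offnum = FirstOffsetNumber;
         offnum <= maxoff;
         offnum = OffsetNumberNext(offnum))
    {
        ItemId      itemid = PageGetItemId(page, offnum);

        if (ItemIdIsUsed(itemid))
        {
            presult->hastup = true;
            break;
        }
    }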
> From 255fc9aeb721ba96ee3a7b7c3e675a4ee11087d6 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <[email protected]>
> Date: Wed, 17 Dec 2025 16:51:05 -0500
> Subject: [PATCH v40 03/12] Use GlobalVisState in vacuum to determine page
> level visibility
>
> During vacuum's first and third phases, we examine tuples' visibility
> to determine if we can set the page all-visible in the visibility map.
>
> Previously, this check compared tuple xmins against a single XID chosen at
> the start of vacuum (OldestXmin). We now use GlobalVisState, which also
> enables future work to set the VM during on-access pruning, since ordinary
> queries have access to GlobalVisState but not OldestXmin.
>
> This also benefits vacuum: in some cases, GlobalVisState may advance
> during a vacuum, allowing more pages to become considered all-visible.
> And, in the future, we could easily add a heuristic to update
> GlobalVisState more frequently during vacuums of large tables.
>
> OldestXmin is still used for freezing and as a backstop to ensure we
> don't freeze a dead tuple that wasn't yet prunable according to
> GlobalVisState in the rare occurrences where GlobalVisState moves
> backwards.
> Because comparing a transaction ID against GlobalVisState is more
> expensive than comparing against a single XID, we defer this check until
> after scanning all tuples on the page. Therefore, we perform the
> GlobalVisState check only once per page. This is safe because
> visibility_cutoff_xid records the newest live xmin on the page;
> if it is globally visible, then the entire page is all-visible.
>
> Using GlobalVisState means on-access pruning can also maintain
> visibility_cutoff_xid. This approach will result in examining more tuple
> xmins than before; however, the additional cost should not be
> significant. And doing so will enable us to set the visibility map on
> access in the future.
I wish there were a good way to trigger errors if visibility_cutoff_xid were
ever read after prstate->set_all_frozen is set to false... But I guess that'll
be moot in a few commits.
> diff --git a/src/backend/access/heap/pruneheap.c
> b/src/backend/access/heap/pruneheap.c
> index bf740c37f3d..c85e4172ee8 100644
> --- a/src/backend/access/heap/pruneheap.c
> +++ b/src/backend/access/heap/pruneheap.c
> @@ -1043,6 +1043,17 @@ heap_page_prune_and_freeze(PruneFreezeParams *params,
> */
> prune_freeze_plan(&prstate, off_loc);
>
> + /*
> + * After processing all the live tuples on the page, if the newest xmin
> + * amongst them may be considered running by any snapshot, the page cannot
> + * be all-visible.
> + */
> + if (prstate.set_all_visible &&
> +     TransactionIdIsNormal(prstate.visibility_cutoff_xid) &&
> +     GlobalVisTestXidMaybeRunning(prstate.vistest,
> +                                  prstate.visibility_cutoff_xid))
> +     prstate.set_all_visible = prstate.set_all_frozen = false;
> +
So the docs for prstate.visibility_cutoff_xid say:
* visibility_cutoff_xid is the newest xmin of live tuples on the page.
* The caller can use it as the conflict horizon, when setting the VM
 * bits. It is only valid if we froze some tuples, and set_all_frozen is
 * true.
But here we look at it without checking that we froze some tuples. I guess
the comment is outdated?
> @@ -3615,7 +3612,7 @@ heap_page_is_all_visible(Relation rel, Buffer buf,
> * Returns true if the page is all-visible other than the provided
> * deadoffsets and false otherwise.
> *
> - * OldestXmin is used to determine visibility.
> + * vistest is used to determine visibility.
> *
> * Output parameters:
> *
Could the "going backward" thing possibly trigger a spurious assert in
Assert(heap_page_is_all_visible(vacrel->rel, buf,
vacrel->vistest, &debug_all_frozen,
&debug_cutoff, &vacrel->offnum));
> From a1d768a8cea8ac13e250188ec96c01d98acda94a Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <[email protected]>
> Date: Sat, 28 Feb 2026 16:06:51 -0500
> Subject: [PATCH v40 04/12] Keep newest live XID up-to-date even if page not
> all-visible
I guess I'd have expected 03 and 04 to be swapped... But whatever.
> + * Currently, only VACUUM performs freezing, but other callers may in the
> + * future. Other callers must initialize prstate.set_all_frozen to false,
> + * since we will not call heap_prepare_freeze_tuple() for each tuple.
What does it mean that other callers need to "initialize
prstate.set_all_frozen to false"? It's not like they can do that, because
prstate is defined in heap_page_prune_and_freeze().
> + * We only consider opportunistic freezing if the page would become
> + * all-frozen, or if it would be all-frozen except for dead tuples that
> + * VACUUM will remove.
It kinda feels like "opportunistic freezing" is not defined at this point. It
wasn't super clear before either, but there was at least this:
- * In addition to telling the caller whether it can set the VM bit, we
- * also use 'set_all_visible' and 'set_all_frozen' for our own
- * decision-making. If the whole page would become frozen, we consider
- * opportunistically freezing tuples. We will not be able to freeze the
- * whole page if there are tuples present that are not visible to everyone
- * or if there are dead tuples which are not yet removable. However, dead
- * tuples which will be removed by the end of vacuuming should not
- * preclude us from opportunistically freezing. Because of that, we do
Which seems to provide a bit more explanation than "We only consider
opportunistic freezing"...
> From 05dfe8841e4a90dc595775863d58bacce996d70b Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <[email protected]>
> Date: Tue, 2 Dec 2025 15:07:42 -0500
> Subject: [PATCH v40 05/12] Eliminate XLOG_HEAP2_VISIBLE from vacuum phase I
> prune/freeze
"Eliminate" kinda makes me think this just removes WAL logging for
visibilitymap sets or such. Perhaps consider rephrasing it as something like
"WAL log setting VM as part of XLOG_HEAP2_PRUNE_*"
> This change applies only to vacuum phase I, not to pruning performed
> during normal page access.
Maybe + "For now this ..."
> @@ -215,7 +219,7 @@ static void page_verify_redirects(Page page);
>
> static bool heap_page_will_freeze(bool did_tuple_hint_fpi, bool do_prune,
>                                   bool do_hint_prune, PruneState *prstate);
> -
> +static bool heap_page_will_set_vm(PruneState *prstate, PruneReason reason);
>
> /*
> * Optionally prune and repair fragmentation in the specified page.
Previously there were two newlines between the declarations and code, now only
one. Intentional?
> +/*
> + * Decide whether to set the visibility map bits (all-visible and all-frozen)
> + * for heap_blk using information from the PruneState and VM.
> + *
> + * This function does not actually set the VM bits or page-level visibility
> + * hint, PD_ALL_VISIBLE.
> + *
> + * Returns true if one or both VM bits should be set and false otherwise.
> + */
> +static bool
> +heap_page_will_set_vm(PruneState *prstate, PruneReason reason)
> +{
> + /*
> + * Though on-access pruning maintains prstate->set_all_visible, we don't
> + * consider setting the VM.
> + */
> + if (reason == PRUNE_ON_ACCESS)
> + return false;
Nitpick^2: We kind of are considering based on this comment :). I'd just
s/consider setting/set/, maybe with a +for now.
> @@ -956,12 +996,10 @@ heap_page_bypass_prune_freeze(PruneState *prstate, PruneFreezeResult *presult)
> * tuples if it's required in order to advance relfrozenxid / relminmxid, or
> * if it's considered advantageous for overall system performance to do so
> * now. The 'params.cutoffs', 'presult', 'new_relfrozen_xid' and
> - * 'new_relmin_mxid' arguments are required when freezing. When
> - * HEAP_PAGE_PRUNE_FREEZE option is passed, we also set
> - * presult->set_all_visible and presult->set_all_frozen after determining
> - * whether or not to opportunistically freeze, to indicate if the VM bits can
> - * be set. 'all-frozen' is always set to false when the HEAP_PAGE_PRUNE_FREEZE
> - * option is not passed.
> + * 'new_relmin_mxid' arguments are required when freezing.
> + *
> + * A vmbuffer corresponding to the heap page is also passed and if the page is
> + * found to be all-visible/all-frozen, we will set it in the VM.
> *
> * presult contains output parameters needed by callers, such as the number of
> * tuples removed and the offsets of dead items on the page after pruning.
> @@ -989,15 +1027,17 @@ heap_page_prune_and_freeze(PruneFreezeParams *params,
> bool do_freeze;
> bool do_prune;
> bool do_hint_prune;
> + bool do_set_vm;
> bool did_tuple_hint_fpi;
> int64 fpi_before = pgWalUsage.wal_fpi;
> + TransactionId conflict_xid = InvalidTransactionId;
>
> /* Initialize prstate */
> prune_freeze_setup(params,
> new_relfrozen_xid, new_relmin_mxid,
> presult, &prstate);
>
> - if ((prstate.vmbits & VISIBILITYMAP_VALID_BITS) &&
> + if ((prstate.old_vmbits & VISIBILITYMAP_VALID_BITS) &&
> !PageIsAllVisible(prstate.page))
> heap_fix_vm_corruption(&prstate, InvalidOffsetNumber);
There are so many changes related to s/vmbits/old_vmbits/. How about naming it
old_vmbits from the start? That'll make this commit a lot less noisy.
> @@ -1076,6 +1116,30 @@ heap_page_prune_and_freeze(PruneFreezeParams *params,
> prstate.set_all_visible = prstate.set_all_frozen = false;
>
> Assert(!prstate.set_all_frozen || prstate.set_all_visible);
> + Assert(!prstate.set_all_visible || (prstate.lpdead_items == 0));
Why didn't we have this assert earlier?
> + do_set_vm = heap_page_will_set_vm(&prstate, params->reason);
Most of the other heap_page_prune_and_freeze() helpers are named
heap_prune_xyz(), why not follow that here?
I guess this holds for a few other helpers added in earlier commits
too. E.g. heap_page_bypass_prune_freeze() should probably be
heap_prune_bypass_prune_freeze() or such.
> + /*
> + * new_vmbits should be 0 regardless of whether or not the page is
> + * all-visible if we do not intend to set the VM.
> + */
> + Assert(do_set_vm || prstate.new_vmbits == 0);
> +
> + /*
> + * The snapshot conflict horizon for the whole record is the most
> + * conservative (newest) horizon required by any change in the record.
> + */
> + if (do_set_vm)
> + conflict_xid = prstate.newest_live_xid;
> + if (do_freeze && TransactionIdFollows(prstate.pagefrz.FreezePageConflictXid, conflict_xid))
> +     conflict_xid = prstate.pagefrz.FreezePageConflictXid;
> + if (do_prune && TransactionIdFollows(prstate.latest_xid_removed, conflict_xid))
> +     conflict_xid = prstate.latest_xid_removed;
I guess I'd personally move the initialization of conflict_xid to
InvalidTransactionId to just before the if, to make it clearer where we start
from if !do_set_vm.
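I.e. something like (just a sketch of the reordering, untested):

    /* the record's snapshot conflict horizon, computed just below */
    TransactionId conflict_xid = InvalidTransactionId;

    if (do_set_vm)
        conflict_xid = prstate.newest_live_xid;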
> @@ -1097,14 +1161,17 @@ heap_page_prune_and_freeze(PruneFreezeParams *params,
>
> /*
> * If that's all we had to do to the page, this is a non-WAL-logged
> - * hint. If we are going to freeze or prune the page, we will mark
> - * the buffer dirty below.
> + * hint. If we are going to freeze or prune the page or set
> + * PD_ALL_VISIBLE, we will mark the buffer dirty below.
> + *
> + * Setting PD_ALL_VISIBLE is fully WAL-logged because it is forbidden
> + * for the VM to be set and PD_ALL_VISIBLE to be clear.
> */
> - if (!do_freeze && !do_prune)
> + if (!do_freeze && !do_prune && !do_set_vm)
> MarkBufferDirtyHint(prstate.buffer, true);
> }
This block is gated by if (do_hint_prune) which is computed as:
    /*
     * Even if we don't prune anything, if we found a new value for the
     * pd_prune_xid field or the page was marked full, we will update the hint
     * bit.
     */
    do_hint_prune = PageGetPruneXid(prstate.page) != prstate.new_prune_xid ||
        PageIsFull(prstate.page);
It's not really related to this change, but I'm just confused a bit by the
"|| PageIsFull(prstate.page)". What is that about? Why do we want to mark the
buffer DirtyHint if the page is full? It very well might already have been
marked as such, no?
> - if (do_prune || do_freeze)
> + if (do_prune || do_freeze || do_set_vm)
> {
> /* Apply the planned item changes and repair page fragmentation. */
> if (do_prune)
> @@ -1118,6 +1185,27 @@ heap_page_prune_and_freeze(PruneFreezeParams *params,
> if (do_freeze)
> heap_freeze_prepared_tuples(prstate.buffer,
> prstate.frozen, prstate.nfrozen);
>
> + /* Set the visibility map and page visibility hint */
> + if (do_set_vm)
> + {
> + /*
> + * While it is valid for PD_ALL_VISIBLE to be set when the
> + * corresponding VM bit is clear, we strongly prefer to keep them
> + * in sync.
> + *
> + * The heap buffer must be marked dirty before adding it to the
> + * WAL chain when setting the VM. We don't worry about
> + * unnecessarily dirtying the heap buffer if PD_ALL_VISIBLE is
> + * already set, though. It is extremely rare to have a clean heap
> + * buffer with PD_ALL_VISIBLE already set and the VM bits clear,
> + * so there is no point in optimizing it.
> + */
> + PageSetAllVisible(prstate.page);
> + PageClearPrunable(prstate.page);
Idle thought, not to be acted on now: Eventually it could make sense to not do
PageClearPrunable() if we are not marking the page frozen, but instead replace
the prune xid with something triggering on-access pruning when freezing is
reasonable.
> + /*
> + * During its second pass over the heap, VACUUM calls
> + * heap_page_would_be_all_visible() to determine whether a page is
> + * all-visible and all-frozen. The logic here is similar. After completing
> + * pruning and freezing, use an assertion to verify that our results
> + * remain consistent with heap_page_would_be_all_visible().
> + */
> +#ifdef USE_ASSERT_CHECKING
> + if (prstate.set_all_visible)
> + {
> +     TransactionId debug_cutoff;
> +     bool debug_all_frozen;
> +
> +     Assert(prstate.lpdead_items == 0);
> +
> +     Assert(heap_page_is_all_visible(prstate.relation, prstate.buffer,
> +                                     prstate.vistest, &debug_all_frozen,
> +                                     &debug_cutoff, off_loc));
> +
> +     /*
> +      * It's possible the page is composed entirely of frozen tuples but is
> +      * not set all-frozen in the VM and did not pass
> +      * HEAP_PAGE_PRUNE_FREEZE. In this case, it's possible
> +      * heap_page_is_all_visible() finds the page completely frozen, even
> +      * though prstate.set_all_frozen is false.
> +      */
> +     Assert(!prstate.set_all_frozen || debug_all_frozen);
Seems like we could verify that debug_cutoff isn't newer than conflict_xid?
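Roughly (untested, and assuming conflict_xid has already been computed at
that point):

    Assert(!TransactionIdIsValid(debug_cutoff) ||
           TransactionIdPrecedesOrEquals(debug_cutoff, conflict_xid));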
> + }
> +#endif
Hm. I guess aborting after we did incorrect pruning/freezing/VMing is better
than not, but it'd be even better if we did it before corrupting things. But I
guess it'd not be trivial to add something like the debug_cutoff assertion I
suggest above, when freezing of tuples is only executed after
heap_page_is_all_visible() (for dead tuples heap_page_would_be_all_visible()
already has provisions).
It's probably more a theoretical concern than a real worry.
> + presult->new_all_visible_pages = 0;
> + presult->new_all_frozen_pages = 0;
> + presult->new_all_visible_frozen_pages = 0;
Isn't it odd to talk about pages here? Given that heap_page_prune_and_freeze()
only ever operates on exactly one page. Is that just so you can do
> + vacrel->new_all_visible_pages += presult.new_all_visible_pages;
etc?
> + if (do_set_vm)
> + {
> + if ((prstate.old_vmbits & VISIBILITYMAP_ALL_VISIBLE) == 0)
> + {
> + presult->new_all_visible_pages = 1;
> + if (prstate.set_all_frozen)
> + presult->new_all_visible_frozen_pages = 1;
> + }
> + else if ((prstate.old_vmbits & VISIBILITYMAP_ALL_FROZEN) == 0 &&
> + prstate.set_all_frozen)
> + presult->new_all_frozen_pages = 1;
> + }
> +
> if (prstate.attempt_freeze)
> {
> if (presult->nfrozen > 0)
Feels like this is kinda redoing what heap_page_will_set_vm already did.
> @@ -472,7 +458,13 @@ extern void log_heap_prune_and_freeze(Relation relation, Buffer buffer,
> /* in heap/vacuumlazy.c */
> extern void heap_vacuum_rel(Relation rel, const VacuumParams params,
>                             BufferAccessStrategy bstrategy);
> -
> +#ifdef USE_ASSERT_CHECKING
> +extern bool heap_page_is_all_visible(Relation rel, Buffer buf,
> +                                     GlobalVisState *vistest,
> +                                     bool *all_frozen,
> +                                     TransactionId *visibility_cutoff_xid,
> +                                     OffsetNumber *logging_offnum);
> +#endif
> /* in heap/heapam_visibility.c */
> extern bool HeapTupleSatisfiesVisibility(HeapTuple htup, Snapshot snapshot,
>                                          Buffer buffer);
I'd not remove the newline before "/* in heap/heapam_visibility.c */". Other
"sections" do have that newline before the "/* in $filename */" comment too.
> From c47a6270a0a0045347cdb4597b957798d21db4aa Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <[email protected]>
> Date: Sat, 27 Sep 2025 11:55:21 -0400
> Subject: [PATCH v40 06/12] Eliminate XLOG_HEAP2_VISIBLE from empty-page vacuum
Same comment about Eliminate as in the prior commit.
Perhaps worth mentioning more explicitly that this doesn't really have an
advantage other than getting rid of the last user of XLOG_HEAP2_VISIBLE?
> @@ -1923,13 +1926,33 @@ lazy_scan_new_or_empty(LVRelState *vacrel, Buffer buf, BlockNumber blkno,
> PageSetAllVisible(page);
> PageClearPrunable(page);
> - visibilitymap_set(vacrel->rel, blkno, buf,
> -                   InvalidXLogRecPtr,
> -                   vmbuffer, InvalidTransactionId,
> -                   VISIBILITYMAP_ALL_VISIBLE |
> -                   VISIBILITYMAP_ALL_FROZEN);
> + visibilitymap_set_vmbits(blkno, vmbuffer,
> +                          VISIBILITYMAP_ALL_VISIBLE |
> +                          VISIBILITYMAP_ALL_FROZEN,
> +                          vacrel->rel->rd_locator);
> +
> + /*
> +  * Emit WAL for setting PD_ALL_VISIBLE on the heap page and
> +  * setting the VM.
> +  */
> + if (RelationNeedsWAL(vacrel->rel))
> +     log_heap_prune_and_freeze(vacrel->rel, buf, vmbuffer,
> +                               VISIBILITYMAP_ALL_VISIBLE |
> +                               VISIBILITYMAP_ALL_FROZEN,
> +                               InvalidTransactionId, /* conflict xid */
> +                               false, /* cleanup lock */
> +                               PRUNE_VACUUM_SCAN, /* reason */
> +                               NULL, 0,
> +                               NULL, 0,
> +                               NULL, 0,
> +                               NULL, 0);
> +
> +
> END_CRIT_SECTION();
It's a tad odd that we do:
    /*
     * It's possible that another backend has extended the heap,
     * initialized the page, and then failed to WAL-log the page due
     * to an ERROR. Since heap extension is not WAL-logged, recovery
     * might try to replay our record setting the page all-visible and
     * find that the page isn't initialized, which will cause a PANIC.
     * To prevent that, check whether the page has been previously
     * WAL-logged, and if not, do that now.
     */
    if (RelationNeedsWAL(vacrel->rel) &&
        !XLogRecPtrIsValid(PageGetLSN(page)))
        log_newpage_buffer(buf, true);
if we then immediately afterwards emit a WAL record that could just as well
have included an FPI of the heap page.
> From 181c83f0652bfebe0db2f11983ad08b52c8c780b Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <[email protected]>
> Date: Sat, 27 Sep 2025 11:55:36 -0400
> Subject: [PATCH v40 07/12] Remove XLOG_HEAP2_VISIBLE entirely
>
> There are no remaining users that emit XLOG_HEAP2_VISIBLE records, so it
> can be removed. This includes deleting the xl_heap_visible struct and
> all functions responsible for emitting or replaying XLOG_HEAP2_VISIBLE
> records.
> This changes the visibility map API, so any external users/consumers of
> the VM-only WAL record will need to change.
I hope there aren't any. Not sure I can really see scenarios in which that'd
be a safe thing to do from an external user...
> diff --git a/src/include/access/heapam_xlog.h
> b/src/include/access/heapam_xlog.h
> index ce3566ba949..5eed567a8e5 100644
> --- a/src/include/access/heapam_xlog.h
> +++ b/src/include/access/heapam_xlog.h
> @@ -60,7 +60,6 @@
> #define XLOG_HEAP2_PRUNE_ON_ACCESS 0x10
> #define XLOG_HEAP2_PRUNE_VACUUM_SCAN 0x20
> #define XLOG_HEAP2_PRUNE_VACUUM_CLEANUP 0x30
> -#define XLOG_HEAP2_VISIBLE 0x40
> #define XLOG_HEAP2_MULTI_INSERT 0x50
> #define XLOG_HEAP2_LOCK_UPDATED 0x60
> #define XLOG_HEAP2_NEW_CID 0x70
> @@ -443,20 +442,6 @@ typedef struct xl_heap_inplace
I think other places with a gap in the "actions" mention that some value is
now unused.
> diff --git a/src/backend/access/common/bufmask.c
> b/src/backend/access/common/bufmask.c
> index 8a67bfa1aff..d9042e1f91d 100644
> --- a/src/backend/access/common/bufmask.c
> +++ b/src/backend/access/common/bufmask.c
> @@ -56,8 +56,8 @@ mask_page_hint_bits(Page page)
>
> /*
> * During replay, if the page LSN has advanced past our XLOG record's LSN,
> - * we don't mark the page all-visible. See heap_xlog_visible() for
> - * details.
> + * we don't mark the page all-visible. See heap_xlog_prune_freeze() for
> + * more details.
> */
> PageClearAllVisible(page);
> }
Not introduced by your change, but isn't it rather terrifying that the
wal_consistency_checking infrastructure doesn't verify whether the page is
marked all-visible? Wasn't aware of this. Seems bonkers to me.
I don't even know what specifically in heap_xlog_visible() that comment is
referring to? Just that we only do PageSetAllVisible() if BLK_NEEDS_REDO? But
uh, what does that have to do with anything?
> diff --git a/src/backend/access/heap/visibilitymap.c
> b/src/backend/access/heap/visibilitymap.c
> index e21b96281a6..f1da52b2069 100644
> --- a/src/backend/access/heap/visibilitymap.c
> +++ b/src/backend/access/heap/visibilitymap.c
> @@ -14,8 +14,7 @@
> * visibilitymap_clear - clear bits for one page in the
> visibility map
> * visibilitymap_pin - pin a map page for setting a bit
> * visibilitymap_pin_ok - check whether correct map page is
> already pinned
> - * visibilitymap_set - set bit(s) in a previously pinned
> page and log
> - * visibilitymap_set_vmbits - set bit(s) in a pinned page
> + * visibilitymap_set - set bit(s) in a previously pinned
> page
> * visibilitymap_get_status - get status of bits
> * visibilitymap_count - count number of bits set in visibility
> map
> * visibilitymap_prepare_truncate -
There's a comment saying:
* Clearing visibility map bits is not separately WAL-logged. The callers
* must make sure that whenever a bit is cleared, the bit is cleared on WAL
* replay of the updating operation as well.
Which kinda implies that setting the VM *is* separately WAL logged. But that's
not true anymore. Maybe rephrase that ever so slightly?
> @@ -477,12 +477,12 @@ ResolveRecoveryConflictWithSnapshot(TransactionId snapshotConflictHorizon,
> * If we get passed InvalidTransactionId then we do nothing (no conflict).
> *
> * This can happen when replaying already-applied WAL records after a
> - * standby crash or restart, or when replaying an XLOG_HEAP2_VISIBLE
> - * record that marks as frozen a page which was already all-visible. It's
> - * also quite common with records generated during index deletion
> - * (original execution of the deletion can reason that a recovery conflict
> - * which is sufficient for the deletion operation must take place before
> - * replay of the deletion record itself).
> + * standby crash or restart
Again not about your patch: I don't understand how already applied WAL can
lead to InvalidTransactionId being passed here. The record doesn't change just
because we had already applied the WAL?
> From 04b03c1ec3abcee75e464fef994b482df41b35f4 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <[email protected]>
> Date: Wed, 3 Dec 2025 15:07:24 -0500
> Subject: [PATCH v40 08/12] Track which relations are modified by a query
>
> Save the relids of modified relations in a bitmap in the executor state.
> A later commit will pass this information down to scan nodes to control
> whether or not on-access pruning is allowed to set the visibility map.
> Setting the visibility map during a scan is counterproductive if the
> query is going to modify the page immediately after.
>
> Relations are considered modified if they are the target of INSERT,
> UPDATE, DELETE, or MERGE, or if they have any row mark (including SELECT
> FOR UPDATE/SHARE). All row mark types are included, even those which
> don't actually modify tuples, because this bitmap is only used as a hint
> to avoid unnecessary work.
You're probably going to hate me for the question, but is there a reason to
not compute es_modified_relids at plan time?
> @@ -992,6 +996,10 @@ InitPlan(QueryDesc *queryDesc, int eflags)
> */
> planstate = ExecInitNode(plan, estate, eflags);
>
> +#ifdef USE_ASSERT_CHECKING
> + CrossCheckModifiedRelids(estate);
> +#endif
Not sure that buys you much, given it pretty much is just a restatement of the
code building estate->es_modified_relids.
What about checking against PlannedStmt->{resultRelations, permInfos} or
asserting membership at the places that actually lock/modify?
> @@ -3048,6 +3056,12 @@ EvalPlanQualStart(EPQState *epqstate, Plan *planTree)
>     rcestate->es_output_cid = parentestate->es_output_cid;
>     rcestate->es_queryEnv = parentestate->es_queryEnv;
>
> +   /*
> +    * Use a deep copy to avoid stale pointers since bms_add_member() may
> +    * reallocate the bitmap.
> +    */
> +   rcestate->es_modified_relids = bms_copy(parentestate->es_modified_relids);
> +
>     /*
>      * ResultRelInfos needed by subplans are initialized from scratch when the
>      * subplans themselves are initialized.
Hm. Why copy at all from the parent? Afaict we'll just redo the computation of
es_modified_relids from scratch anyway? Not sure about it though.
> From 05d736fb5b0600effede5e030d5b929274dabe2c Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <[email protected]>
> Date: Mon, 2 Mar 2026 16:31:17 -0500
> Subject: [PATCH v40 09/12] Thread flags through begin-scan APIs
>
> Add a flags parameter to the index_fetch_begin() table AM callback and
> the begin-scan helpers so the executor can pass context for building
> scan descriptors. This introduces an extension point for follow-up work
> to mark relations as read-only for the current query, without changing
> behavior in this patch.
> diff --git a/src/include/access/genam.h b/src/include/access/genam.h
> index 1a27bf060b3..db102803eb5 100644
> --- a/src/include/access/genam.h
> +++ b/src/include/access/genam.h
> @@ -158,7 +158,7 @@ extern IndexScanDesc index_beginscan(Relation heapRelation,
>                                      Relation indexRelation,
>                                      Snapshot snapshot,
>                                      IndexScanInstrumentation *instrument,
> -                                    int nkeys, int norderbys);
> +                                    int nkeys, int norderbys, uint32 flags);
I'd probably put flags in a position where it's not as easily confused with
nkeys or norderbys.
> diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
> index 4ce63990326..3820bbd7f9f 100644
> --- a/src/include/access/heapam.h
> +++ b/src/include/access/heapam.h
> @@ -96,8 +96,9 @@ typedef struct HeapScanDescData
> ParallelBlockTableScanWorkerData *rs_parallelworkerdata;
>
> /*
> - * For sequential scans and bitmap heap scans. The current heap block's
> - * corresponding page in the visibility map.
> + * For sequential scans, bitmap heap scans, TID range scans, and sample
> + * scans. The current heap block's corresponding page in the visibility
> + * map.
> */
> Buffer rs_vmbuffer;
As you can already see here, an exhaustive list of scan types like this is
unlikely to be kept up to date over time...
> @@ -1059,10 +1062,11 @@ table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableS
> static inline TableScanDesc
> table_beginscan_tidrange(Relation rel, Snapshot snapshot,
>                          ItemPointer mintid,
> -                        ItemPointer maxtid)
> +                        ItemPointer maxtid, uint32 flags)
> {
>     TableScanDesc sscan;
> -   uint32      flags = SO_TYPE_TIDRANGESCAN | SO_ALLOW_PAGEMODE;
> +
> +   flags |= SO_TYPE_TIDRANGESCAN | SO_ALLOW_PAGEMODE;
>
>     sscan = table_beginscan_common(rel, snapshot, 0, NULL, NULL, flags);
Hm. Would it perhaps be a good idea to have an assert as to which flags are
specified by the "user"? If e.g. another SO_TYPE_* were specified it might
result in some odd behaviour.
Perhaps this would be best done by adding an argument to
table_beginscan_common() separating the "internal" flags (i.e. the ones
specified inside table_beginscan_*) from the user-specified flags? Then
table_beginscan_common() could check that the set of user-specified flags is
sane.
> From 7790c8177ba3aa8a8bd1a216ea77fdfd42efc1bf Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <[email protected]>
> Date: Mon, 2 Mar 2026 16:31:33 -0500
> Subject: [PATCH v40 10/12] Pass down information on table modification to scan
> node
> diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
> index 3820bbd7f9f..1a7306e2935 100644
> --- a/src/include/access/heapam.h
> +++ b/src/include/access/heapam.h
> @@ -132,6 +132,12 @@ typedef struct IndexFetchHeapData
>
> /* Current heap block's corresponding page in the visibility map */
> Buffer xs_vmbuffer;
> +
> + /*
> + * Some optimizations can only be performed if the query does not modify
> + * the underlying relation. Track that here.
> + */
> + bool modifies_base_rel;
> } IndexFetchHeapData;
>
The other members are prefixed with xs_, I don't see a reason to diverge for
this one.
Wonder if this should be in the generic IndexFetchTableData?
> From 0a16dad7a4ebe224f35629a39619d0feb03f03a3 Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <[email protected]>
> Date: Fri, 27 Feb 2026 16:33:40 -0500
> Subject: [PATCH v40 11/12] Allow on-access pruning to set pages all-visible
>
> Many queries do not modify the underlying relation. For such queries, if
> on-access pruning occurs during the scan, we can check whether the page
> has become all-visible and update the visibility map accordingly.
> Previously, only vacuum and COPY FREEZE marked pages as all-visible or
> all-frozen.
> This commit implements on-access VM setting for sequential scans as well
> as for the underlying heap relation in index scans and bitmap heap
> scans.
I'd mention that this often can:
- avoid write amplification, since otherwise vacuum later has to
  PageSetAllVisible() (often triggering another data write and another FPI)
- allow index-only scans much earlier than before
I think those are pretty huge benefits, so they should be mentioned.
> diff --git a/src/test/recovery/t/035_standby_logical_decoding.pl b/src/test/recovery/t/035_standby_logical_decoding.pl
> index d264a698ff6..a5536ba4ff6 100644
> --- a/src/test/recovery/t/035_standby_logical_decoding.pl
> +++ b/src/test/recovery/t/035_standby_logical_decoding.pl
> @@ -296,6 +296,7 @@ wal_level = 'logical'
> max_replication_slots = 4
> max_wal_senders = 4
> autovacuum = off
> +hot_standby_feedback = on
> });
> $node_primary->dump_info;
> $node_primary->start;
> @@ -748,7 +749,7 @@ check_pg_recvlogical_stderr($handle,
> $logstart = -s $node_standby->logfile;
>
> reactive_slots_change_hfs_and_wait_for_xmins('shared_row_removal_',
> - 'no_conflict_', 0, 1);
> + 'no_conflict_', 1, 0);
>
> # This should not trigger a conflict
> wait_until_vacuum_can_remove(
> --
> 2.43.0
Why does this patch need to change anything here? Is the test buggy
independently?
> From e4c7112d49e650f59dab834d3db6007c69f34f1a Mon Sep 17 00:00:00 2001
> From: Melanie Plageman <[email protected]>
> Date: Tue, 29 Jul 2025 16:12:56 -0400
> Subject: [PATCH v40 12/12] Set pd_prune_xid on insert
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
>
> Now that visibility map (VM) updates can occur during read-only queries,
> it makes sense to also set the page’s pd_prune_xid hint during inserts
> and on the new page during updates.
>
> This enables heap_page_prune_and_freeze() to run and set the VM
> all-visible after a page is filled with newly inserted tuples the first
> time it is read.
>
> This change also addresses a long-standing note in heap_insert() and
> heap_multi_insert(), which observed that setting pd_prune_xid would
> help clean up aborted insertions sooner. Without it, such tuples might
> linger until VACUUM, whereas now they can be pruned earlier.
I think this commit message should also say more about the benefits of doing
this (i.e. good potential for reduced write amplification and for enabling
index-only scans earlier).
> The index killtuples test had to be updated to reflect a larger number
> of hits by some accesses. Since the prune_xid is set by the fill/insert
> step, on-access pruning can happen during the first access step (before
> the DELETE). This is when the VM is extended. After the DELETE, the next
> access hits the VM block instead of extending it. Thus, an additional
> buffer hit is counted for the table.
I think that may already have been solved by f5eb854ab6d in the meantime.
> --- a/src/backend/access/heap/heapam.c
> +++ b/src/backend/access/heap/heapam.c
> @@ -2156,6 +2156,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
>     TransactionId xid = GetCurrentTransactionId();
>     HeapTuple   heaptup;
>     Buffer      buffer;
> +   Page        page;
>     Buffer      vmbuffer = InvalidBuffer;
>     bool        all_visible_cleared = false;
>
> @@ -2182,6 +2183,8 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
>                                        &vmbuffer, NULL,
>                                        0);
>
> +   page = BufferGetPage(buffer);
> +
>     /*
>      * We're about to do the actual insert -- but check for conflict first, to
>      * avoid possibly having to roll back work we've just done.
> @@ -2205,25 +2208,30 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
>     RelationPutHeapTuple(relation, buffer, heaptup,
>                          (options & HEAP_INSERT_SPECULATIVE) != 0);
>
> -   if (PageIsAllVisible(BufferGetPage(buffer)))
> +   if (PageIsAllVisible(page))
>     {
>         all_visible_cleared = true;
> -       PageClearAllVisible(BufferGetPage(buffer));
> +       PageClearAllVisible(page);
>         visibilitymap_clear(relation,
>                             ItemPointerGetBlockNumber(&(heaptup->t_self)),
>                             vmbuffer,
>                             VISIBILITYMAP_VALID_BITS);
>     }
The repeated BufferGetPage()s have been bothering me, good :)
>     /*
> -    * XXX Should we set PageSetPrunable on this page ?
> +    * Set pd_prune_xid to trigger heap_page_prune_and_freeze() once the page
> +    * is full so that we can set the page all-visible in the VM on the next
> +    * page access.
>      *
> -    * The inserting transaction may eventually abort thus making this tuple
> -    * DEAD and hence available for pruning. Though we don't want to optimize
> -    * for aborts, if no other tuple in this page is UPDATEd/DELETEd, the
> -    * aborted tuple will never be pruned until next vacuum is triggered.
> +    * Setting pd_prune_xid is also handy if the inserting transaction
> +    * eventually aborts making this tuple DEAD and hence available for
> +    * pruning. If no other tuple in this page is UPDATEd/DELETEd, the aborted
> +    * tuple would never otherwise be pruned until next vacuum is triggered.
>      *
> -    * If you do add PageSetPrunable here, add it in heap_xlog_insert too.
> +    * Don't set it if we are in bootstrap mode or we are inserting a frozen
> +    * tuple.
>      */
> +   if (TransactionIdIsNormal(xid) && !(options & HEAP_INSERT_FROZEN))
> +       PageSetPrunable(page, xid);
>
>     MarkBufferDirty(buffer);
>
Perhaps add "as neither of those can be pruned anyway." or such to the last
sentence?
> @@ -1863,16 +1864,14 @@ heap_prune_record_unchanged_lp_normal(PruneState *prstate, OffsetNumber offnum)
>     prstate->set_all_visible = false;
>     prstate->set_all_frozen = false;
>
> -   /* The page should not be marked all-visible */
> -   if (PageIsAllVisible(page))
> -       heap_fix_vm_corruption(prstate, offnum);
> -
Huh?
Getting close, I think.
Greetings,
Andres Freund