Re: 64-bit XIDs in deleted nbtree pages

Peter Geoghegan Wed, 24 Feb 2021 20:42:56 -0800

On Mon, Feb 22, 2021 at 2:54 PM Peter Geoghegan <p...@bowt.ie> wrote:
> Good point. I think that the structure should make the page deletion
> triggering condition have only secondary importance -- it is only
> described at all to be complete and exhaustive. The
> vacuum_cleanup_index_scale_factor-related threshold is all that users
> will really care about in this area.


I pushed the main 64-bit XID commit just now. Thanks!

Attached is v6, with the two remaining patches. No real changes. Just
want to keep CFBot happy.

I would like to talk about vacuum_cleanup_index_scale_factor some
more. I didn't get very far with the vacuum_cleanup_index_scale_factor
documentation (I just removed the existing references to page
deletion). When I was working on the docs I suddenly wondered: is
vacuum_cleanup_index_scale_factor actually necessary? Can we not get
rid of it completely?

The amvacuumcleanup docs seems to suggest that that would be okay:

"It is OK to return NULL if the index was not changed at all during
the VACUUM operation, but otherwise correct stats should be returned."

Currently, _bt_vacuum_needs_cleanup() gets to decide whether or not
the index will change during VACUUM (assuming no deleted pages in the
case of Postgres 11 - 13, or assuming less than ~5% on Postgres 14).
So why even bother with the heap tuple stuff at all? Why not simply
remove the triggering logic that uses btm_last_cleanup_num_heap_tuples
+ vacuum_cleanup_index_scale_factor completely? We can rely on ANALYZE
to set pg_class.reltuples/pg_class.relpages instead. IIUC this is 100%
allowed by the amvacuumcleanup contract.

I think that the original design that made VACUUM set
pg_class.reltuples/pg_class.relpages in indexes (from 15+ years ago)
assumed that it was cheap to handle statistics in passing -- the
marginal cost was approximately zero, so why not just do it? It was
not because VACUUM thinks it is valuable or urgent, and yet
vacuum_cleanup_index_scale_factor seems to assume that it must.

Of course, it may actually be hard/expensive to update the statistics
due to the vacuum_cleanup_index_scale_factor stuff that was added to
Postgres 11. The autovacuum_vacuum_insert_threshold stuff that was
added to Postgres 13 also seems quite relevant. So I think that there
is an inconsistency here.

I can see one small problem with my plan of relying on ANALYZE to do
this: VACUUM ANALYZE trusts amvacuumcleanup/btvacuumcleanup (when
called by lazyvacuum.c) to set pg_class.reltuples/pg_class.relpages
within do_analyze_rel() -- even when amvacuumcleanup/btvacuumcleanup
returns NULL:

    /*
     * Same for indexes. Vacuum always scans all indexes, so if we're part of
     * VACUUM ANALYZE, don't overwrite the accurate count already inserted by
     * VACUUM.
     */
    if (!inh && !(params->options & VACOPT_VACUUM))
    {
        for (ind = 0; ind < nindexes; ind++)
        {
            AnlIndexData *thisdata = &indexdata[ind];
            double      totalindexrows;

            totalindexrows = ceil(thisdata->tupleFract * totalrows);
            vac_update_relstats(Irel[ind],
                                RelationGetNumberOfBlocks(Irel[ind]),
                                totalindexrows,
                                0,
                                false,
                                InvalidTransactionId,
                                InvalidMultiXactId,
                                in_outer_xact);
        }
    }

But this just seems like a very old bug to me. This bug can be fixed
separately by teaching VACUUM ANALYZE to recognize cases where indexes
did not have their stats updated in the way it expects.

BTW, note that btvacuumcleanup set pg_class.reltuples to 0 in all
cases following the deduplication commit until my bug fix commit
48e12913 (which was kind of a hack itself). This meant that the
statistics set by btvacuumcleanup (in the case where btbulkdelete
doesn't get called, the relevant case for
vacuum_cleanup_index_scale_factor). So it was 100% wrong for months
before anybody noticed (or at least until anybody complained).

Am I missing something here?

-- 
Peter Geoghegan

From 115439b7d68a273ea31231b2a68ed23221f05f36 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <p...@bowt.ie>
Date: Mon, 22 Feb 2021 21:40:50 -0800
Subject: [PATCH v6 2/2] Show "pages newly deleted" in VACUUM VERBOSE output.

Teach VACUUM VERBOSE to distinguish between pages that were deleted by
the current VACUUM operation and all deleted pages in the index (without
regard to when or how they became deleted).  The latter metric has been
output by VACUUM verbose for many years.  Showing both together seems
far more informative.

The new VACUUM VERBOSE field will be helpful to both PostgreSQL users
and PostgreSQL developers that want to understand when and how page
deletions are executed, and when and how free pages can actually be
recycled.
---
 src/include/access/genam.h            | 11 ++++++++---
 src/backend/access/gin/ginvacuum.c    |  1 +
 src/backend/access/gist/gistvacuum.c  | 19 ++++++++++++++++---
 src/backend/access/heap/vacuumlazy.c  |  4 +++-
 src/backend/access/nbtree/nbtpage.c   |  4 ++++
 src/backend/access/nbtree/nbtree.c    | 17 +++++++++++------
 src/backend/access/spgist/spgvacuum.c |  1 +
 7 files changed, 44 insertions(+), 13 deletions(-)

diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index ffa1a4c80d..13971c8b2a 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -63,8 +63,12 @@ typedef struct IndexVacuumInfo
  * of which this is just the first field; this provides a way for ambulkdelete
  * to communicate additional private data to amvacuumcleanup.
  *
- * Note: pages_deleted and pages_free refer to free space within the index
- * file.  Some index AMs may compute num_index_tuples by reference to
+ * Note: pages_newly_deleted is the number of pages in the index that were
+ * deleted by the current vacuum operation.  pages_deleted and pages_free
+ * refer to free space within the index file (and so pages_deleted must be >=
+ * pages_newly_deleted).
+ *
+ * Note: Some index AMs may compute num_index_tuples by reference to
  * num_heap_tuples, in which case they should copy the estimated_count field
  * from IndexVacuumInfo.
  */
@@ -74,7 +78,8 @@ typedef struct IndexBulkDeleteResult
 	bool		estimated_count;	/* num_index_tuples is an estimate */
 	double		num_index_tuples;	/* tuples remaining */
 	double		tuples_removed; /* # removed during vacuum operation */
-	BlockNumber pages_deleted;	/* # unused pages in index */
+	BlockNumber pages_newly_deleted;	/* # pages marked deleted by us  */
+	BlockNumber pages_deleted;	/* # pages marked deleted (could be by us) */
 	BlockNumber pages_free;		/* # pages available for reuse */
 } IndexBulkDeleteResult;
 
diff --git a/src/backend/access/gin/ginvacuum.c b/src/backend/access/gin/ginvacuum.c
index a0453b36cd..a276eb020b 100644
--- a/src/backend/access/gin/ginvacuum.c
+++ b/src/backend/access/gin/ginvacuum.c
@@ -231,6 +231,7 @@ ginDeletePage(GinVacuumState *gvs, BlockNumber deleteBlkno, BlockNumber leftBlkn
 
 	END_CRIT_SECTION();
 
+	gvs->result->pages_newly_deleted++;
 	gvs->result->pages_deleted++;
 }
 
diff --git a/src/backend/access/gist/gistvacuum.c b/src/backend/access/gist/gistvacuum.c
index ddecb8ab18..0663193531 100644
--- a/src/backend/access/gist/gistvacuum.c
+++ b/src/backend/access/gist/gistvacuum.c
@@ -133,9 +133,21 @@ gistvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	MemoryContext oldctx;
 
 	/*
-	 * Reset counts that will be incremented during the scan; needed in case
-	 * of multiple scans during a single VACUUM command.
+	 * Reset fields that track information about the entire index now.  This
+	 * avoids double-counting in the case where a single VACUUM command
+	 * requires multiple scans of the index.
+	 *
+	 * Avoid resetting the tuples_removed and pages_newly_deleted fields here,
+	 * since they track information about the VACUUM command, and so must last
+	 * across each call to gistvacuumscan().
+	 *
+	 * (Note that pages_free is treated as state about the whole index, not
+	 * the current VACUUM.  This is appropriate because RecordFreeIndexPage()
+	 * calls are idempotent, and get repeated for the same deleted pages in
+	 * some scenarios.  The point for us is to track the number of recyclable
+	 * pages in the index at the end of the VACUUM command.)
 	 */
+	stats->num_pages = 0;
 	stats->estimated_count = false;
 	stats->num_index_tuples = 0;
 	stats->pages_deleted = 0;
@@ -281,8 +293,8 @@ restart:
 	{
 		/* Okay to recycle this page */
 		RecordFreeIndexPage(rel, blkno);
-		vstate->stats->pages_free++;
 		vstate->stats->pages_deleted++;
+		vstate->stats->pages_free++;
 	}
 	else if (GistPageIsDeleted(page))
 	{
@@ -636,6 +648,7 @@ gistdeletepage(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	/* mark the page as deleted */
 	MarkBufferDirty(leafBuffer);
 	GistPageSetDeleted(leafPage, txid);
+	stats->pages_newly_deleted++;
 	stats->pages_deleted++;
 
 	/* remove the downlink from the parent */
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 0bb78162f5..d8f847b0e6 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -2521,9 +2521,11 @@ lazy_cleanup_index(Relation indrel,
 						(*stats)->num_index_tuples,
 						(*stats)->num_pages),
 				 errdetail("%.0f index row versions were removed.\n"
-						   "%u index pages have been deleted, %u are currently reusable.\n"
+						   "%u index pages were newly deleted.\n"
+						   "%u index pages are currently deleted, of which %u are currently reusable.\n"
 						   "%s.",
 						   (*stats)->tuples_removed,
+						   (*stats)->pages_newly_deleted,
 						   (*stats)->pages_deleted, (*stats)->pages_free,
 						   pg_rusage_show(&ru0))));
 	}
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 7cf9332be2..9d7d0186d0 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -2675,11 +2675,15 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 		_bt_relbuf(rel, buf);
 
 	/*
+	 * Maintain pages_newly_deleted, which is simply the number of pages
+	 * deleted by the ongoing VACUUM operation.
+	 *
 	 * Maintain pages_deleted in a way that takes into account how
 	 * btvacuumpage() will count deleted pages that have yet to become
 	 * scanblkno -- only count page when it's not going to get that treatment
 	 * later on.
 	 */
+	stats->pages_newly_deleted++;
 	if (target <= scanblkno)
 		stats->pages_deleted++;
 
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index b25c8c5d5b..c7b05df2df 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -876,6 +876,9 @@ _bt_newly_deleted_pages_recycle(Relation rel, BTVacState *vstate)
 	IndexBulkDeleteResult *stats = vstate->stats;
 	Relation	heapRel;
 
+	Assert(vstate->ndeleted > 0);
+	Assert(stats->pages_newly_deleted >= vstate->ndeleted);
+
 	/*
 	 * Recompute VACUUM XID boundaries.
 	 *
@@ -1078,9 +1081,9 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 * avoids double-counting in the case where a single VACUUM command
 	 * requires multiple scans of the index.
 	 *
-	 * Avoid resetting the tuples_removed field here, since it tracks
-	 * information about the VACUUM command, and so must last across each call
-	 * to btvacuumscan().
+	 * Avoid resetting the tuples_removed and pages_newly_deleted fields here,
+	 * since they track information about the VACUUM command, and so must last
+	 * across each call to btvacuumscan().
 	 *
 	 * (Note that pages_free is treated as state about the whole index, not
 	 * the current VACUUM.  This is appropriate because RecordFreeIndexPage()
@@ -1321,8 +1324,8 @@ backtrack:
 	else if (P_ISHALFDEAD(opaque))
 	{
 		/*
-		 * Half-dead leaf page.  Try to delete now.  Might update
-		 * pages_deleted below.
+		 * Half-dead leaf page.  Try to delete now.  Might end up incrementing
+		 * pages_newly_deleted/pages_deleted inside _bt_pagedel.
 		 */
 		attempt_pagedel = true;
 	}
@@ -1534,7 +1537,9 @@ backtrack:
 		oldcontext = MemoryContextSwitchTo(vstate->pagedelcontext);
 
 		/*
-		 * _bt_pagedel maintains the bulk delete stats on our behalf
+		 * _bt_pagedel maintains the bulk delete stats on our behalf;
+		 * pages_newly_deleted and pages_deleted are likely to be incremented
+		 * during call
 		 */
 		Assert(blkno == scanblkno);
 		_bt_pagedel(rel, buf, vstate);
diff --git a/src/backend/access/spgist/spgvacuum.c b/src/backend/access/spgist/spgvacuum.c
index 0d02a02222..a9ffca5183 100644
--- a/src/backend/access/spgist/spgvacuum.c
+++ b/src/backend/access/spgist/spgvacuum.c
@@ -891,6 +891,7 @@ spgvacuumscan(spgBulkDeleteState *bds)
 
 	/* Report final stats */
 	bds->stats->num_pages = num_pages;
+	bds->stats->pages_newly_deleted = bds->stats->pages_deleted;
 	bds->stats->pages_free = bds->stats->pages_deleted;
 }
 
-- 
2.27.0

From 3f3a1b3d856c28d37d3d61ed7460943de03f8199 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <p...@bowt.ie>
Date: Mon, 22 Feb 2021 21:40:50 -0800
Subject: [PATCH v6 1/2] Recycle pages deleted during same VACUUM.

Author: Peter Geoghegan <p...@bowt.ie>
Discussion: https://postgr.es/m/CAH2-Wzk76_P=67iuscb1un44-gyzl-kgpsxbsxq_bdcma7q...@mail.gmail.com
---
 src/include/access/nbtree.h         |  38 +++++++++-
 src/backend/access/nbtree/README    |  31 ++++++++
 src/backend/access/nbtree/nbtpage.c |  90 ++++++++++++++++------
 src/backend/access/nbtree/nbtree.c  | 111 ++++++++++++++++++++++++----
 src/backend/access/nbtree/nbtxlog.c |  22 ++++++
 5 files changed, 251 insertions(+), 41 deletions(-)

diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 9ac90d7439..736d69b304 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -279,7 +279,8 @@ BTPageGetDeleteXid(Page page)
  * Is an existing page recyclable?
  *
  * This exists to centralize the policy on which deleted pages are now safe to
- * re-use.
+ * re-use.  The _bt_newly_deleted_pages_recycle() optimization behaves more
+ * aggressively, though that has certain known limitations.
  *
  * Note: PageIsNew() pages are always safe to recycle, but we can't deal with
  * them here (caller is responsible for that case themselves).  Caller might
@@ -312,6 +313,39 @@ BTPageIsRecyclable(Page page)
 	return false;
 }
 
+/*
+ * BTVacState is nbtree.c state used during VACUUM.  It is exported for use by
+ * page deletion related code in nbtpage.c.
+ */
+typedef struct BTPendingRecycle
+{
+	BlockNumber blkno;
+	FullTransactionId safexid;
+} BTPendingRecycle;
+
+typedef struct BTVacState
+{
+	/*
+	 * VACUUM operation state
+	 */
+	IndexVacuumInfo *info;
+	IndexBulkDeleteResult *stats;
+	IndexBulkDeleteCallback callback;
+	void	   *callback_state;
+	BTCycleId	cycleid;
+
+	/*
+	 * Page deletion state for VACUUM
+	 */
+	MemoryContext pagedelcontext;
+	BTPendingRecycle *deleted;
+	bool		grow;
+	bool		full;
+	uint32		ndeletedspace;
+	uint64		maxndeletedspace;
+	uint32		ndeleted;
+} BTVacState;
+
 /*
  *	Lehman and Yao's algorithm requires a ``high key'' on every non-rightmost
  *	page.  The high key is not a tuple that is used to visit the heap.  It is
@@ -1181,7 +1215,7 @@ extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
 extern void _bt_delitems_delete_check(Relation rel, Buffer buf,
 									  Relation heapRel,
 									  TM_IndexDeleteOp *delstate);
-extern uint32 _bt_pagedel(Relation rel, Buffer leafbuf);
+extern void _bt_pagedel(Relation rel, Buffer leafbuf, BTVacState *vstate);
 
 /*
  * prototypes for functions in nbtsearch.c
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 46d49bf025..265814ea46 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -430,6 +430,37 @@ whenever it is subsequently taken from the FSM for reuse.  The deleted
 page's contents will be overwritten by the split operation (it will become
 the new right sibling page).
 
+Prior to PostgreSQL 14, VACUUM was only able to recycle pages that were
+deleted by a previous VACUUM operation (VACUUM typically placed all pages
+deleted by the last VACUUM into the FSM, though there were and are no
+certainties here).  This had the obvious disadvantage of creating
+uncertainty about when and how pages get recycled, especially with bursty
+workloads.  It was naive, even within the constraints of the design, since
+there is no reason to think that it will take long for a deleted page to
+become recyclable.  It's convenient to use XIDs to implement the drain
+technique, but that is totally unrelated to any of the other things that
+VACUUM needs to do with XIDs.
+
+VACUUM operations now consider if it's possible to recycle any pages that
+the same operation deleted after the physical scan of the index, the last
+point it's convenient to do one last check.  This changes nothing about
+the basic design, and so it might still not be possible to recycle any
+pages at that time (e.g., there might not even be one single new
+transactions after an index page deletion, but before VACUUM ends).  But
+we have little to lose and plenty to gain by trying.  We only need to keep
+around a little information about recently deleted pages in local memory.
+We don't even have to access the deleted pages a second time.
+
+Currently VACUUM delays considering the possibility of recycling its own
+recently deleted page until the end of its btbulkdelete scan (or until the
+end of btvacuumcleanup in cases where there were no tuples to delete in
+the index).  It would be slightly more effective if btbulkdelete page
+deletions were deferred until btvacuumcleanup, simply because more time
+will have passed.  Our current approach works well enough in practice,
+especially in cases where it really matters: cases where we're vacuuming a
+large index, where recycling pages sooner rather than later is
+particularly likely to matter.
+
 Fastpath For Index Insertion
 ----------------------------
 
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index a43805a7b0..7cf9332be2 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -50,7 +50,7 @@ static bool _bt_mark_page_halfdead(Relation rel, Buffer leafbuf,
 static bool _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf,
 									 BlockNumber scanblkno,
 									 bool *rightsib_empty,
-									 uint32 *ndeleted);
+									 BTVacState *vstate);
 static bool _bt_lock_subtree_parent(Relation rel, BlockNumber child,
 									BTStack stack,
 									Buffer *subtreeparent,
@@ -1760,20 +1760,22 @@ _bt_rightsib_halfdeadflag(Relation rel, BlockNumber leafrightsib)
  * should never pass a buffer containing an existing deleted page here.  The
  * lock and pin on caller's buffer will be dropped before we return.
  *
- * Returns the number of pages successfully deleted (zero if page cannot
- * be deleted now; could be more than one if parent or right sibling pages
- * were deleted too).  Note that this does not include pages that we delete
- * that the btvacuumscan scan has yet to reach; they'll get counted later
- * instead.
+ * Maintains bulk delete stats for caller, which are taken from vstate.  We
+ * need to cooperate closely with caller here so that whole VACUUM operation
+ * reliably avoids any double counting of subsidiary-to-leafbuf pages that we
+ * delete in passing.  If such pages happen to be from a block number that is
+ * ahead of the current scanblkno position, then caller is expected to count
+ * them directly later on.  It's simpler for us to understand caller's
+ * requirements than it would be for caller to understand when or how a
+ * deleted page became deleted after the fact.
  *
  * NOTE: this leaks memory.  Rather than trying to clean up everything
  * carefully, it's better to run it in a temp context that can be reset
  * frequently.
  */
-uint32
-_bt_pagedel(Relation rel, Buffer leafbuf)
+void
+_bt_pagedel(Relation rel, Buffer leafbuf, BTVacState *vstate)
 {
-	uint32		ndeleted = 0;
 	BlockNumber rightsib;
 	bool		rightsib_empty;
 	Page		page;
@@ -1781,7 +1783,8 @@ _bt_pagedel(Relation rel, Buffer leafbuf)
 
 	/*
 	 * Save original leafbuf block number from caller.  Only deleted blocks
-	 * that are <= scanblkno get counted in ndeleted return value.
+	 * that are <= scanblkno are added to bulk delete stat's pages_deleted
+	 * count.
 	 */
 	BlockNumber scanblkno = BufferGetBlockNumber(leafbuf);
 
@@ -1843,7 +1846,7 @@ _bt_pagedel(Relation rel, Buffer leafbuf)
 										 RelationGetRelationName(rel))));
 
 			_bt_relbuf(rel, leafbuf);
-			return ndeleted;
+			return;
 		}
 
 		/*
@@ -1873,7 +1876,7 @@ _bt_pagedel(Relation rel, Buffer leafbuf)
 			Assert(!P_ISHALFDEAD(opaque));
 
 			_bt_relbuf(rel, leafbuf);
-			return ndeleted;
+			return;
 		}
 
 		/*
@@ -1922,8 +1925,7 @@ _bt_pagedel(Relation rel, Buffer leafbuf)
 				if (_bt_leftsib_splitflag(rel, leftsib, leafblkno))
 				{
 					ReleaseBuffer(leafbuf);
-					Assert(ndeleted == 0);
-					return ndeleted;
+					return;
 				}
 
 				/* we need an insertion scan key for the search, so build one */
@@ -1964,7 +1966,7 @@ _bt_pagedel(Relation rel, Buffer leafbuf)
 			if (!_bt_mark_page_halfdead(rel, leafbuf, stack))
 			{
 				_bt_relbuf(rel, leafbuf);
-				return ndeleted;
+				return;
 			}
 		}
 
@@ -1979,7 +1981,7 @@ _bt_pagedel(Relation rel, Buffer leafbuf)
 		{
 			/* Check for interrupts in _bt_unlink_halfdead_page */
 			if (!_bt_unlink_halfdead_page(rel, leafbuf, scanblkno,
-										  &rightsib_empty, &ndeleted))
+										  &rightsib_empty, vstate))
 			{
 				/*
 				 * _bt_unlink_halfdead_page should never fail, since we
@@ -1990,7 +1992,7 @@ _bt_pagedel(Relation rel, Buffer leafbuf)
 				 * lock and pin on leafbuf for us.
 				 */
 				Assert(false);
-				return ndeleted;
+				return;
 			}
 		}
 
@@ -2026,8 +2028,6 @@ _bt_pagedel(Relation rel, Buffer leafbuf)
 
 		leafbuf = _bt_getbuf(rel, rightsib, BT_WRITE);
 	}
-
-	return ndeleted;
 }
 
 /*
@@ -2262,9 +2262,10 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
  */
 static bool
 _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
-						 bool *rightsib_empty, uint32 *ndeleted)
+						 bool *rightsib_empty, BTVacState *vstate)
 {
 	BlockNumber leafblkno = BufferGetBlockNumber(leafbuf);
+	IndexBulkDeleteResult *stats = vstate->stats;
 	BlockNumber leafleftsib;
 	BlockNumber leafrightsib;
 	BlockNumber target;
@@ -2674,12 +2675,53 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 		_bt_relbuf(rel, buf);
 
 	/*
-	 * If btvacuumscan won't revisit this page in a future btvacuumpage call
-	 * and count it as deleted then, we count it as deleted by current
-	 * btvacuumpage call
+	 * Maintain pages_deleted in a way that takes into account how
+	 * btvacuumpage() will count deleted pages that have yet to become
+	 * scanblkno -- only count page when it's not going to get that treatment
+	 * later on.
 	 */
 	if (target <= scanblkno)
-		(*ndeleted)++;
+		stats->pages_deleted++;
+
+	/*
+	 * Maintain array of pages that were deleted during current btvacuumscan()
+	 * call.  We may well be able to recycle them in a separate pass at the
+	 * end of the current btvacuumscan().
+	 *
+	 * Need to respect work_mem/maxndeletedspace limitation on size of deleted
+	 * array.  Our strategy when the array can no longer grow within the
+	 * bounds of work_mem is simple: keep earlier entries (which are likelier
+	 * to be recyclable in the end), but stop saving new entries.
+	 */
+	if (vstate->full)
+		return true;
+
+	if (vstate->ndeleted >= vstate->ndeletedspace)
+	{
+		uint64 newndeletedspace;
+
+		if (!vstate->grow)
+		{
+			vstate->full = true;
+			return true;
+		}
+
+		newndeletedspace = vstate->ndeletedspace * 2;
+		if (newndeletedspace > vstate->maxndeletedspace)
+		{
+			newndeletedspace = vstate->maxndeletedspace;
+			vstate->grow = false;
+		}
+		vstate->ndeletedspace = newndeletedspace;
+
+		vstate->deleted =
+			repalloc(vstate->deleted,
+					 sizeof(BTPendingRecycle) * vstate->ndeletedspace);
+	}
+
+	vstate->deleted[vstate->ndeleted].blkno = target;
+	vstate->deleted[vstate->ndeleted].safexid = safexid;
+	vstate->ndeleted++;
 
 	return true;
 }
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 3b2e0aa5cb..b25c8c5d5b 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -21,7 +21,9 @@
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
 #include "access/relscan.h"
+#include "access/table.h"
 #include "access/xlog.h"
+#include "catalog/index.h"
 #include "commands/progress.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -32,23 +34,13 @@
 #include "storage/indexfsm.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
+#include "storage/procarray.h"
 #include "storage/smgr.h"
 #include "utils/builtins.h"
 #include "utils/index_selfuncs.h"
 #include "utils/memutils.h"
 
 
-/* Working state needed by btvacuumpage */
-typedef struct
-{
-	IndexVacuumInfo *info;
-	IndexBulkDeleteResult *stats;
-	IndexBulkDeleteCallback callback;
-	void	   *callback_state;
-	BTCycleId	cycleid;
-	MemoryContext pagedelcontext;
-} BTVacState;
-
 /*
  * BTPARALLEL_NOT_INITIALIZED indicates that the scan has not started.
  *
@@ -871,6 +863,68 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
 	return false;
 }
 
+/*
+ * _bt_newly_deleted_pages_recycle() -- Are _bt_pagedel pages recyclable now?
+ *
+ * Note that we assume that the array is ordered by safexid.  No further
+ * entries can be safe to recycle once we encounter the first non-recyclable
+ * entry in the deleted array.
+ */
+static inline void
+_bt_newly_deleted_pages_recycle(Relation rel, BTVacState *vstate)
+{
+	IndexBulkDeleteResult *stats = vstate->stats;
+	Relation	heapRel;
+
+	/*
+	 * Recompute VACUUM XID boundaries.
+	 *
+	 * We don't actually care about the oldest non-removable XID.  Computing
+	 * the oldest such XID has a useful side-effect: It updates the procarray
+	 * state that tracks XID horizon.  This is not just an optimization; it's
+	 * essential.  It allows the GlobalVisCheckRemovableFullXid() calls we
+	 * make here to notice if and when safexid values from pages this same
+	 * VACUUM operation deleted are sufficiently old to allow recycling to
+	 * take place safely.
+	 */
+	GetOldestNonRemovableTransactionId(NULL);
+
+	/*
+	 * Use the heap relation for GlobalVisCheckRemovableFullXid() calls (don't
+	 * pass NULL rel argument).
+	 *
+	 * This is an optimization; it allows us to be much more aggressive in
+	 * cases involving logical decoding (unless this happens to be a system
+	 * catalog).  We don't simply use BTPageIsRecyclable().
+	 *
+	 * XXX: The BTPageIsRecyclable() criteria creates problems for this
+	 * optimization.  Its safexid test is applied in a redundant manner within
+	 * _bt_getbuf() (via its BTPageIsRecyclable() call).  Consequently,
+	 * _bt_getbuf() may believe that it is still unsafe to recycle a page that
+	 * we know to be recycle safe -- in which case it is unnecessarily
+	 * discarded.
+	 *
+	 * We should get around to fixing this _bt_getbuf() issue some day.  For
+	 * now we can still proceed in the hopes that BTPageIsRecyclable() will
+	 * catch up with us before _bt_getbuf() ever reaches the page.
+	 */
+	heapRel = table_open(IndexGetRelation(RelationGetRelid(rel), false),
+						 AccessShareLock);
+	for (int i = 0; i < vstate->ndeleted; i++)
+	{
+		BlockNumber blkno = vstate->deleted[i].blkno;
+		FullTransactionId safexid = vstate->deleted[i].safexid;
+
+		if (!GlobalVisCheckRemovableFullXid(heapRel, safexid))
+			break;
+
+		RecordFreeIndexPage(rel, blkno);
+		stats->pages_free++;
+	}
+
+	table_close(heapRel, AccessShareLock);
+}
+
 /*
  * Bulk deletion of all index entries pointing to a set of heap tuples.
  * The set of target tuples is specified via a callback routine that tells
@@ -956,6 +1010,14 @@ btvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 	 * _bt_vacuum_needs_cleanup() to force the next VACUUM to proceed with a
 	 * btvacuumscan() call.
 	 *
+	 * Note: Prior to PostgreSQL 14, we were completely reliant on the next
+	 * VACUUM operation taking care of recycling whatever pages the current
+	 * VACUUM operation found to be empty and then deleted.  It is now usually
+	 * possible for _bt_newly_deleted_pages_recycle() to recycle all of the
+	 * pages that any given VACUUM operation deletes, as part of the same
+	 * VACUUM operation.  As a result, it is rare for num_delpages to actually
+	 * exceed 0, including with indexes where page deletions are frequent.
+	 *
 	 * Note: We must delay the _bt_set_cleanup_info() call until this late
 	 * stage of VACUUM (the btvacuumcleanup() phase), to keep num_heap_tuples
 	 * accurate.  The btbulkdelete()-time num_heap_tuples value is generally
@@ -1044,6 +1106,16 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 												  "_bt_pagedel",
 												  ALLOCSET_DEFAULT_SIZES);
 
+	/* Allocate _bt_newly_deleted_pages_recycle related information */
+	vstate.ndeletedspace = 512;
+	vstate.grow = true;
+	vstate.full = false;
+	vstate.maxndeletedspace = ((work_mem * 1024L) / sizeof(BTPendingRecycle));
+	vstate.maxndeletedspace = Min(vstate.maxndeletedspace, MaxBlockNumber);
+	vstate.maxndeletedspace = Max(vstate.maxndeletedspace, vstate.ndeletedspace);
+	vstate.ndeleted = 0;
+	vstate.deleted = palloc(sizeof(BTPendingRecycle) * vstate.ndeletedspace);
+
 	/*
 	 * The outer loop iterates over all index pages except the metapage, in
 	 * physical order (we hope the kernel will cooperate in providing
@@ -1112,7 +1184,18 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 *
 	 * Note that if no recyclable pages exist, we don't bother vacuuming the
 	 * FSM at all.
+	 *
+	 * Before vacuuming the FSM, try to make the most of the pages we
+	 * ourselves deleted: see if they can be recycled already (try to avoid
+	 * waiting until the next VACUUM operation to recycle).  Our approach is
+	 * to check the local array of pages that were newly deleted during this
+	 * VACUUM.
 	 */
+	if (vstate.ndeleted > 0)
+		_bt_newly_deleted_pages_recycle(rel, &vstate);
+
+	pfree(vstate.deleted);
+
 	if (stats->pages_free > 0)
 		IndexFreeSpaceMapVacuum(rel);
 }
@@ -1451,12 +1534,10 @@ backtrack:
 		oldcontext = MemoryContextSwitchTo(vstate->pagedelcontext);
 
 		/*
-		 * We trust the _bt_pagedel return value because it does not include
-		 * any page that a future call here from btvacuumscan is expected to
-		 * count.  There will be no double-counting.
+		 * _bt_pagedel maintains the bulk delete stats on our behalf
 		 */
 		Assert(blkno == scanblkno);
-		stats->pages_deleted += _bt_pagedel(rel, buf);
+		_bt_pagedel(rel, buf, vstate);
 
 		MemoryContextSwitchTo(oldcontext);
 		/* pagedel released buffer, so we shouldn't */
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 8b7c143db4..6ab9af4a43 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -999,6 +999,28 @@ btree_xlog_newroot(XLogReaderState *record)
  * the PGPROC->xmin > limitXmin test inside GetConflictingVirtualXIDs().
  * Consequently, one XID value achieves the same exclusion effect on primary
  * and standby.
+ *
+ * XXX It would make a great deal more sense if each nbtree index's FSM (or
+ * some equivalent structure) was completely crash-safe.  Importantly, this
+ * would enable page recycling's REDO side to work in a way that naturally
+ * matches original execution.
+ *
+ * Page deletion has to be crash safe already, plus xl_btree_reuse_page
+ * records are logged any time a backend has to recycle -- full crash safety
+ * is unlikely to add much overhead, and has clear efficiency benefits.  It
+ * would also simplify things by more explicitly decoupling page deletion and
+ * page recycling.  The benefits for REDO all follow from that.
+ *
+ * Under this scheme, the whole question of recycle safety could be moved from
+ * VACUUM to the consumer side.  That is, VACUUM would no longer have to defer
+ * placing a page that it deletes in the FSM until BTPageIsRecyclable() starts
+ * to return true -- _bt_getbut() would handle all details of safely deferring
+ * recycling instead.  _bt_getbut() would use the improved/crash-safe FSM to
+ * explicitly find a free page whose safexid is sufficiently old for recycling
+ * to be safe from the point of view of backends that run during original
+ * execution.  That just leaves the REDO side.  Instead of xl_btree_reuse_page
+ * records, we'd have FSM "consume/recycle page from the FSM" records that are
+ * associated with FSM page buffers/blocks.
  */
 static void
 btree_xlog_reuse_page(XLogReaderState *record)
-- 
2.27.0

Re: 64-bit XIDs in deleted nbtree pages

Reply via email to