On Thu, Feb 25, 2021 at 5:42 AM Masahiko Sawada <sawada.m...@gmail.com> wrote:
> btvacuumcleanup()  has been playing two roles: recycling deleted pages
> and collecting index statistics.

Right.

I pushed the VACUUM VERBOSE "index pages newly deleted"
instrumentation patch earlier -- it really isn't complicated or
controversial, so I saw no reason to hold it back.

Attached is v7, which now only has the final patch -- the optimization
that makes it possible for VACUUM to recycle pages that were newly
deleted during the same VACUUM operation.  Still no real changes.
Again, I just wanted to keep CFBot happy. I haven't thought about or
improved this final patch recently, and it clearly needs more work to
be ready to commit.

> If we don't want btvacuumcleanup() to collect index statistics, we can
> remove vacuum_cleanup_index_scale_factor (at least from btree
> perspectives), as you mentioned. One thing that may be worth
> mentioning is that the difference between the index statistics taken
> by ANALYZE and btvacuumcleanup() is that the former statistics is
> always an estimation. That’s calculated by compute_index_stats()
> whereas the latter uses the result of an index scan. If
> btvacuumcleanup() doesn’t scan the index and always returns NULL, it
> would become hard to get accurate index statistics, for example in a
> static table case. I've not checked which cases index statistics
> calculated by compute_index_stats() are inaccurate, though.

The historical context makes it easier to understand what to do here --
it makes it clear that an amvacuumcleanup() routine does not (or should
not) do any index scan when the index hasn't been (and won't be)
modified by the current VACUUM operation. The relevant sgml doc sentence I
quoted to you recently ("It is OK to return NULL if the index was not
changed at all during the VACUUM operation...") was added by commit
e57345975cf in 2006. Much of the relevant 2006 discussion is here,
FWIW:

https://www.postgresql.org/message-id/flat/26433.1146598265%40sss.pgh.pa.us#862ee11c24da63d0282e0025abbad19c

So now we have the formal rules for index AMs, as well as background
information about what various hackers (mostly Tom) were considering
when the rules were written.

> According to the doc, if amvacuumcleanup/btvacuumcleanup returns NULL,
> it means the index is not changed at all. So do_analyze_rel() executed
> by VACUUM ANALYZE also doesn't need to update the index statistics
> even when amvacuumcleanup/btvacuumcleanup returns NULL. No?

Consider hashvacuumcleanup() -- here it is in full (it hasn't really
changed since 2006, when it was updated by that same commit I cited):

/*
 * Post-VACUUM cleanup.
 *
 * Result: a palloc'd struct containing statistical info for VACUUM displays.
 */
IndexBulkDeleteResult *
hashvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
{
    Relation    rel = info->index;
    BlockNumber num_pages;

    /* If hashbulkdelete wasn't called, return NULL signifying no change */
    /* Note: this covers the analyze_only case too */
    if (stats == NULL)
        return NULL;

    /* update statistics */
    num_pages = RelationGetNumberOfBlocks(rel);
    stats->num_pages = num_pages;

    return stats;
}

Clearly hashvacuumcleanup() was considered by Tom when he revised the
documentation in 2006. Here are some observations about
hashvacuumcleanup() that seem relevant now:

* There is no "analyze_only" handling, just like nbtree.

"analyze_only" is only used by GIN, even now, 15+ years after it was
added. GIN uses it to make autovacuum workers (and only autovacuum
workers -- never a manually run VACUUM) process pending list insertions
during ANALYZE, just to make that happen more often.  This is a niche
thing -- clearly we don't have to care about it in nbtree, even if we
make btvacuumcleanup() (almost) always return NULL when there was no
btbulkdelete() call.
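
For reference, the analyze_only special case in ginvacuumcleanup() looks
roughly like this -- I'm paraphrasing from memory, so the exact
ginInsertCleanup() argument list may be slightly off:

    if (info->analyze_only)
    {
        /* Only an autovacuum worker bothers to flush the pending list */
        if (IsAutoVacuumWorkerProcess())
        {
            initGinState(&ginstate, index);
            ginInsertCleanup(&ginstate, false, true, true, stats);
        }
        return stats;
    }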

* num_pages (which will become pg_class.relpages for the index) is not
set when we return NULL -- hashvacuumcleanup() assumes that ANALYZE
will get to it eventually in the case where VACUUM does no real work
(when it just returns NULL).

* We also use RelationGetNumberOfBlocks() to set pg_class.relpages for
index relations during ANALYZE -- it's called when we call
vac_update_relstats() (I quoted this do_analyze_rel() code to you
directly in a recent email; a rough version appears again below).

* In general, pg_class.relpages isn't an estimate (because we use
RelationGetNumberOfBlocks(), both in the VACUUM-updates case and the
ANALYZE-updates case) -- only pg_class.reltuples is truly an estimate
during ANALYZE, and so getting a "true count" seems to have only
limited practical importance.
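
To save anyone from digging up that earlier email: the index stats
update in do_analyze_rel() goes through vac_update_relstats() roughly
like this (quoting from memory, so treat the details as approximate):

    for (ind = 0; ind < nindexes; ind++)
    {
        AnlIndexData *thisdata = &indexdata[ind];
        double      totalindexrows;

        totalindexrows = ceil(thisdata->tupleFract * totalrows);
        vac_update_relstats(Irel[ind],
                            RelationGetNumberOfBlocks(Irel[ind]),
                            totalindexrows,
                            0,
                            false,
                            InvalidTransactionId,
                            InvalidMultiXactId,
                            in_outer_xact);
    }

Note that relpages comes straight from RelationGetNumberOfBlocks() here,
too -- only the reltuples figure is extrapolated from the sample rows.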

I think that this sets a precedent in support of my view that we can
simply get rid of vacuum_cleanup_index_scale_factor without any
special effort to maintain pg_class.reltuples. As I said before, we
can safely make btvacuumcleanup() just like hashvacuumcleanup(),
except when there are known deleted-but-not-recycled pages, where a
full index scan really is necessary for reasons that are not related
to statistics at all (of course we still need the *logic* that was
added to nbtree by the vacuum_cleanup_index_scale_factor commit --
that is clearly necessary). My guess is that Tom would have made
btvacuumcleanup() look identical to hashvacuumcleanup() in 2006 if
nbtree didn't have page deletion to consider -- but that had to be
considered.
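
To make the proposed end state concrete, btvacuumcleanup() would end up
with roughly this shape (only a sketch, not a real patch -- it
hand-waves over how we detect "known deleted-but-not-recycled pages"):

IndexBulkDeleteResult *
btvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
{
    /* Nothing to do for ANALYZE-only calls, just like hash */
    if (info->analyze_only)
        return stats;

    if (stats == NULL)
    {
        /*
         * No btbulkdelete() call.  Only scan the index when deleted
         * pages might now be recyclable; otherwise return NULL,
         * signifying "index not changed at all".
         */
        if (!_bt_vacuum_needs_cleanup(info))
            return NULL;

        stats = (IndexBulkDeleteResult *) palloc0(sizeof(IndexBulkDeleteResult));
        btvacuumscan(info, stats, NULL, NULL, 0);
        /* Don't let the core code over-trust cleanup-only stats */
        stats->estimated_count = true;
    }

    /* relpages is cheap and exact, so always set it (as hash does) */
    stats->num_pages = RelationGetNumberOfBlocks(info->index);

    return stats;
}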

My reasoning here is also based on the tendency of the core code to
mostly think of hash indexes as very similar to nbtree indexes.

Even though "the letter of the law" favors removing the
vacuum_cleanup_index_scale_factor GUC + param in the way I have
outlined, that is not the only thing that matters -- we must also
consider "the spirit of the law". Realistically, hash indexes are far
less popular than nbtree indexes, and so even if I am 100% correct in
theory, the real world might not be so convinced by my legalistic
argument. We've already seen the issue with VACUUM ANALYZE (which has
not been truly consistent with the behavior of hashvacuumcleanup() for
many years). There might be more such issues.

I suppose I could ask Tom what he thinks? The hardest question is what
to do in the back branches... I really don't have a strong opinion
right now.

> > BTW, note that btvacuumcleanup set pg_class.reltuples to 0 in all
> > cases following the deduplication commit until my bug fix commit
> > 48e12913 (which was kind of a hack itself). This meant that the
> > statistics set by btvacuumcleanup (in the case where btbulkdelete
> > doesn't get called, the relevant case for
> > vacuum_cleanup_index_scale_factor) were just zero. So it was 100%
> > wrong for months before anybody noticed (or at least until anybody
> > complained).
> >
>
> Maybe we need more regression tests here.

I agree, but my point was that even a 100% broken approach to stats
within btvacuumcleanup() is not that noticeable. This supports the
idea that it just doesn't matter very much if a cleanup-only scan of
the index never takes place (or only takes place when we need to
recycle deleted pages, which is generally rare but will become very
rare once I commit the attached patch).

Also, my fix for this bug (commit 48e12913) was actually pretty bad;
there are now cases where a btvacuumcleanup()-only VACUUM will set
pg_class.reltuples to a value that is significantly below what it
should be (it all depends on how effective deduplication is with the
data). I probably should have made btvacuumcleanup()-only VACUUMs set
"stats->estimated_count = true", purely to make sure that the core code
doesn't trust the statistics too much (the value would still be fine
for VACUUM VERBOSE output). Right now we can get a pg_class.reltuples that is
"exactly wrong" -- it would actually be a big improvement if it was
"approximately correct".

Another new concern for me (one that is also unique to Postgres 13) is
autovacuum_vacuum_insert_scale_factor-driven autovacuums.

--
Peter Geoghegan
From 554f5ed05252d616641c05082bf3105d4d0d83f9 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <p...@bowt.ie>
Date: Thu, 25 Feb 2021 15:17:22 -0800
Subject: [PATCH v7] Recycle pages deleted during same VACUUM.

Author: Peter Geoghegan <p...@bowt.ie>
Discussion: https://postgr.es/m/CAH2-Wzk76_P=67iuscb1un44-gyzl-kgpsxbsxq_bdcma7q...@mail.gmail.com
---
 src/include/access/nbtree.h         | 22 ++++++-
 src/backend/access/nbtree/README    | 31 +++++++++
 src/backend/access/nbtree/nbtpage.c | 40 ++++++++++++
 src/backend/access/nbtree/nbtree.c  | 97 +++++++++++++++++++++++++++++
 src/backend/access/nbtree/nbtxlog.c | 22 +++++++
 5 files changed, 211 insertions(+), 1 deletion(-)

diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index b56b7b7868..876b8f3437 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -279,7 +279,8 @@ BTPageGetDeleteXid(Page page)
  * Is an existing page recyclable?
  *
  * This exists to centralize the policy on which deleted pages are now safe to
- * re-use.
+ * re-use.  The _bt_newly_deleted_pages_recycle() optimization behaves more
+ * aggressively, though that has certain known limitations.
  *
  * Note: PageIsNew() pages are always safe to recycle, but we can't deal with
  * them here (caller is responsible for that case themselves).  Caller might
@@ -316,14 +317,33 @@ BTPageIsRecyclable(Page page)
  * BTVacState is private nbtree.c state used during VACUUM.  It is exported
  * for use by page deletion related code in nbtpage.c.
  */
+typedef struct BTPendingRecycle
+{
+	BlockNumber blkno;
+	FullTransactionId safexid;
+} BTPendingRecycle;
+
 typedef struct BTVacState
 {
+	/*
+	 * VACUUM operation state
+	 */
 	IndexVacuumInfo *info;
 	IndexBulkDeleteResult *stats;
 	IndexBulkDeleteCallback callback;
 	void	   *callback_state;
 	BTCycleId	cycleid;
+
+	/*
+	 * Page deletion state for VACUUM
+	 */
 	MemoryContext pagedelcontext;
+	BTPendingRecycle *deleted;
+	bool		grow;
+	bool		full;
+	uint32		ndeletedspace;
+	uint64		maxndeletedspace;
+	uint32		ndeleted;
 } BTVacState;
 
 /*
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 46d49bf025..265814ea46 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -430,6 +430,37 @@ whenever it is subsequently taken from the FSM for reuse.  The deleted
 page's contents will be overwritten by the split operation (it will become
 the new right sibling page).
 
+Prior to PostgreSQL 14, VACUUM was only able to recycle pages that were
+deleted by a previous VACUUM operation (VACUUM typically placed all pages
+deleted by the last VACUUM into the FSM, though there were and are no
+certainties here).  This had the obvious disadvantage of creating
+uncertainty about when and how pages get recycled, especially with bursty
+workloads.  It was naive, even within the constraints of the design, since
+there is no reason to think that it will take long for a deleted page to
+become recyclable.  It's convenient to use XIDs to implement the drain
+technique, but that is totally unrelated to any of the other things that
+VACUUM needs to do with XIDs.
+
+VACUUM operations now consider whether it's possible to recycle any pages
+that the same operation deleted, at the end of the physical scan of the
+index (the last convenient point to check).  This changes nothing about
+the basic design, and so it might still not be possible to recycle any
+pages at that time (e.g., there might not be even a single new
+transaction after an index page deletion, but before VACUUM ends).  But
+we have little to lose and plenty to gain by trying.  We only need to keep
+around a little information about recently deleted pages in local memory.
+We don't even have to access the deleted pages a second time.
+
+Currently VACUUM delays considering the possibility of recycling its own
+recently deleted pages until the end of its btbulkdelete scan (or until the
+end of btvacuumcleanup in cases where there were no tuples to delete in
+the index).  It would be slightly more effective if btbulkdelete page
+deletions were deferred until btvacuumcleanup, simply because more time
+will have passed.  Our current approach works well enough in practice,
+especially in cases where it really matters: cases where we're vacuuming a
+large index, where recycling pages sooner rather than later is
+particularly likely to matter.
+
 Fastpath For Index Insertion
 ----------------------------
 
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 629a23628e..9d7d0186d0 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -2687,6 +2687,46 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	if (target <= scanblkno)
 		stats->pages_deleted++;
 
+	/*
+	 * Maintain array of pages that were deleted during current btvacuumscan()
+	 * call.  We may well be able to recycle them in a separate pass at the
+	 * end of the current btvacuumscan().
+	 *
+	 * Need to respect work_mem/maxndeletedspace limitation on size of deleted
+	 * array.  Our strategy when the array can no longer grow within the
+	 * bounds of work_mem is simple: keep earlier entries (which are likelier
+	 * to be recyclable in the end), but stop saving new entries.
+	 */
+	if (vstate->full)
+		return true;
+
+	if (vstate->ndeleted >= vstate->ndeletedspace)
+	{
+		uint64 newndeletedspace;
+
+		if (!vstate->grow)
+		{
+			vstate->full = true;
+			return true;
+		}
+
+		newndeletedspace = vstate->ndeletedspace * 2;
+		if (newndeletedspace > vstate->maxndeletedspace)
+		{
+			newndeletedspace = vstate->maxndeletedspace;
+			vstate->grow = false;
+		}
+		vstate->ndeletedspace = newndeletedspace;
+
+		vstate->deleted =
+			repalloc(vstate->deleted,
+					 sizeof(BTPendingRecycle) * vstate->ndeletedspace);
+	}
+
+	vstate->deleted[vstate->ndeleted].blkno = target;
+	vstate->deleted[vstate->ndeleted].safexid = safexid;
+	vstate->ndeleted++;
+
 	return true;
 }
 
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 504f5bef17..8aed93ff0a 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -21,7 +21,9 @@
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
 #include "access/relscan.h"
+#include "access/table.h"
 #include "access/xlog.h"
+#include "catalog/index.h"
 #include "commands/progress.h"
 #include "commands/vacuum.h"
 #include "miscadmin.h"
@@ -32,6 +34,7 @@
 #include "storage/indexfsm.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
+#include "storage/procarray.h"
 #include "storage/smgr.h"
 #include "utils/builtins.h"
 #include "utils/index_selfuncs.h"
@@ -860,6 +863,71 @@ _bt_vacuum_needs_cleanup(IndexVacuumInfo *info)
 	return false;
 }
 
+/*
+ * _bt_newly_deleted_pages_recycle() -- Are _bt_pagedel pages recyclable now?
+ *
+ * Note that we assume that the array is ordered by safexid.  No further
+ * entries can be safe to recycle once we encounter the first non-recyclable
+ * entry in the deleted array.
+ */
+static inline void
+_bt_newly_deleted_pages_recycle(Relation rel, BTVacState *vstate)
+{
+	IndexBulkDeleteResult *stats = vstate->stats;
+	Relation	heapRel;
+
+	Assert(vstate->ndeleted > 0);
+	Assert(stats->pages_newly_deleted >= vstate->ndeleted);
+
+	/*
+	 * Recompute VACUUM XID boundaries.
+	 *
+	 * We don't actually care about the oldest non-removable XID.  Computing
+	 * the oldest such XID has a useful side-effect: It updates the procarray
+	 * state that tracks XID horizon.  This is not just an optimization; it's
+	 * essential.  It allows the GlobalVisCheckRemovableFullXid() calls we
+	 * make here to notice if and when safexid values from pages this same
+	 * VACUUM operation deleted are sufficiently old to allow recycling to
+	 * take place safely.
+	 */
+	GetOldestNonRemovableTransactionId(NULL);
+
+	/*
+	 * Use the heap relation for GlobalVisCheckRemovableFullXid() calls (don't
+	 * pass NULL rel argument).
+	 *
+	 * This is an optimization; it allows us to be much more aggressive in
+	 * cases involving logical decoding (unless this happens to be a system
+	 * catalog).  We don't simply use BTPageIsRecyclable().
+	 *
+	 * XXX: The BTPageIsRecyclable() criteria creates problems for this
+	 * optimization.  Its safexid test is applied in a redundant manner within
+	 * _bt_getbuf() (via its BTPageIsRecyclable() call).  Consequently,
+	 * _bt_getbuf() may believe that it is still unsafe to recycle a page that
+	 * we know to be recycle safe -- in which case it is unnecessarily
+	 * discarded.
+	 *
+	 * We should get around to fixing this _bt_getbuf() issue some day.  For
+	 * now we can still proceed in the hopes that BTPageIsRecyclable() will
+	 * catch up with us before _bt_getbuf() ever reaches the page.
+	 */
+	heapRel = table_open(IndexGetRelation(RelationGetRelid(rel), false),
+						 AccessShareLock);
+	for (int i = 0; i < vstate->ndeleted; i++)
+	{
+		BlockNumber blkno = vstate->deleted[i].blkno;
+		FullTransactionId safexid = vstate->deleted[i].safexid;
+
+		if (!GlobalVisCheckRemovableFullXid(heapRel, safexid))
+			break;
+
+		RecordFreeIndexPage(rel, blkno);
+		stats->pages_free++;
+	}
+
+	table_close(heapRel, AccessShareLock);
+}
+
 /*
  * Bulk deletion of all index entries pointing to a set of heap tuples.
  * The set of target tuples is specified via a callback routine that tells
@@ -945,6 +1013,14 @@ btvacuumcleanup(IndexVacuumInfo *info, IndexBulkDeleteResult *stats)
 	 * _bt_vacuum_needs_cleanup() to force the next VACUUM to proceed with a
 	 * btvacuumscan() call.
 	 *
+	 * Note: Prior to PostgreSQL 14, we were completely reliant on the next
+	 * VACUUM operation taking care of recycling whatever pages the current
+	 * VACUUM operation found to be empty and then deleted.  It is now usually
+	 * possible for _bt_newly_deleted_pages_recycle() to recycle all of the
+	 * pages that any given VACUUM operation deletes, as part of the same
+	 * VACUUM operation.  As a result, it is rare for num_delpages to actually
+	 * exceed 0, including with indexes where page deletions are frequent.
+	 *
 	 * Note: We must delay the _bt_set_cleanup_info() call until this late
 	 * stage of VACUUM (the btvacuumcleanup() phase), to keep num_heap_tuples
 	 * accurate.  The btbulkdelete()-time num_heap_tuples value is generally
@@ -1033,6 +1109,16 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 												  "_bt_pagedel",
 												  ALLOCSET_DEFAULT_SIZES);
 
+	/* Allocate _bt_newly_deleted_pages_recycle related information */
+	vstate.ndeletedspace = 512;
+	vstate.grow = true;
+	vstate.full = false;
+	vstate.maxndeletedspace = ((work_mem * 1024L) / sizeof(BTPendingRecycle));
+	vstate.maxndeletedspace = Min(vstate.maxndeletedspace, MaxBlockNumber);
+	vstate.maxndeletedspace = Max(vstate.maxndeletedspace, vstate.ndeletedspace);
+	vstate.ndeleted = 0;
+	vstate.deleted = palloc(sizeof(BTPendingRecycle) * vstate.ndeletedspace);
+
 	/*
 	 * The outer loop iterates over all index pages except the metapage, in
 	 * physical order (we hope the kernel will cooperate in providing
@@ -1101,7 +1187,18 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	 *
 	 * Note that if no recyclable pages exist, we don't bother vacuuming the
 	 * FSM at all.
+	 *
+	 * Before vacuuming the FSM, try to make the most of the pages we
+	 * ourselves deleted: see if they can be recycled already (try to avoid
+	 * waiting until the next VACUUM operation to recycle).  Our approach is
+	 * to check the local array of pages that were newly deleted during this
+	 * VACUUM.
 	 */
+	if (vstate.ndeleted > 0)
+		_bt_newly_deleted_pages_recycle(rel, &vstate);
+
+	pfree(vstate.deleted);
+
 	if (stats->pages_free > 0)
 		IndexFreeSpaceMapVacuum(rel);
 }
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 8b7c143db4..6ab9af4a43 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -999,6 +999,28 @@ btree_xlog_newroot(XLogReaderState *record)
  * the PGPROC->xmin > limitXmin test inside GetConflictingVirtualXIDs().
  * Consequently, one XID value achieves the same exclusion effect on primary
  * and standby.
+ *
+ * XXX It would make a great deal more sense if each nbtree index's FSM (or
+ * some equivalent structure) was completely crash-safe.  Importantly, this
+ * would enable page recycling's REDO side to work in a way that naturally
+ * matches original execution.
+ *
+ * Page deletion has to be crash safe already, plus xl_btree_reuse_page
+ * records are logged any time a backend has to recycle -- full crash safety
+ * is unlikely to add much overhead, and has clear efficiency benefits.  It
+ * would also simplify things by more explicitly decoupling page deletion and
+ * page recycling.  The benefits for REDO all follow from that.
+ *
+ * Under this scheme, the whole question of recycle safety could be moved from
+ * VACUUM to the consumer side.  That is, VACUUM would no longer have to defer
+ * placing a page that it deletes in the FSM until BTPageIsRecyclable() starts
+ * to return true -- _bt_getbuf() would handle all details of safely deferring
+ * recycling instead.  _bt_getbuf() would use the improved/crash-safe FSM to
+ * explicitly find a free page whose safexid is sufficiently old for recycling
+ * to be safe from the point of view of backends that run during original
+ * execution.  That just leaves the REDO side.  Instead of xl_btree_reuse_page
+ * records, we'd have FSM "consume/recycle page from the FSM" records that are
+ * associated with FSM page buffers/blocks.
  */
 static void
 btree_xlog_reuse_page(XLogReaderState *record)
-- 
2.27.0
