(now really to -hackers)
Hi,

Over at [0] I'd implemented an optimization that allows us to skip
calling _bt_compare in _bt_moveright in many common cases. This patch,
when stacked on top of the prefix truncation patch, improves INSERT
performance by an additional 2-9%pt, with an extreme case of 45% in
the worscase index tests at [0].

The optimization is that we now recognze that our page split algorithm
all but guarantees that the HIKEY matches this page's downlink's right
separator key bytewise, excluding the data stored in the
IndexTupleData struct.

By caching the right separator index tuple in _bt_search, we can
compare the downlink's right separator and the HIKEY, and when they
are equal (memcmp() == 0) we don't have to call _bt_compare - the
HIKEY is known to be larger than the scan key, because our key is
smaller than the right separator, and thus transitively also smaller
than the HIKEY because it contains the same data. As _bt_compare can
call expensive user-provided functions, this can be a large
performance boon, especially when there are only a small number of
column getting compared on each page (e.g. index tuples of many 100s
of bytes, or dynamic prefix truncation is enabled).

By adding this, the number of _bt_compare calls per _bt_search is
often reduced by one per btree level.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)

PS. Best served with dynamic prefix truncation [1] and btree specialization [0].

[0] 
https://www.postgresql.org/message-id/CAEze2WiqOONRQTUT1p_ZV19nyMA69UU2s0e2dp+jSBM=j8s...@mail.gmail.com
[1] 
https://www.postgresql.org/message-id/flat/CAEze2Wh-h20DmPSMXp4qHR0-ykh9=z3ejx8mssbikboqaye...@mail.gmail.com
From 55a2d06037f530b6d79bc73ed21bd27b78a1cc53 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Sun, 29 Oct 2023 21:39:23 +0100
Subject: [PATCH v1] btree: optimize _bt_moveright using downlink's right
 separator

Due to the inner workings of the btree, it is extremely likely that the
right separator of the btree page's downlink on the parent page matches
the downlinked page's HIKEY: Only when the page was split after we
accessed its parent page (and this split wasn't completed yet) the HIKEY
will not match.

So, instead of doing _bt_compare in _bt_moveright, we can store the
right separator key of the downlink, and check if it matches the HIKEY of
the linked page. If they match, we don't have to call the relatively
expensive _bt_compare, which allows us to reuse the _bt_compare result of
the right separator key.

This reduces the number of all-attribute _bt_compare operations
we need to do on a page by 1 flat, thus increasing performance
of indexes that have expensive compare operations.

In passing, move the declaration of _bt_moveright into nbtsearch.c - I
found no user of the function anywhere but in nbtsearch.c.
---
 src/backend/access/nbtree/README      | 20 +++++++++
 src/backend/access/nbtree/nbtsearch.c | 65 +++++++++++++++++++++++++--
 src/include/access/nbtree.h           |  3 --
 3 files changed, 81 insertions(+), 7 deletions(-)

diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 52e646c7f7..c75793da5a 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -901,6 +901,26 @@ large groups of duplicates, maximizing space utilization.  Note also that
 deduplication more efficient.  Deduplication can be performed infrequently,
 without merging together existing posting list tuples too often.
 
+Notes about the right separator/HIKEY moveright optimization
+------------------------------------------------------------
+
+Page splits consistently cause the HIKEY of the split page to be inserted
+into the parent page as the right separator of the page's downlink (or,
+as the split page's new right sibling's left link), with the only
+difference between the HIKEY and the separator being the contents of the
+IndexTupleData struct of these tuples: the payloads are bit-identical.
+This allows us to reuse the _bt_compare result of the right separator key
+for the downlinked page's HIKEY, if we can determine that those are indeed
+bit-identical: Concurrent page splits and deletions may have caused the
+downlinked page to get a different HIKEY, which could have a different
+_bt_compare result.  To make this work, in _bt_search we cache the
+current downlink's right separator key, to then in the _bt_moveright
+phase of the layer below use memcmp() to validate our assumptions about
+the HIKEY matching the downlink's right separator key.  Only if the
+assumption is proven wrong (memcmp(HIKEY, right_sep) != 0), we call
+_bt_compare(), otherwise we can be certain that the parent page's result
+is unchanged, i.e. that _bt_compare would return "<".
+
 Notes about deduplication
 -------------------------
 
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index efc5284e5b..602a0f45e1 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -27,6 +27,9 @@
 
 static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
 static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static Buffer _bt_moveright(Relation rel, Relation heaprel, BTScanInsert key,
+							Buffer buf, bool forupdate, BTStack stack,
+							int access, IndexTuple rightsep);
 static int	_bt_binsrch_posting(BTScanInsert key, Page page,
 								OffsetNumber offnum);
 static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
@@ -98,6 +101,16 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
 {
 	BTStack		stack_in = NULL;
 	int			page_access = BT_READ;
+	union {
+		char	data[MAXALIGN_DOWN(BLCKSZ / 3)];
+		IndexTupleData tuple;
+		double	force_align_d;
+	}			rsepbuf
+#ifdef pg_attribute_aligned
+	pg_attribute_aligned(MAXIMUM_ALIGNOF)
+#endif
+	;
+	IndexTuple	rightsep = NULL;
 
 	/* heaprel must be set whenever _bt_allocbuf is reachable */
 	Assert(access == BT_READ || access == BT_WRITE);
@@ -134,7 +147,7 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
 		 * opportunity to finish splits of internal pages too.
 		 */
 		*bufP = _bt_moveright(rel, heaprel, key, *bufP, (access == BT_WRITE),
-							  stack_in, page_access);
+							  stack_in, page_access, rightsep);
 
 		/* if this is a leaf page, we're done */
 		page = BufferGetPage(*bufP);
@@ -152,6 +165,25 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
 		Assert(BTreeTupleIsPivot(itup) || !key->heapkeyspace);
 		child = BTreeTupleGetDownLink(itup);
 
+		if (offnum < PageGetMaxOffsetNumber(page))
+		{
+			ItemId	rightsepitem = PageGetItemId(page, offnum + 1);
+			IndexTuple pagerightsep = (IndexTuple) PageGetItem(page, rightsepitem);
+			memcpy(rsepbuf.data, pagerightsep, ItemIdGetLength(rightsepitem));
+			rightsep = &rsepbuf.tuple;
+		}
+		else if (!P_RIGHTMOST(opaque))
+		{
+			/*
+			 * The rightmost data tuple on inner page has P_HIKEY as its right
+			 * separator.
+			 */
+			ItemId	rightsepitem = PageGetItemId(page, P_HIKEY);
+			IndexTuple pagerightsep = (IndexTuple) PageGetItem(page, rightsepitem);
+			memcpy(rsepbuf.data, pagerightsep, ItemIdGetLength(rightsepitem));
+			rightsep = &rsepbuf.tuple;
+		}
+
 		/*
 		 * We need to save the location of the pivot tuple we chose in a new
 		 * stack entry for this page/level.  If caller ends up splitting a
@@ -194,7 +226,8 @@ _bt_search(Relation rel, Relation heaprel, BTScanInsert key, Buffer *bufP,
 		 * but before we acquired a write lock.  If it has, we may need to
 		 * move right to its new sibling.  Do that.
 		 */
-		*bufP = _bt_moveright(rel, heaprel, key, *bufP, true, stack_in, BT_WRITE);
+		*bufP = _bt_moveright(rel, heaprel, key, *bufP, true, stack_in, BT_WRITE,
+							  rightsep);
 	}
 
 	return stack_in;
@@ -238,7 +271,8 @@ _bt_moveright(Relation rel,
 			  Buffer buf,
 			  bool forupdate,
 			  BTStack stack,
-			  int access)
+			  int access,
+			  IndexTuple rightsep)
 {
 	Page		page;
 	BTPageOpaque opaque;
@@ -297,7 +331,30 @@ _bt_moveright(Relation rel,
 			continue;
 		}
 
-		if (P_IGNORE(opaque) || _bt_compare(rel, key, page, P_HIKEY) >= cmpval)
+		if (P_IGNORE(opaque))
+		{
+			buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
+			continue;
+		}
+
+		if (PointerIsValid(rightsep))
+		{
+			/*
+			 * Note: we're not in the rightmost page (see branchpoint earlier in
+			 * the loop), so we always have a HIKEY on this page.
+			 */
+			ItemId	hikeyid = PageGetItemId(page, P_HIKEY);
+			IndexTuple highkey = (IndexTuple) PageGetItem(page, hikeyid);
+
+			if (ItemIdGetLength(hikeyid) == IndexTupleSize(rightsep) &&
+				memcmp(&highkey[1], &rightsep[1],
+					   IndexTupleSize(rightsep) - sizeof(IndexTupleData)) == 0)
+			{
+				break;
+			}
+		}
+
+		if (_bt_compare(rel, key, page, P_HIKEY) >= cmpval)
 		{
 			/* step right one page */
 			buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 7bfbf3086c..fa7fc03362 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -1237,9 +1237,6 @@ extern void _bt_pendingfsm_finalize(Relation rel, BTVacState *vstate);
  */
 extern BTStack _bt_search(Relation rel, Relation heaprel, BTScanInsert key,
 						  Buffer *bufP, int access);
-extern Buffer _bt_moveright(Relation rel, Relation heaprel, BTScanInsert key,
-							Buffer buf, bool forupdate, BTStack stack,
-							int access);
 extern OffsetNumber _bt_binsrch_insert(Relation rel, BTInsertState insertstate);
 extern int32 _bt_compare(Relation rel, BTScanInsert key, Page page, OffsetNumber offnum);
 extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
-- 
2.40.1

Reply via email to