Re: [WIP] [B-Tree] Retail IndexTuple deletion

Andrey V. Lepikhov Mon, 02 Jul 2018 07:30:42 -0700


On 29.06.2018 14:07, Юрий Соколов wrote:

чт, 28 июн. 2018 г., 8:37 Andrey V. Lepikhov <a.lepik...@postgrespro.ru<mailto:a.lepik...@postgrespro.ru>>:




    On 28.06.2018 05:00, Peter Geoghegan wrote:
     > On Tue, Jun 26, 2018 at 11:40 PM, Andrey V. Lepikhov
     > <a.lepik...@postgrespro.ru <mailto:a.lepik...@postgrespro.ru>> wrote:
     >> I still believe that the patch for physical TID ordering in btree:
     >> 1) has its own value, not only for target deletion,
     >> 2) will require only a few local changes in my code,
     >> and this patches can be developed independently.
     >
     > I want to be clear on something now: I just don't think that this
     > patch has any chance of getting committed without something like my
     > own patch to go with it. The worst case for your patch without that
     > component is completely terrible. It's not really important for
    you to
     > actually formally make it part of your patch, so I'm not going to
     > insist on that or anything, but the reality is that my patch does not
     > have independent value -- and neither does yours.
     >
    As I wrote before in the last email, I will integrate TID sorting to my
    patches right now. Please, give me access to the last version of your
    code, if it possible.
    You can track the progress at https://github.com/danolivo/postgres git
    repository


Peter is absolutely right, imho: tie-breaking by TID within index
  ordering is inevitable for reliable performance of this patch.

In the new version the patch [1] was used in cooperation with 'retailindextuple deletion' and 'quick vacuum strategy' patches (see'0004-Retail-IndexTuple-Deletion-with-TID-sorting-in-leaf-.patch'.


To demonstrate the potential benefits, I did a test:

CREATE TABLE test (id serial primary key, value integer, factor integer);

INSERT INTO test (value, factor) SELECT random()*1e5, random()*1e3 FROMgenerate_series(1, 1e7);

CREATE INDEX ON test(value);
VACUUM;
DELETE FROM test WHERE (factor = 1);
VACUUM test;

Execution time of last "VACUUM test;" command on my notebook was:

with bulk deletion: 1.6 s;
with Quick Vacuum Strategy: 5.2 s;
with Quick Vacuum Strategy & TID sorting: 0.6 s.

[1]https://www.postgresql.org/message-id/CAH2-WzkVb0Kom%3DR%2B88fDFb%3DJSxZMFvbHVC6Mn9LJ2n%3DX%3DkS-Uw%40mail.gmail.com

With regards,
Sokolov Yura.


--
Andrey Lepikhov
Postgres Professional:
https://postgrespro.com
The Russian Postgres Company

>From 7f2384691de592bf4d11cc8b4d75eca5500cd500 Mon Sep 17 00:00:00 2001
From: "Andrey V. Lepikhov" <a.lepik...@postgrespro.ru>
Date: Mon, 2 Jul 2018 16:13:08 +0500
Subject: [PATCH 1/4] Retail-IndexTuple-Deletion-Access-Method

---
 contrib/bloom/blutils.c              |   1 +
 src/backend/access/brin/brin.c       |   1 +
 src/backend/access/gin/ginutil.c     |   1 +
 src/backend/access/gist/gist.c       |   1 +
 src/backend/access/hash/hash.c       |   1 +
 src/backend/access/index/indexam.c   |  15 ++++
 src/backend/access/nbtree/nbtree.c   | 138 +++++++++++++++++++++++++++++++++++
 src/backend/access/spgist/spgutils.c |   1 +
 src/include/access/amapi.h           |   6 ++
 src/include/access/genam.h           |  18 +++++
 src/include/access/nbtree.h          |   4 +
 11 files changed, 187 insertions(+)

diff --git a/contrib/bloom/blutils.c b/contrib/bloom/blutils.c
index 6b2b9e3..96f1d47 100644
--- a/contrib/bloom/blutils.c
+++ b/contrib/bloom/blutils.c
@@ -126,6 +126,7 @@ blhandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = blbuild;
 	amroutine->ambuildempty = blbuildempty;
 	amroutine->aminsert = blinsert;
+	amroutine->amtargetdelete = NULL;
 	amroutine->ambulkdelete = blbulkdelete;
 	amroutine->amvacuumcleanup = blvacuumcleanup;
 	amroutine->amcanreturn = NULL;
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index e95fbbc..a0e06bd 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -103,6 +103,7 @@ brinhandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = brinbuild;
 	amroutine->ambuildempty = brinbuildempty;
 	amroutine->aminsert = brininsert;
+	amroutine->amtargetdelete = NULL;
 	amroutine->ambulkdelete = brinbulkdelete;
 	amroutine->amvacuumcleanup = brinvacuumcleanup;
 	amroutine->amcanreturn = NULL;
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index 0a32182..acf14e7 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -58,6 +58,7 @@ ginhandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = ginbuild;
 	amroutine->ambuildempty = ginbuildempty;
 	amroutine->aminsert = gininsert;
+	amroutine->amtargetdelete = NULL;
 	amroutine->ambulkdelete = ginbulkdelete;
 	amroutine->amvacuumcleanup = ginvacuumcleanup;
 	amroutine->amcanreturn = NULL;
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 8a42eff..d7a53d2 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -80,6 +80,7 @@ gisthandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = gistbuild;
 	amroutine->ambuildempty = gistbuildempty;
 	amroutine->aminsert = gistinsert;
+	amroutine->amtargetdelete = NULL;
 	amroutine->ambulkdelete = gistbulkdelete;
 	amroutine->amvacuumcleanup = gistvacuumcleanup;
 	amroutine->amcanreturn = gistcanreturn;
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 0002df3..5fb32d6 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -76,6 +76,7 @@ hashhandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = hashbuild;
 	amroutine->ambuildempty = hashbuildempty;
 	amroutine->aminsert = hashinsert;
+	amroutine->amtargetdelete = NULL;
 	amroutine->ambulkdelete = hashbulkdelete;
 	amroutine->amvacuumcleanup = hashvacuumcleanup;
 	amroutine->amcanreturn = NULL;
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index 22b5cc9..9ebeb78 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -730,6 +730,21 @@ index_getbitmap(IndexScanDesc scan, TIDBitmap *bitmap)
 	return ntids;
 }
 
+IndexTargetDeleteResult *
+index_target_delete(IndexTargetDeleteInfo *info,
+					IndexTargetDeleteResult *stats,
+					Datum *values,
+					bool *isnull)
+{
+	Relation indexRelation = info->indexRelation;
+
+	RELATION_CHECKS;
+
+	CHECK_REL_PROCEDURE(amtargetdelete);
+
+	return indexRelation->rd_amroutine->amtargetdelete(info, stats, values, isnull);
+}
+
 /* ----------------
  *		index_bulk_delete - do mass deletion of index entries
  *
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 27a3032..c54aeac 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -34,7 +34,9 @@
 #include "storage/smgr.h"
 #include "utils/builtins.h"
 #include "utils/index_selfuncs.h"
+#include "utils/lsyscache.h"
 #include "utils/memutils.h"
+#include "utils/syscache.h"
 
 
 /* Working state needed by btvacuumpage */
@@ -127,6 +129,7 @@ bthandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = btbuild;
 	amroutine->ambuildempty = btbuildempty;
 	amroutine->aminsert = btinsert;
+	amroutine->amtargetdelete = bttargetdelete;
 	amroutine->ambulkdelete = btbulkdelete;
 	amroutine->amvacuumcleanup = btvacuumcleanup;
 	amroutine->amcanreturn = btcanreturn;
@@ -884,6 +887,141 @@ btbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	return stats;
 }
 
+static int
+tid_list_search(ItemPointer tid, ItemPointer tid_list, int ntid)
+{
+	for (int i = 0; i < ntid; i++)
+		if (ItemPointerEquals(tid, &(tid_list[i])))
+			return i;
+	return -1;
+}
+
+IndexTargetDeleteResult*
+bttargetdelete(IndexTargetDeleteInfo *info,
+			   IndexTargetDeleteResult *stats,
+			   Datum *values,
+			   bool *isnull)
+{
+	Relation		irel = info->indexRelation;
+	Relation		hrel = info->heapRelation;
+	ScanKey			skey;
+	int				keysCount = IndexRelationGetNumberOfKeyAttributes(irel);
+	BTStack			stack;
+	Buffer			buf;
+	Page			page;
+	BTPageOpaque	opaque;
+	OffsetNumber	offnum;
+	int				ndeletable = 0;
+	OffsetNumber	deletable[MaxOffsetNumber];
+	IndexTuple		itup;
+
+	if (stats == NULL)
+		stats = (IndexTargetDeleteResult *) palloc0(sizeof(IndexTargetDeleteResult));
+
+	itup = index_form_tuple(RelationGetDescr(irel), values, isnull);
+	skey = _bt_mkscankey(irel, itup);
+
+	/* Descend the tree and position ourselves on the target leaf page. */
+	stack = _bt_search(irel, keysCount, skey, false, &buf, BT_READ, NULL);
+	_bt_freestack(stack);
+
+	/* To prepare tuple entries search across index pages */
+	Assert(BufferIsValid(buf));
+	offnum = _bt_binsrch(irel, buf, keysCount, skey, false);
+	page = BufferGetPage(buf);
+	_bt_checkpage(irel, buf);
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+	for (;;)
+	{
+		int32		cmpval;
+		ItemId		itemid;
+		IndexTuple	itup;
+		int			pos;
+
+		/* Switch to the next page */
+		if (P_IGNORE(opaque) || (offnum > PageGetMaxOffsetNumber(page)))
+		{
+			/*
+			 * Before unlocking index page we need to delete
+			 * all currently found tuples
+			 */
+			if (ndeletable > 0)
+			{
+				LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+				LockBufferForCleanup(buf);
+
+				_bt_delitems_delete(irel, buf, deletable, ndeletable, hrel);
+
+				stats->tuples_removed += ndeletable;
+				ndeletable = 0;
+			}
+
+			/*
+			 * Check for end-of-index
+			 */
+			if (P_RIGHTMOST(opaque))
+				/* it is rightmost leaf */
+				break;
+
+			/*
+			 * Switch to the next index page
+			 */
+			buf = _bt_relandgetbuf(irel, buf, opaque->btpo_next, BT_READ);
+			page = BufferGetPage(buf);
+			_bt_checkpage(irel, buf);
+			opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+			offnum = P_FIRSTDATAKEY(opaque);
+			continue;
+		}
+
+		/*
+		 * This index entry satisfied to the scan key?
+		 */
+		cmpval = _bt_compare(irel, keysCount, skey, page, offnum);
+
+		if (cmpval != 0)
+			/* End of index entries, satisfied to the scan key */
+			break;
+
+		/*
+		 * To load index tuple and look for matches in the TID list of
+		 * dead heap tuples
+		 */
+		itemid = PageGetItemId(page, offnum);
+		itup = (IndexTuple) PageGetItem(page, itemid);
+		pos = tid_list_search(&(itup->t_tid), info->dead_tuples, info->num_dead_tuples);
+
+		if ((pos >= 0) && (!info->found_dead_tuples[pos]))
+		{
+			/* index entry for TID of dead tuple is found */
+			deletable[ndeletable++] = offnum;
+			info->found_dead_tuples[pos] = true;
+		}
+
+		offnum = OffsetNumberNext(offnum);
+	}
+
+	/*
+	 * Delete all found index tuples
+	 */
+	if (ndeletable > 0)
+	{
+		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+		LockBufferForCleanup(buf);
+
+		_bt_delitems_delete(irel, buf, deletable, ndeletable, hrel);
+
+		stats->tuples_removed += ndeletable;
+	}
+
+	/* Release scan key, unpin and unlock buffer */
+	_bt_freeskey(skey);
+	_bt_relbuf(irel, buf);
+
+	return stats;
+}
+
 /*
  * Post-VACUUM cleanup.
  *
diff --git a/src/backend/access/spgist/spgutils.c b/src/backend/access/spgist/spgutils.c
index 4a9b5da..3257281 100644
--- a/src/backend/access/spgist/spgutils.c
+++ b/src/backend/access/spgist/spgutils.c
@@ -56,6 +56,7 @@ spghandler(PG_FUNCTION_ARGS)
 	amroutine->ambuild = spgbuild;
 	amroutine->ambuildempty = spgbuildempty;
 	amroutine->aminsert = spginsert;
+	amroutine->amtargetdelete = NULL;
 	amroutine->ambulkdelete = spgbulkdelete;
 	amroutine->amvacuumcleanup = spgvacuumcleanup;
 	amroutine->amcanreturn = spgcanreturn;
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 14526a6..497b54e 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -76,6 +76,11 @@ typedef bool (*aminsert_function) (Relation indexRelation,
 								   IndexUniqueCheck checkUnique,
 								   struct IndexInfo *indexInfo);
 
+/* target delete */
+typedef IndexTargetDeleteResult *(*amtargetdelete_function) (IndexTargetDeleteInfo *info,
+															 IndexTargetDeleteResult *stats,
+															 Datum *values,
+															 bool *isnull);
 /* bulk delete */
 typedef IndexBulkDeleteResult *(*ambulkdelete_function) (IndexVacuumInfo *info,
 														 IndexBulkDeleteResult *stats,
@@ -207,6 +212,7 @@ typedef struct IndexAmRoutine
 	ambuild_function ambuild;
 	ambuildempty_function ambuildempty;
 	aminsert_function aminsert;
+	amtargetdelete_function amtargetdelete;
 	ambulkdelete_function ambulkdelete;
 	amvacuumcleanup_function amvacuumcleanup;
 	amcanreturn_function amcanreturn;	/* can be NULL */
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 24c720b..79b386e 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -51,6 +51,15 @@ typedef struct IndexVacuumInfo
 	BufferAccessStrategy strategy;	/* access strategy for reads */
 } IndexVacuumInfo;
 
+typedef struct IndexTargetDeleteInfo
+{
+	Relation	heapRelation;
+	Relation	indexRelation;			/* the index being vacuumed */
+	int			num_dead_tuples;
+	ItemPointer	dead_tuples;
+	bool*		found_dead_tuples;
+} IndexTargetDeleteInfo;
+
 /*
  * Struct for statistics returned by ambulkdelete and amvacuumcleanup
  *
@@ -79,6 +88,11 @@ typedef struct IndexBulkDeleteResult
 	BlockNumber pages_free;		/* # pages available for reuse */
 } IndexBulkDeleteResult;
 
+typedef struct IndexTargetDeleteResult
+{
+	int		tuples_removed; /* # removed during vacuum operation */
+} IndexTargetDeleteResult;
+
 /* Typedef for callback function to determine if a tuple is bulk-deletable */
 typedef bool (*IndexBulkDeleteCallback) (ItemPointer itemptr, void *state);
 
@@ -163,6 +177,10 @@ extern HeapTuple index_fetch_heap(IndexScanDesc scan);
 extern HeapTuple index_getnext(IndexScanDesc scan, ScanDirection direction);
 extern int64 index_getbitmap(IndexScanDesc scan, TIDBitmap *bitmap);
 
+extern IndexTargetDeleteResult *index_target_delete(IndexTargetDeleteInfo *info,
+													IndexTargetDeleteResult *stats,
+													Datum *values,
+													bool *isnull);
 extern IndexBulkDeleteResult *index_bulk_delete(IndexVacuumInfo *info,
 				  IndexBulkDeleteResult *stats,
 				  IndexBulkDeleteCallback callback,
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 04ecb4c..3b9461c 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -508,6 +508,10 @@ extern void btparallelrescan(IndexScanDesc scan);
 extern void btendscan(IndexScanDesc scan);
 extern void btmarkpos(IndexScanDesc scan);
 extern void btrestrpos(IndexScanDesc scan);
+extern IndexTargetDeleteResult *bttargetdelete(IndexTargetDeleteInfo *info,
+		IndexTargetDeleteResult *stats,
+		   Datum *values,
+		   bool *isnull);
 extern IndexBulkDeleteResult *btbulkdelete(IndexVacuumInfo *info,
 			 IndexBulkDeleteResult *stats,
 			 IndexBulkDeleteCallback callback,
-- 
2.7.4

>From 904d204c36a46d9f8d7433af37bbd76031516ad7 Mon Sep 17 00:00:00 2001
From: "Andrey V. Lepikhov" <a.lepik...@postgrespro.ru>
Date: Mon, 2 Jul 2018 16:17:01 +0500
Subject: [PATCH 2/4] Quick-vacuum-strategy

---
 src/backend/access/heap/heapam.c    |  31 +++++++
 src/backend/access/heap/pruneheap.c |  40 ++++++++-
 src/backend/commands/vacuumlazy.c   | 161 +++++++++++++++++++++++++++++++++++-
 src/backend/utils/misc/guc.c        |  10 ++-
 src/include/access/heapam.h         |   1 +
 5 files changed, 234 insertions(+), 9 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 72395a5..dd03366 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -9393,6 +9393,36 @@ heap_sync(Relation rel)
 }
 
 /*
+ * Mask DEAD tuples at a HEAP page
+ *
+ * We introduce this function for wal_consistency_checking satisfaction at
+ * Hot Standby node.
+ *
+ * DEAD tuple on a master node has t_cid and t_infomask, which can not be
+ * obtained by WAL records applying on a Hot Standby node.
+ */
+static void
+mask_dead_tuples(Page page)
+{
+	OffsetNumber	offnum;
+
+	for (offnum = FirstOffsetNumber; offnum <= PageGetMaxOffsetNumber(page); offnum = OffsetNumberNext(offnum))
+	{
+		ItemId	lp = PageGetItemId(page, offnum);
+		char*	htup_begin;
+
+		if (!ItemIdIsDead(lp))
+			continue;
+
+		if (!ItemIdHasStorage(lp))
+			continue;
+
+		htup_begin = (char *) PageGetItem(page, lp);
+		memset(htup_begin, MASK_MARKER, ItemIdGetLength(lp));
+	}
+}
+
+/*
  * Mask a heap page before performing consistency checks on it.
  */
 void
@@ -9405,6 +9435,7 @@ heap_mask(char *pagedata, BlockNumber blkno)
 
 	mask_page_hint_bits(page);
 	mask_unused_space(page);
+	mask_dead_tuples(page);
 
 	for (off = 1; off <= PageGetMaxOffsetNumber(page); off++)
 	{
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index c2f5343..ebbafe3 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -43,6 +43,9 @@ typedef struct
 	bool		marked[MaxHeapTuplesPerPage + 1];
 } PruneState;
 
+/* Parameter for target deletion strategy in lazy vacuum */
+double target_index_deletion_factor = 0.01;
+
 /* Local functions */
 static int heap_prune_chain(Relation relation, Buffer buffer,
 				 OffsetNumber rootoffnum,
@@ -405,7 +408,7 @@ heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum,
 			if (HeapTupleSatisfiesVacuum(&tup, OldestXmin, buffer)
 				== HEAPTUPLE_DEAD && !HeapTupleHeaderIsHotUpdated(htup))
 			{
-				heap_prune_record_unused(prstate, rootoffnum);
+				heap_prune_record_dead(prstate, rootoffnum);
 				HeapTupleHeaderAdvanceLatestRemovedXid(htup,
 													   &prstate->latestRemovedXid);
 				ndeleted++;
@@ -580,8 +583,10 @@ heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum,
 		 */
 		for (i = 1; (i < nchain) && (chainitems[i - 1] != latestdead); i++)
 		{
-			heap_prune_record_unused(prstate, chainitems[i]);
 			ndeleted++;
+			if (chainitems[i] == latestdead)
+				continue;
+			heap_prune_record_unused(prstate, chainitems[i]);
 		}
 
 		/*
@@ -598,9 +603,28 @@ heap_prune_chain(Relation relation, Buffer buffer, OffsetNumber rootoffnum,
 		 * redirect the root to the correct chain member.
 		 */
 		if (i >= nchain)
+		{
+			if (rootoffnum != latestdead)
+			{
+				if (ItemIdIsNormal(rootlp))
+					heap_prune_record_unused(prstate, latestdead);
+				else
+				{
+					/*
+					 * We allow overlapping of redirected and dead items
+					 */
+					heap_prune_record_redirect(prstate, rootoffnum, latestdead);
+					heap_prune_record_dead(prstate, latestdead);
+				}
+			}
 			heap_prune_record_dead(prstate, rootoffnum);
+		}
 		else
+		{
+			if (rootoffnum != latestdead)
+				heap_prune_record_unused(prstate, latestdead);
 			heap_prune_record_redirect(prstate, rootoffnum, chainitems[i]);
+		}
 	}
 	else if (nchain < 2 && ItemIdIsRedirected(rootlp))
 	{
@@ -653,7 +677,12 @@ heap_prune_record_dead(PruneState *prstate, OffsetNumber offnum)
 	Assert(prstate->ndead < MaxHeapTuplesPerPage);
 	prstate->nowdead[prstate->ndead] = offnum;
 	prstate->ndead++;
-	Assert(!prstate->marked[offnum]);
+
+	/*
+	 * We suppress checking prstate->marked[offnum]. It is not the best idea,
+	 * but this is most simplistic way to enable Dead Redirecting by
+	 * overlapping Dead and Redirected states.
+	 */
 	prstate->marked[offnum] = true;
 }
 
@@ -706,7 +735,10 @@ heap_page_prune_execute(Buffer buffer,
 		OffsetNumber off = *offnum++;
 		ItemId		lp = PageGetItemId(page, off);
 
-		ItemIdSetDead(lp);
+		if (target_index_deletion_factor > 0)
+			ItemIdMarkDead(lp);
+		else
+			ItemIdSetDead(lp);
 	}
 
 	/* Update all now-unused line pointers */
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 5649a70..0bb8abb 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -41,13 +41,16 @@
 #include "access/heapam_xlog.h"
 #include "access/htup_details.h"
 #include "access/multixact.h"
+#include "access/nbtree.h"
 #include "access/transam.h"
 #include "access/visibilitymap.h"
 #include "access/xlog.h"
+#include "catalog/index.h"
 #include "catalog/storage.h"
 #include "commands/dbcommands.h"
 #include "commands/progress.h"
 #include "commands/vacuum.h"
+#include "executor/executor.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "portability/instr_time.h"
@@ -139,7 +142,6 @@ typedef struct LVRelStats
 	bool		lock_waiter_detected;
 } LVRelStats;
 
-
 /* A few variables that don't seem worth passing around as parameters */
 static int	elevel = -1;
 
@@ -156,6 +158,8 @@ static void lazy_scan_heap(Relation onerel, int options,
 			   bool aggressive);
 static void lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats);
 static bool lazy_check_needs_freeze(Buffer buf, bool *hastup);
+static void quick_vacuum_index(Relation irel, Relation hrel,
+					LVRelStats *vacrelstats);
 static void lazy_vacuum_index(Relation indrel,
 				  IndexBulkDeleteResult **stats,
 				  LVRelStats *vacrelstats);
@@ -734,10 +738,17 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
 
 			/* Remove index entries */
 			for (i = 0; i < nindexes; i++)
-				lazy_vacuum_index(Irel[i],
+			{
+				bool use_quick_strategy = (vacrelstats->num_dead_tuples / vacrelstats->old_live_tuples < target_index_deletion_factor);
+
+				if (use_quick_strategy && (Irel[i]->rd_amroutine->amtargetdelete != NULL))
+					quick_vacuum_index(Irel[i], onerel, vacrelstats);
+				else
+					lazy_vacuum_index(Irel[i],
 								  &indstats[i],
 								  vacrelstats);
 
+			}
 			/*
 			 * Report that we are now vacuuming the heap.  We also increase
 			 * the number of index scans here; note that by using
@@ -1379,10 +1390,16 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
 
 		/* Remove index entries */
 		for (i = 0; i < nindexes; i++)
-			lazy_vacuum_index(Irel[i],
+		{
+			bool use_quick_strategy = (vacrelstats->num_dead_tuples / vacrelstats->old_live_tuples < target_index_deletion_factor);
+
+			if (use_quick_strategy && (Irel[i]->rd_amroutine->amtargetdelete != NULL))
+				quick_vacuum_index(Irel[i], onerel, vacrelstats);
+			else
+				lazy_vacuum_index(Irel[i],
 							  &indstats[i],
 							  vacrelstats);
-
+		}
 		/* Report that we are now vacuuming the heap */
 		hvp_val[0] = PROGRESS_VACUUM_PHASE_VACUUM_HEAP;
 		hvp_val[1] = vacrelstats->num_index_scans + 1;
@@ -1671,6 +1688,142 @@ lazy_check_needs_freeze(Buffer buf, bool *hastup)
 	return false;
 }
 
+/*
+ * Get tuple from heap for a scan key building.
+ */
+static HeapTuple
+get_tuple_by_tid(Relation rel, ItemPointer tid)
+{
+	Buffer			buffer;
+	Page			page;
+	OffsetNumber	offnum;
+	ItemId			lp;
+	HeapTuple		tuple;
+	bool			needLock = !RELATION_IS_LOCAL(rel);
+	BlockNumber		npages;
+
+	if (needLock)
+		LockRelationForExtension(rel, ExclusiveLock);
+	npages = RelationGetNumberOfBlocks(rel);
+	if (needLock)
+		UnlockRelationForExtension(rel, ExclusiveLock);
+	if (ItemPointerGetBlockNumber(tid) > npages)
+		return NULL;
+
+	buffer = ReadBufferExtended(rel, MAIN_FORKNUM, ItemPointerGetBlockNumber(tid), RBM_NORMAL, NULL);
+	LockBuffer(buffer, BUFFER_LOCK_SHARE);
+
+	page = (Page) BufferGetPage(buffer);
+	offnum = ItemPointerGetOffsetNumber(tid);
+	lp = PageGetItemId(page, offnum);
+
+	/*
+	 * VACUUM Races: someone already remove the tuple from HEAP. Ignore it.
+	 */
+	if (!ItemIdIsUsed(lp))
+	{
+		UnlockReleaseBuffer(buffer);
+		return NULL;
+	}
+	/* Walk along the chain */
+	while (!ItemIdHasStorage(lp))
+	{
+		offnum = ItemIdGetRedirect(lp);
+		lp = PageGetItemId(page, offnum);
+		Assert(ItemIdIsUsed(lp));
+	}
+
+	/* Form a tuple */
+	tuple = palloc(sizeof(HeapTupleData));
+	ItemPointerSet(&(tuple->t_self), BufferGetBlockNumber(buffer), offnum);
+	tuple->t_tableOid = RelationGetRelid(rel);
+	tuple->t_data = (HeapTupleHeader) PageGetItem(page, lp);
+	tuple->t_len = ItemIdGetLength(lp);
+	UnlockReleaseBuffer(buffer);
+	return tuple;
+}
+
+/*
+ *	quick_vacuum_index() -- quick vacuum one index relation.
+ *
+ *		Delete all the index entries pointing to tuples listed in
+ *		vacrelstats->dead_tuples.
+ */
+static void
+quick_vacuum_index(Relation irel, Relation hrel,
+				   LVRelStats *vacrelstats)
+{
+	int				tnum;
+	bool*			found = palloc0(vacrelstats->num_dead_tuples*sizeof(bool));
+	IndexInfo* 		indexInfo = BuildIndexInfo(irel);
+	EState*			estate = CreateExecutorState();
+	ExprContext*	econtext = GetPerTupleExprContext(estate);
+	ExprState*		predicate = ExecPrepareQual(indexInfo->ii_Predicate, estate);
+	TupleTableSlot*	slot = MakeSingleTupleTableSlot(RelationGetDescr(hrel));
+
+	IndexTargetDeleteResult	stats;
+	IndexTargetDeleteInfo	ivinfo;
+
+	ivinfo.indexRelation = irel;
+	ivinfo.heapRelation = hrel;
+
+	econtext->ecxt_scantuple = slot;
+
+	/* Get tuple from heap */
+	for (tnum = 0; tnum < vacrelstats->num_dead_tuples; tnum++)
+	{
+		HeapTuple		tuple;
+		Datum			values[INDEX_MAX_KEYS];
+		bool			isnull[INDEX_MAX_KEYS];
+
+		/* Index entry for the TID was deleted early */
+		if (found[tnum])
+			continue;
+
+		/* Get a tuple from heap */
+		if ((tuple = get_tuple_by_tid(hrel, &(vacrelstats->dead_tuples[tnum]))) == NULL)
+		{
+			/*
+			 * Tuple has 'not used' status.
+			 */
+			found[tnum] = true;
+			continue;
+		}
+
+		/*
+		 * Form values[] and isnull[] arrays from for index tuple
+		 * by heap tuple
+		 */
+		MemoryContextReset(econtext->ecxt_per_tuple_memory);
+
+		ExecStoreTuple(tuple, slot, InvalidBuffer, false);
+
+		/*
+		 * In a partial index, ignore tuples that don't satisfy the
+		 * predicate.
+		 */
+		if ((predicate != NULL) && (!ExecQual(predicate, econtext)))
+		{
+			found[tnum] = true;
+			continue;
+		}
+
+		FormIndexDatum(indexInfo, slot, estate, values, isnull);
+
+		/*
+		 * Make attempt to delete some index entries by one tree descent.
+		 * We use only a part of TID list, which contains not found TID's.
+		 */
+		ivinfo.dead_tuples = &(vacrelstats->dead_tuples[tnum]);
+		ivinfo.num_dead_tuples = vacrelstats->num_dead_tuples - tnum;
+		ivinfo.found_dead_tuples = found + tnum;
+		index_target_delete(&ivinfo, &stats, values, isnull);
+	}
+
+	pfree(found);
+	ExecDropSingleTupleTableSlot(slot);
+	FreeExecutorState(estate);
+}
 
 /*
  *	lazy_vacuum_index() -- vacuum one index relation.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index fa3c8a7..3f6c83b 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -3236,7 +3236,15 @@ static struct config_real ConfigureNamesReal[] =
 		0.1, 0.0, 100.0,
 		NULL, NULL, NULL
 	},
-
+	{
+		{"target_index_deletion_factor", PGC_SIGHUP, AUTOVACUUM,
+			gettext_noop("Maximum number of vacuumed tuples as a fraction of reltuples where we can use target index vacuum strategy."),
+			NULL
+		},
+		&target_index_deletion_factor,
+		0.01, 0.0, 1.0,
+		NULL, NULL, NULL
+	},
 	{
 		{"checkpoint_completion_target", PGC_SIGHUP, WAL_CHECKPOINTS,
 			gettext_noop("Time spent flushing dirty buffers during checkpoint, as fraction of checkpoint interval."),
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index ca5cad7..c5a7bc8 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -100,6 +100,7 @@ extern Relation heap_openrv_extended(const RangeVar *relation,
 typedef struct HeapScanDescData *HeapScanDesc;
 typedef struct ParallelHeapScanDescData *ParallelHeapScanDesc;
 
+extern double target_index_deletion_factor;
 /*
  * HeapScanIsValid
  *		True iff the heap scan is valid.
-- 
2.7.4

>From 612a97598317e81d1087413d1cd10f3c5b6cc44a Mon Sep 17 00:00:00 2001
From: "Andrey V. Lepikhov" <a.lepik...@postgrespro.ru>
Date: Mon, 2 Jul 2018 16:26:56 +0500
Subject: [PATCH 3/4] Ensure-nbtree-leaf-tuple-keys-are-always-unique

---
 contrib/amcheck/verify_nbtree.c                    | 219 +++++++++++++++------
 contrib/earthdistance/expected/earthdistance.out   |   2 +-
 contrib/file_fdw/output/file_fdw.source            |  10 +-
 src/backend/access/nbtree/README                   |  70 ++++---
 src/backend/access/nbtree/nbtinsert.c              | 133 ++++++++-----
 src/backend/access/nbtree/nbtpage.c                |   6 +-
 src/backend/access/nbtree/nbtsearch.c              |  96 +++++++--
 src/backend/access/nbtree/nbtsort.c                |  53 +++--
 src/backend/access/nbtree/nbtutils.c               | 155 ++++++++++++---
 src/backend/access/nbtree/nbtxlog.c                |   3 +
 src/backend/utils/sort/tuplesort.c                 |   7 +-
 src/include/access/nbtree.h                        |  71 +++++--
 .../test_extensions/expected/test_extensions.out   |   6 +-
 src/test/regress/expected/aggregates.out           |   4 +-
 src/test/regress/expected/alter_table.out          |  20 +-
 src/test/regress/expected/collate.out              |   2 +-
 src/test/regress/expected/create_type.out          |   8 +-
 src/test/regress/expected/create_view.out          |   2 +-
 src/test/regress/expected/dependency.out           |   4 +-
 src/test/regress/expected/domain.out               |   4 +-
 src/test/regress/expected/event_trigger.out        |  75 ++++---
 src/test/regress/expected/foreign_data.out         |   8 +-
 src/test/regress/expected/foreign_key.out          |   2 +-
 src/test/regress/expected/indexing.out             |  12 +-
 src/test/regress/expected/inherit.out              |  16 +-
 src/test/regress/expected/matview.out              |  18 +-
 src/test/regress/expected/rowsecurity.out          |   4 +-
 src/test/regress/expected/select_into.out          |   4 +-
 src/test/regress/expected/triggers.out             |  16 +-
 src/test/regress/expected/truncate.out             |   4 +-
 src/test/regress/expected/typed_table.out          |  12 +-
 src/test/regress/expected/updatable_views.out      |  28 +--
 src/test/regress/output/tablespace.source          |   8 +-
 src/test/regress/sql/updatable_views.sql           |   2 +
 34 files changed, 731 insertions(+), 353 deletions(-)

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index a1438a2..2358bfa 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -46,6 +46,13 @@ PG_MODULE_MAGIC;
 #define InvalidBtreeLevel	((uint32) InvalidBlockNumber)
 
 /*
+ * Convenience macro to get number of key attributes in tuple in low-context
+ * fashion
+ */
+#define BTreeTupleGetNKeyAtts(itup, rel)   \
+	Min(IndexRelationGetNumberOfKeyAttributes(rel), BTreeTupleGetNAtts(itup, rel))
+
+/*
  * State associated with verifying a B-Tree index
  *
  * target is the point of reference for a verification operation.
@@ -125,25 +132,27 @@ static void bt_check_every_level(Relation rel, Relation heaprel,
 static BtreeLevel bt_check_level_from_leftmost(BtreeCheckState *state,
 							 BtreeLevel level);
 static void bt_target_page_check(BtreeCheckState *state);
-static ScanKey bt_right_page_check_scankey(BtreeCheckState *state);
+static IndexTuple bt_right_page_check_tuple(BtreeCheckState *state);
 static void bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
-				  ScanKey targetkey);
+				  ScanKey targetkey, ItemPointer scantid, int tupnkeyatts);
 static void bt_downlink_missing_check(BtreeCheckState *state);
 static void bt_tuple_present_callback(Relation index, HeapTuple htup,
 						  Datum *values, bool *isnull,
 						  bool tupleIsAlive, void *checkstate);
 static inline bool offset_is_negative_infinity(BTPageOpaque opaque,
 							OffsetNumber offset);
+static inline bool invariant_l_offset(BtreeCheckState *state,
+					 int tupnkeyatts, ScanKey key, ItemPointer scantid,
+					 OffsetNumber upperbound);
 static inline bool invariant_leq_offset(BtreeCheckState *state,
-					 ScanKey key,
+					 int tupnkeyatts, ScanKey key, ItemPointer scantid,
 					 OffsetNumber upperbound);
-static inline bool invariant_geq_offset(BtreeCheckState *state,
-					 ScanKey key,
+static inline bool invariant_g_offset(BtreeCheckState *state,
+					 int tupnkeyatts, ScanKey key, ItemPointer scantid,
 					 OffsetNumber lowerbound);
-static inline bool invariant_leq_nontarget_offset(BtreeCheckState *state,
-							   Page other,
-							   ScanKey key,
-							   OffsetNumber upperbound);
+static inline bool invariant_l_nontarget_offset(BtreeCheckState *state,
+							   Page other, int tupnkeyatts, ScanKey key,
+							   ItemPointer scantid, OffsetNumber upperbound);
 static Page palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum);
 
 /*
@@ -834,8 +843,10 @@ bt_target_page_check(BtreeCheckState *state)
 	{
 		ItemId		itemid;
 		IndexTuple	itup;
-		ScanKey		skey;
 		size_t		tupsize;
+		int			tupnkeyatts;
+		ScanKey		skey;
+		ItemPointer scantid;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -902,8 +913,17 @@ bt_target_page_check(BtreeCheckState *state)
 		if (offset_is_negative_infinity(topaque, offset))
 			continue;
 
-		/* Build insertion scankey for current page offset */
+		/*
+		 * Build insertion scankey for current page offset/tuple.
+		 *
+		 * As required by _bt_mkscankey(), track number of key attributes,
+		 * which is needed so that _bt_compare() calls handle truncated
+		 * attributes correctly.  Never count non-key attributes in
+		 * non-truncated tuples as key attributes, though.
+		 */
+		tupnkeyatts = BTreeTupleGetNKeyAtts(itup, state->rel);
 		skey = _bt_mkscankey(state->rel, itup);
+		scantid = BTreeTupleGetHeapTID(itup);
 
 		/* Fingerprint leaf page tuples (those that point to the heap) */
 		if (state->heapallindexed && P_ISLEAF(topaque) && !ItemIdIsDead(itemid))
@@ -930,7 +950,7 @@ bt_target_page_check(BtreeCheckState *state)
 		 * and probably not markedly more effective in practice.
 		 */
 		if (!P_RIGHTMOST(topaque) &&
-			!invariant_leq_offset(state, skey, P_HIKEY))
+			!invariant_leq_offset(state, tupnkeyatts, skey, scantid, P_HIKEY))
 		{
 			char	   *itid,
 					   *htid;
@@ -956,11 +976,11 @@ bt_target_page_check(BtreeCheckState *state)
 		 * * Item order check *
 		 *
 		 * Check that items are stored on page in logical order, by checking
-		 * current item is less than or equal to next item (if any).
+		 * current item is strictly less than next item (if any).
 		 */
 		if (OffsetNumberNext(offset) <= max &&
-			!invariant_leq_offset(state, skey,
-								  OffsetNumberNext(offset)))
+			!invariant_l_offset(state, tupnkeyatts, skey, scantid,
+								OffsetNumberNext(offset)))
 		{
 			char	   *itid,
 					   *htid,
@@ -1017,16 +1037,28 @@ bt_target_page_check(BtreeCheckState *state)
 		 */
 		else if (offset == max)
 		{
+			IndexTuple	righttup;
 			ScanKey		rightkey;
+			int			righttupnkeyatts;
+			ItemPointer rightscantid;
 
 			/* Get item in next/right page */
-			rightkey = bt_right_page_check_scankey(state);
+			righttup = bt_right_page_check_tuple(state);
+
+			/* Set up right item scankey */
+			if (righttup)
+			{
+				righttupnkeyatts = BTreeTupleGetNKeyAtts(righttup, state->rel);
+				rightkey = _bt_mkscankey(state->rel, righttup);
+				rightscantid = BTreeTupleGetHeapTID(righttup);
+			}
 
-			if (rightkey &&
-				!invariant_geq_offset(state, rightkey, max))
+			if (righttup &&
+				!invariant_g_offset(state, righttupnkeyatts, rightkey,
+									rightscantid, max))
 			{
 				/*
-				 * As explained at length in bt_right_page_check_scankey(),
+				 * As explained at length in bt_right_page_check_tuple(),
 				 * there is a known !readonly race that could account for
 				 * apparent violation of invariant, which we must check for
 				 * before actually proceeding with raising error.  Our canary
@@ -1069,7 +1101,7 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			BlockNumber childblock = BTreeInnerTupleGetDownLink(itup);
 
-			bt_downlink_check(state, childblock, skey);
+			bt_downlink_check(state, childblock, skey, scantid, tupnkeyatts);
 		}
 	}
 
@@ -1083,9 +1115,9 @@ bt_target_page_check(BtreeCheckState *state)
 }
 
 /*
- * Return a scankey for an item on page to right of current target (or the
+ * Return an index tuple for an item on page to right of current target (or the
  * first non-ignorable page), sufficient to check ordering invariant on last
- * item in current target page.  Returned scankey relies on local memory
+ * item in current target page.  Returned tuple relies on local memory
  * allocated for the child page, which caller cannot pfree().  Caller's memory
  * context should be reset between calls here.
  *
@@ -1098,8 +1130,8 @@ bt_target_page_check(BtreeCheckState *state)
  * Note that !readonly callers must reverify that target page has not
  * been concurrently deleted.
  */
-static ScanKey
-bt_right_page_check_scankey(BtreeCheckState *state)
+static IndexTuple
+bt_right_page_check_tuple(BtreeCheckState *state)
 {
 	BTPageOpaque opaque;
 	ItemId		rightitem;
@@ -1287,11 +1319,10 @@ bt_right_page_check_scankey(BtreeCheckState *state)
 	}
 
 	/*
-	 * Return first real item scankey.  Note that this relies on right page
-	 * memory remaining allocated.
+	 * Return first real item.  Note that this relies on right page memory
+	 * remaining allocated.
 	 */
-	return _bt_mkscankey(state->rel,
-						 (IndexTuple) PageGetItem(rightpage, rightitem));
+	return (IndexTuple) PageGetItem(rightpage, rightitem);
 }
 
 /*
@@ -1305,7 +1336,7 @@ bt_right_page_check_scankey(BtreeCheckState *state)
  */
 static void
 bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
-				  ScanKey targetkey)
+				  ScanKey targetkey, ItemPointer scantid, int tupnkeyatts)
 {
 	OffsetNumber offset;
 	OffsetNumber maxoffset;
@@ -1354,7 +1385,7 @@ bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
 
 	/*
 	 * Verify child page has the downlink key from target page (its parent) as
-	 * a lower bound.
+	 * a lower bound; downlink must be strictly less than all keys on the page.
 	 *
 	 * Check all items, rather than checking just the first and trusting that
 	 * the operator class obeys the transitive law.
@@ -1404,14 +1435,14 @@ bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
 		/*
 		 * Skip comparison of target page key against "negative infinity"
 		 * item, if any.  Checking it would indicate that it's not an upper
-		 * bound, but that's only because of the hard-coding within
-		 * _bt_compare().
+		 * bound, but that's only because of the hard-coding for negative
+		 * inifinity items within _bt_compare().
 		 */
 		if (offset_is_negative_infinity(copaque, offset))
 			continue;
 
-		if (!invariant_leq_nontarget_offset(state, child,
-											targetkey, offset))
+		if (!invariant_l_nontarget_offset(state, child, tupnkeyatts, targetkey,
+										  scantid, offset))
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("down-link lower bound invariant violated for index \"%s\"",
@@ -1752,6 +1783,51 @@ offset_is_negative_infinity(BTPageOpaque opaque, OffsetNumber offset)
 }
 
 /*
+ * Does the invariant hold that the key is strictly less than a given upper
+ * bound offset item?
+ *
+ * If this function returns false, convention is that caller throws error due
+ * to corruption.
+ */
+static inline bool
+invariant_l_offset(BtreeCheckState *state, int tupnkeyatts, ScanKey key,
+				   ItemPointer scantid, OffsetNumber upperbound)
+{
+	int32		cmp;
+
+	cmp = _bt_compare(state->rel, tupnkeyatts, key, scantid, state->target,
+					  upperbound);
+
+	/*
+	 * _bt_compare interprets the absence of attributes in scan keys as meaning
+	 * that they're not participating in a search, not as negative infinity
+	 * (only tuples within the index are treated as negative infinity).
+	 * Compensate for that here.
+	 */
+	if (cmp == 0)
+	{
+		ItemId		itemid;
+		IndexTuple	ritup;
+		int			uppnkeyatts;
+		ItemPointer rheaptid;
+
+		itemid = PageGetItemId(state->target, upperbound);
+		ritup = (IndexTuple) PageGetItem(state->target, itemid);
+		uppnkeyatts = BTreeTupleGetNKeyAtts(ritup, state->rel);
+
+		/* Get heap TID for item to the right */
+		rheaptid = BTreeTupleGetHeapTID(ritup);
+
+		if (uppnkeyatts == tupnkeyatts)
+			return scantid == NULL && rheaptid != NULL;
+
+		return tupnkeyatts < uppnkeyatts;
+	}
+
+	return cmp < 0;
+}
+
+/*
  * Does the invariant hold that the key is less than or equal to a given upper
  * bound offset item?
  *
@@ -1759,57 +1835,90 @@ offset_is_negative_infinity(BTPageOpaque opaque, OffsetNumber offset)
  * to corruption.
  */
 static inline bool
-invariant_leq_offset(BtreeCheckState *state, ScanKey key,
-					 OffsetNumber upperbound)
+invariant_leq_offset(BtreeCheckState *state, int tupnkeyatts, ScanKey key,
+					 ItemPointer scantid, OffsetNumber upperbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, state->target, upperbound);
+	cmp = _bt_compare(state->rel, tupnkeyatts, key, scantid, state->target,
+					  upperbound);
 
 	return cmp <= 0;
 }
 
 /*
- * Does the invariant hold that the key is greater than or equal to a given
- * lower bound offset item?
+ * Does the invariant hold that the key is strictly greater than a given lower
+ * bound offset item?
  *
  * If this function returns false, convention is that caller throws error due
  * to corruption.
  */
 static inline bool
-invariant_geq_offset(BtreeCheckState *state, ScanKey key,
-					 OffsetNumber lowerbound)
+invariant_g_offset(BtreeCheckState *state, int tupnkeyatts, ScanKey key,
+				   ItemPointer scantid, OffsetNumber lowerbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, state->target, lowerbound);
+	/*
+	 * No need to consider possibility that scankey has attributes that we need
+	 * to force to be interpreted as negative infinity, since scan key has to
+	 * be strictly greater than lower bound offset.
+	 */
+	cmp = _bt_compare(state->rel, tupnkeyatts, key, scantid, state->target,
+					  lowerbound);
 
-	return cmp >= 0;
+	return cmp > 0;
 }
 
 /*
- * Does the invariant hold that the key is less than or equal to a given upper
+ * Does the invariant hold that the key is strictly less than a given upper
  * bound offset item, with the offset relating to a caller-supplied page that
- * is not the current target page? Caller's non-target page is typically a
- * child page of the target, checked as part of checking a property of the
- * target page (i.e. the key comes from the target).
+ * is not the current target page?
+ *
+ * Caller's non-target page is a child page of the target, checked as part of
+ * checking a property of the target page (i.e.  the key comes from the
+ * target).
  *
  * If this function returns false, convention is that caller throws error due
  * to corruption.
  */
 static inline bool
-invariant_leq_nontarget_offset(BtreeCheckState *state,
-							   Page nontarget, ScanKey key,
-							   OffsetNumber upperbound)
+invariant_l_nontarget_offset(BtreeCheckState *state, Page nontarget,
+							 int tupnkeyatts, ScanKey key, ItemPointer scantid,
+							 OffsetNumber upperbound)
 {
-	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(state->rel);
 	int32		cmp;
 
-	cmp = _bt_compare(state->rel, nkeyatts, key, nontarget, upperbound);
+	cmp = _bt_compare(state->rel, tupnkeyatts, key, scantid, nontarget,
+					  upperbound);
 
-	return cmp <= 0;
+	/*
+	 * _bt_compare interprets the absence of attributes in scan keys as meaning
+	 * that they're not participating in a search, not as negative infinity
+	 * (only tuples within the index are treated as negative infinity).
+	 * Compensate for that here.
+	 */
+	if (cmp == 0)
+	{
+		ItemId		itemid;
+		IndexTuple	child;
+		int			uppnkeyatts;
+		ItemPointer childheaptid;
+
+		itemid = PageGetItemId(nontarget, upperbound);
+		child = (IndexTuple) PageGetItem(nontarget, itemid);
+		uppnkeyatts = BTreeTupleGetNKeyAtts(child, state->rel);
+
+		/* Get heap TID for item from child/non-target */
+		childheaptid = BTreeTupleGetHeapTID(child);
+
+		if (uppnkeyatts == tupnkeyatts)
+			return scantid == NULL && childheaptid != NULL;
+
+		return tupnkeyatts < uppnkeyatts;
+	}
+
+	return cmp < 0;
 }
 
 /*
diff --git a/contrib/earthdistance/expected/earthdistance.out b/contrib/earthdistance/expected/earthdistance.out
index 36a5a7b..0ce34da 100644
--- a/contrib/earthdistance/expected/earthdistance.out
+++ b/contrib/earthdistance/expected/earthdistance.out
@@ -968,7 +968,7 @@ SELECT abs(cube_distance(ll_to_earth(-30,-90), '(0)'::cube) / earth() - 1) <
 
 drop extension cube;  -- fail, earthdistance requires it
 ERROR:  cannot drop extension cube because other objects depend on it
-DETAIL:  extension earthdistance depends on extension cube
+DETAIL:  extension earthdistance depends on function cube_out(cube)
 HINT:  Use DROP ... CASCADE to drop the dependent objects too.
 drop extension earthdistance;
 drop type cube;  -- fail, extension cube requires it
diff --git a/contrib/file_fdw/output/file_fdw.source b/contrib/file_fdw/output/file_fdw.source
index 853c9f9..42bf16b 100644
--- a/contrib/file_fdw/output/file_fdw.source
+++ b/contrib/file_fdw/output/file_fdw.source
@@ -432,10 +432,10 @@ RESET ROLE;
 DROP EXTENSION file_fdw CASCADE;
 NOTICE:  drop cascades to 7 other objects
 DETAIL:  drop cascades to server file_server
-drop cascades to user mapping for regress_file_fdw_superuser on server file_server
-drop cascades to user mapping for regress_no_priv_user on server file_server
-drop cascades to foreign table agg_text
-drop cascades to foreign table agg_csv
-drop cascades to foreign table agg_bad
 drop cascades to foreign table text_csv
+drop cascades to foreign table agg_bad
+drop cascades to foreign table agg_csv
+drop cascades to foreign table agg_text
+drop cascades to user mapping for regress_no_priv_user on server file_server
+drop cascades to user mapping for regress_file_fdw_superuser on server file_server
 DROP ROLE regress_file_fdw_superuser, regress_file_fdw_user, regress_no_priv_user;
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 3680e69..0782f01 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -46,18 +46,15 @@ the real "key" at all, just at the link field.)  We can distinguish
 items at the leaf level in the same way, by examining their links to
 heap tuples; we'd never have two items for the same heap tuple.
 
-Lehman and Yao assume that the key range for a subtree S is described
+Lehman and Yao require that the key range for a subtree S is described
 by Ki < v <= Ki+1 where Ki and Ki+1 are the adjacent keys in the parent
-page.  This does not work for nonunique keys (for example, if we have
-enough equal keys to spread across several leaf pages, there *must* be
-some equal bounding keys in the first level up).  Therefore we assume
-Ki <= v <= Ki+1 instead.  A search that finds exact equality to a
-bounding key in an upper tree level must descend to the left of that
-key to ensure it finds any equal keys in the preceding page.  An
-insertion that sees the high key of its target page is equal to the key
-to be inserted has a choice whether or not to move right, since the new
-key could go on either page.  (Currently, we try to find a page where
-there is room for the new key without a split.)
+page.  A search that finds exact equality to a bounding key in an upper
+tree level must descend to the left of that key to ensure it finds any
+equal keys.  An insertion that sees the high key of its target page is
+equal to the key to be inserted cannot move right, since the downlink
+for the right sibling in the parent must always be strictly less than
+right sibling keys (this is always possible because the leftmost
+downlink on any non-leaf level is always a negative infinity downlink).
 
 Lehman and Yao don't require read locks, but assume that in-memory
 copies of tree pages are unshared.  Postgres shares in-memory buffers
@@ -610,21 +607,25 @@ scanned to decide whether to return the entry and whether the scan can
 stop (see _bt_checkkeys()).
 
 We use term "pivot" index tuples to distinguish tuples which don't point
-to heap tuples, but rather used for tree navigation.  Pivot tuples includes
-all tuples on non-leaf pages and high keys on leaf pages.  Note that pivot
-index tuples are only used to represent which part of the key space belongs
-on each page, and can have attribute values copied from non-pivot tuples
-that were deleted and killed by VACUUM some time ago.  In principle, we could
-truncate away attributes that are not needed for a page high key during a leaf
-page split, provided that the remaining attributes distinguish the last index
-tuple on the post-split left page as belonging on the left page, and the first
-index tuple on the post-split right page as belonging on the right page.  This
-optimization is sometimes called suffix truncation, and may appear in a future
-release. Since the high key is subsequently reused as the downlink in the
-parent page for the new right page, suffix truncation can increase index
-fan-out considerably by keeping pivot tuples short.  INCLUDE indexes similarly
-truncate away non-key attributes at the time of a leaf page split,
-increasing fan-out.
+to heap tuples, that are used only for tree navigation.  Pivot tuples
+includes all tuples on non-leaf pages and high keys on leaf pages.  Note
+that pivot index tuples are only used to represent which part of the key
+space belongs on each page, and can have attribute values copied from
+non-pivot tuples that were deleted and killed by VACUUM some time ago.
+
+We truncate away attributes that are not needed for a page high key during
+a leaf page split, provided that the remaining attributes distinguish the
+last index tuple on the post-split left page as belonging on the left
+page, and the first index tuple on the post-split right page as belonging
+on the right page.  A truncated tuple logically retains the truncated
+suffix key attributes, which implicitly have "negative infinity" as their
+value.  This optimization is called suffix truncation.  Since the high key
+is subsequently reused as the downlink in the parent page for the new
+right page, suffix truncation can increase index fan-out considerably by
+keeping pivot tuples short.  INCLUDE indexes are guaranteed to have
+non-key attributes truncated at the time of a leaf page split, but may
+also have some key attributes truncated away, based on the usual criteria
+for key attributes.
 
 Notes About Data Representation
 -------------------------------
@@ -658,4 +659,19 @@ downlink.  The first data item on each such page has no lower bound
 routines must treat it accordingly.  The actual key stored in the
 item is irrelevant, and need not be stored at all.  This arrangement
 corresponds to the fact that an L&Y non-leaf page has one more pointer
-than key.
+than key.  Suffix truncation's negative infinity attributes behave in
+the same way.
+
+Non-leaf pages only truly need to truncate their first item to zero
+attributes at the leftmost level, since that truly is negative infinity.
+All other negative infinity items are only really negative infinity
+within the subtree that the page is at the root of (or is a leftmost
+page within).  We truncate away all attributes of the first item on
+non-leaf pages just the same, to save a little space.  If we ever
+avoided zero-truncating items on pages where that doesn't accurately
+represent the absolute separation of the keyspace, we'd be left with
+"low key" items on internal pages -- a key value that can be used as a
+lower bound on items on the page, much like the high key is an upper
+bound. (Actually, that would even be true of "true" negative infinity
+items.  One can think of rightmost pages as implicitly containing
+"positive infinity" high keys.)
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index 907cce0..4c4f7d8 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -180,7 +180,7 @@ top:
 				!P_IGNORE(lpageop) &&
 				(PageGetFreeSpace(page) > itemsz) &&
 				PageGetMaxOffsetNumber(page) >= P_FIRSTDATAKEY(lpageop) &&
-				_bt_compare(rel, indnkeyatts, itup_scankey, page,
+				_bt_compare(rel, indnkeyatts, itup_scankey, &itup->t_tid, page,
 							P_FIRSTDATAKEY(lpageop)) > 0)
 			{
 				/*
@@ -216,9 +216,12 @@ top:
 
 	if (!fastpath)
 	{
+		ItemPointer scantid =
+			(checkUnique != UNIQUE_CHECK_NO ? NULL : &itup->t_tid);
+
 		/* find the first page containing this key */
-		stack = _bt_search(rel, indnkeyatts, itup_scankey, false, &buf, BT_WRITE,
-						   NULL);
+		stack = _bt_search(rel, indnkeyatts, itup_scankey, scantid, false,
+						   &buf, BT_WRITE, NULL);
 
 		/* trade in our read lock for a write lock */
 		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
@@ -231,8 +234,8 @@ top:
 		 * need to move right in the tree.  See Lehman and Yao for an
 		 * excruciatingly precise description.
 		 */
-		buf = _bt_moveright(rel, buf, indnkeyatts, itup_scankey, false,
-							true, stack, BT_WRITE, NULL);
+		buf = _bt_moveright(rel, buf, indnkeyatts, itup_scankey, scantid,
+							false, true, stack, BT_WRITE, NULL);
 	}
 
 	/*
@@ -261,7 +264,8 @@ top:
 		TransactionId xwait;
 		uint32		speculativeToken;
 
-		offset = _bt_binsrch(rel, buf, indnkeyatts, itup_scankey, false);
+		/* Find position while excluding heap TID attribute */
+		offset = _bt_binsrch(rel, buf, indnkeyatts, itup_scankey, NULL, false);
 		xwait = _bt_check_unique(rel, itup, heapRel, buf, offset, itup_scankey,
 								 checkUnique, &is_unique, &speculativeToken);
 
@@ -285,6 +289,25 @@ top:
 				_bt_freestack(stack);
 			goto top;
 		}
+
+		/*
+		 * Be careful to not get confused about user attribute position and
+		 * insertion position.
+		 *
+		 * XXX: This is ugly as sin, and clearly needs a lot more work.  While
+		 * not having this code does not seem to affect regression tests, we
+		 * almost certainly need to do something here for the case where
+		 * _bt_check_unique() traverses many pages, each filled with logical
+		 * duplicates.
+		 */
+		buf = _bt_moveright(rel, buf, indnkeyatts, itup_scankey, &itup->t_tid,
+							false, true, stack, BT_WRITE, NULL);
+		/*
+		 * Always invalidate hint
+		 *
+		 * FIXME: This is unacceptable.
+		 */
+		offset = InvalidOffsetNumber;
 	}
 
 	if (checkUnique != UNIQUE_CHECK_EXISTING)
@@ -564,11 +587,11 @@ _bt_check_unique(Relation rel, IndexTuple itup, Relation heapRel,
 			offset = OffsetNumberNext(offset);
 		else
 		{
-			/* If scankey == hikey we gotta check the next page too */
+			/* If scankey <= hikey we gotta check the next page too */
 			if (P_RIGHTMOST(opaque))
 				break;
-			if (!_bt_isequal(itupdesc, page, P_HIKEY,
-							 indnkeyatts, itup_scankey))
+			/* _bt_isequal()'s special NULL semantics not required here */
+			if (_bt_compare(rel, indnkeyatts, itup_scankey, NULL, page, P_HIKEY) > 0)
 				break;
 			/* Advance to next non-dead page --- there must be one */
 			for (;;)
@@ -700,6 +723,18 @@ _bt_findinsertloc(Relation rel,
 	 * pages).  Currently the probability of moving right is set at 0.99,
 	 * which may seem too high to change the behavior much, but it does an
 	 * excellent job of preventing O(N^2) behavior with many equal keys.
+	 *
+	 * TODO: Support this old approach for pre-pg_upgrade indexes.
+	 *
+	 * None of this applies when all items in the tree are unique, since the
+	 * new item cannot go on either page if it's equal to the high key.  The
+	 * original L&Y invariant that we now follow is that high keys must be
+	 * less than or equal to all items on the page, and strictly less than
+	 * the right sibling items (since the high key also becomes the downlink
+	 * to the right sibling in parent after a page split).  It's very
+	 * unlikely that it will be equal anyway, since there will be explicit
+	 * heap TIDs in pivot tuples in the event of many duplicates, but it can
+	 * happen when heap TID recycling takes place.
 	 *----------
 	 */
 	movedright = false;
@@ -731,8 +766,7 @@ _bt_findinsertloc(Relation rel,
 		 * nope, so check conditions (b) and (c) enumerated above
 		 */
 		if (P_RIGHTMOST(lpageop) ||
-			_bt_compare(rel, keysz, scankey, page, P_HIKEY) != 0 ||
-			random() <= (MAX_RANDOM_VALUE / 100))
+			_bt_compare(rel, keysz, scankey, &newtup->t_tid, page, P_HIKEY) <= 0)
 			break;
 
 		/*
@@ -792,7 +826,7 @@ _bt_findinsertloc(Relation rel,
 	else if (firstlegaloff != InvalidOffsetNumber && !vacuumed)
 		newitemoff = firstlegaloff;
 	else
-		newitemoff = _bt_binsrch(rel, buf, keysz, scankey, false);
+		newitemoff = _bt_binsrch(rel, buf, keysz, scankey, &newtup->t_tid, false);
 
 	*bufptr = buf;
 	*offsetptr = newitemoff;
@@ -851,11 +885,12 @@ _bt_insertonpg(Relation rel,
 	/* child buffer must be given iff inserting on an internal page */
 	Assert(P_ISLEAF(lpageop) == !BufferIsValid(cbuf));
 	/* tuple must have appropriate number of attributes */
+	Assert(BTreeTupleGetNAtts(itup, rel) > 0);
 	Assert(!P_ISLEAF(lpageop) ||
 		   BTreeTupleGetNAtts(itup, rel) ==
 		   IndexRelationGetNumberOfAttributes(rel));
 	Assert(P_ISLEAF(lpageop) ||
-		   BTreeTupleGetNAtts(itup, rel) ==
+		   BTreeTupleGetNAtts(itup, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
 
 	/* The caller should've finished any incomplete splits already. */
@@ -1143,8 +1178,6 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 	OffsetNumber i;
 	bool		isleaf;
 	IndexTuple	lefthikey;
-	int			indnatts = IndexRelationGetNumberOfAttributes(rel);
-	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 
 	/* Acquire a new page to split into */
 	rbuf = _bt_getbuf(rel, P_NEW, BT_WRITE);
@@ -1214,7 +1247,9 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		itemid = PageGetItemId(origpage, P_HIKEY);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
-		Assert(BTreeTupleGetNAtts(item, rel) == indnkeyatts);
+		Assert(BTreeTupleGetNAtts(item, rel) > 0);
+		Assert(BTreeTupleGetNAtts(item, rel) <=
+			   IndexRelationGetNumberOfKeyAttributes(rel));
 		if (PageAddItem(rightpage, (Item) item, itemsz, rightoff,
 						false, false) == InvalidOffsetNumber)
 		{
@@ -1247,25 +1282,35 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 	}
 
 	/*
-	 * Truncate non-key (INCLUDE) attributes of the high key item before
-	 * inserting it on the left page.  This only needs to happen at the leaf
-	 * level, since in general all pivot tuple values originate from leaf
-	 * level high keys.  This isn't just about avoiding unnecessary work,
-	 * though; truncating unneeded key attributes (more aggressive suffix
-	 * truncation) can only be performed at the leaf level anyway.  This is
-	 * because a pivot tuple in a grandparent page must guide a search not
-	 * only to the correct parent page, but also to the correct leaf page.
+	 * Truncate attributes of the high key item before inserting it on the left
+	 * page.  This can only happen at the leaf level, since in general all
+	 * pivot tuple values originate from leaf level high keys.  This isn't just
+	 * about avoiding unnecessary work, though; truncating unneeded key suffix
+	 * attributes can only be performed at the leaf level anyway.  This is
+	 * because a pivot tuple in a grandparent page must guide a search not only
+	 * to the correct parent page, but also to the correct leaf page.
+	 *
+	 * Note that non-key (INCLUDE) attributes are always truncated away here.
+	 * Additional key attributes are truncated away when they're not required
+	 * to correctly separate the key space.
+	 *
+	 * TODO: Give a little weight to how large the final downlink will be when
+	 * deciding on a split point.
 	 */
-	if (indnatts != indnkeyatts && isleaf)
+	if (isleaf)
 	{
-		lefthikey = _bt_nonkey_truncate(rel, item);
+		OffsetNumber	lastleftoffnum = OffsetNumberPrev(firstright);
+
+		lefthikey = _bt_suffix_truncate(rel, origpage, lastleftoffnum , item);
 		itemsz = IndexTupleSize(lefthikey);
 		itemsz = MAXALIGN(itemsz);
 	}
 	else
 		lefthikey = item;
 
-	Assert(BTreeTupleGetNAtts(lefthikey, rel) == indnkeyatts);
+	Assert(BTreeTupleGetNAtts(lefthikey, rel) > 0);
+	Assert(BTreeTupleGetNAtts(lefthikey, rel) <=
+		   IndexRelationGetNumberOfKeyAttributes(rel));
 	if (PageAddItem(leftpage, (Item) lefthikey, itemsz, leftoff,
 					false, false) == InvalidOffsetNumber)
 	{
@@ -1487,22 +1532,11 @@ _bt_split(Relation rel, Buffer buf, Buffer cbuf, OffsetNumber firstright,
 		if (newitemonleft)
 			XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
 
-		/* Log left page */
-		if (!isleaf || indnatts != indnkeyatts)
-		{
-			/*
-			 * We must also log the left page's high key.  There are two
-			 * reasons for that: right page's leftmost key is suppressed on
-			 * non-leaf levels and in covering indexes included columns are
-			 * truncated from high keys.  Show it as belonging to the left
-			 * page buffer, so that it is not stored if XLogInsert decides it
-			 * needs a full-page image of the left page.
-			 */
-			itemid = PageGetItemId(origpage, P_HIKEY);
-			item = (IndexTuple) PageGetItem(origpage, itemid);
-			XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
-			loglhikey = true;
-		}
+		/* Log left page.  We must also log the left page's high key. */
+		itemid = PageGetItemId(origpage, P_HIKEY);
+		item = (IndexTuple) PageGetItem(origpage, itemid);
+		XLogRegisterBufData(0, (char *) item, MAXALIGN(IndexTupleSize(item)));
+		loglhikey = true;
 
 		/*
 		 * Log the contents of the right page in the format understood by
@@ -2210,7 +2244,8 @@ _bt_newroot(Relation rel, Buffer lbuf, Buffer rbuf)
 	/*
 	 * insert the right page pointer into the new root page.
 	 */
-	Assert(BTreeTupleGetNAtts(right_item, rel) ==
+	Assert(BTreeTupleGetNAtts(right_item, rel) > 0);
+	Assert(BTreeTupleGetNAtts(right_item, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
 	if (PageAddItem(rootpage, (Item) right_item, right_item_sz, P_FIRSTKEY,
 					false, false) == InvalidOffsetNumber)
@@ -2322,8 +2357,8 @@ _bt_pgaddtup(Page page,
 /*
  * _bt_isequal - used in _bt_doinsert in check for duplicates.
  *
- * This is very similar to _bt_compare, except for NULL handling.
- * Rule is simple: NOT_NULL not equal NULL, NULL not equal NULL too.
+ * This is very similar to _bt_compare, except for NULL and negative infinity
+ * handling.  Rule is simple: NOT_NULL not equal NULL, NULL not equal NULL too.
  */
 static bool
 _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
@@ -2337,12 +2372,6 @@ _bt_isequal(TupleDesc itupdesc, Page page, OffsetNumber offnum,
 
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
 
-	/*
-	 * It's okay that we might perform a comparison against a truncated page
-	 * high key when caller needs to determine if _bt_check_unique scan must
-	 * continue on to the next page.  Caller never asks us to compare non-key
-	 * attributes within an INCLUDE index.
-	 */
 	for (i = 1; i <= keysz; i++)
 	{
 		AttrNumber	attno;
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index a24e641..25b24b1 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -1415,8 +1415,10 @@ _bt_pagedel(Relation rel, Buffer buf)
 				itup_scankey = _bt_mkscankey(rel, targetkey);
 				/* find the leftmost leaf page containing this key */
 				stack = _bt_search(rel,
-								   IndexRelationGetNumberOfKeyAttributes(rel),
-								   itup_scankey, false, &lbuf, BT_READ, NULL);
+								   BTreeTupleGetNAtts(targetkey, rel),
+								   itup_scankey,
+								   BTreeTupleGetHeapTID(targetkey), false,
+								   &lbuf, BT_READ, NULL);
 				/* don't need a pin on the page */
 				_bt_relbuf(rel, lbuf);
 
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 0bcfa10..8b70f34 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -94,8 +94,8 @@ _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp)
  * any incomplete splits encountered during the search will be finished.
  */
 BTStack
-_bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
-		   Buffer *bufP, int access, Snapshot snapshot)
+_bt_search(Relation rel, int keysz, ScanKey scankey, ItemPointer scantid,
+		   bool nextkey, Buffer *bufP, int access, Snapshot snapshot)
 {
 	BTStack		stack_in = NULL;
 
@@ -130,7 +130,7 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * if the leaf page is split and we insert to the parent page).  But
 		 * this is a good opportunity to finish splits of internal pages too.
 		 */
-		*bufP = _bt_moveright(rel, *bufP, keysz, scankey, nextkey,
+		*bufP = _bt_moveright(rel, *bufP, keysz, scankey, scantid, nextkey,
 							  (access == BT_WRITE), stack_in,
 							  BT_READ, snapshot);
 
@@ -144,7 +144,7 @@ _bt_search(Relation rel, int keysz, ScanKey scankey, bool nextkey,
 		 * Find the appropriate item on the internal page, and get the child
 		 * page that it points to.
 		 */
-		offnum = _bt_binsrch(rel, *bufP, keysz, scankey, nextkey);
+		offnum = _bt_binsrch(rel, *bufP, keysz, scankey, scantid, nextkey);
 		itemid = PageGetItemId(page, offnum);
 		itup = (IndexTuple) PageGetItem(page, itemid);
 		blkno = BTreeInnerTupleGetDownLink(itup);
@@ -215,6 +215,7 @@ _bt_moveright(Relation rel,
 			  Buffer buf,
 			  int keysz,
 			  ScanKey scankey,
+			  ItemPointer scantid,
 			  bool nextkey,
 			  bool forupdate,
 			  BTStack stack,
@@ -275,7 +276,7 @@ _bt_moveright(Relation rel,
 			continue;
 		}
 
-		if (P_IGNORE(opaque) || _bt_compare(rel, keysz, scankey, page, P_HIKEY) >= cmpval)
+		if (P_IGNORE(opaque) || _bt_compare(rel, keysz, scankey, scantid, page, P_HIKEY) >= cmpval)
 		{
 			/* step right one page */
 			buf = _bt_relandgetbuf(rel, buf, opaque->btpo_next, access);
@@ -324,6 +325,7 @@ _bt_binsrch(Relation rel,
 			Buffer buf,
 			int keysz,
 			ScanKey scankey,
+			ItemPointer scantid,
 			bool nextkey)
 {
 	Page		page;
@@ -371,7 +373,7 @@ _bt_binsrch(Relation rel,
 
 		/* We have low <= mid < high, so mid points at a real slot */
 
-		result = _bt_compare(rel, keysz, scankey, page, mid);
+		result = _bt_compare(rel, keysz, scankey, scantid, page, mid);
 
 		if (result >= cmpval)
 			low = mid + 1;
@@ -428,24 +430,36 @@ int32
 _bt_compare(Relation rel,
 			int keysz,
 			ScanKey scankey,
+			ItemPointer scantid,
 			Page page,
 			OffsetNumber offnum)
 {
 	TupleDesc	itupdesc = RelationGetDescr(rel);
 	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	ItemPointer  heapTid;
 	IndexTuple	itup;
+	int			ntupatts;
+	int			ncmpkey;
 	int			i;
 
+	Assert(keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
 	Assert(_bt_check_natts(rel, page, offnum));
 
 	/*
 	 * Force result ">" if target item is first data item on an internal page
 	 * --- see NOTE above.
+	 *
+	 * A minus infinity key has all attributes truncated away, so this test is
+	 * redundant with the minus infinity attribute tie-breaker.  However, the
+	 * number of attributes in minus infinity tuples was not explicitly
+	 * represented as 0 until PostgreSQL v11, so an explicit offnum test is
+	 * still required.
 	 */
 	if (!P_ISLEAF(opaque) && offnum == P_FIRSTDATAKEY(opaque))
 		return 1;
 
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+	ntupatts = BTreeTupleGetNAtts(itup, rel);
 
 	/*
 	 * The scan key is set up with the attribute number associated with each
@@ -459,7 +473,8 @@ _bt_compare(Relation rel,
 	 * _bt_first).
 	 */
 
-	for (i = 1; i <= keysz; i++)
+	ncmpkey = Min(ntupatts, keysz);
+	for (i = 1; i <= ncmpkey; i++)
 	{
 		Datum		datum;
 		bool		isNull;
@@ -510,8 +525,67 @@ _bt_compare(Relation rel,
 		scankey++;
 	}
 
-	/* if we get here, the keys are equal */
-	return 0;
+	/*
+	 * Use the number of attributes as a tie-breaker, in order to treat
+	 * truncated attributes in index as minus infinity.
+	 */
+	if (keysz > ntupatts)
+		return 1;
+
+	/* If caller provided no heap TID tie-breaker for scan, they're equal */
+	if (!scantid)
+		return 0;
+
+	/*
+	 * Although it isn't counted as an attribute by BTreeTupleGetNAtts(), heap
+	 * TID is an implicit final key attribute that ensures that all index
+	 * tuples have a distinct set of key attribute values.
+	 *
+	 * This is often truncated away in pivot tuples, which makes the attribute
+	 * value implicitly negative infinity.
+	 */
+	heapTid = BTreeTupleGetHeapTID(itup);
+	if (!heapTid)
+		return 1;
+	return ItemPointerCompare(scantid, heapTid);
+}
+
+/*
+ * Return how many attributes to leave when truncating.
+ *
+ * This only considers key attributes, since non-key attributes should always
+ * be truncated away.  We only need attributes up to and including the first
+ * distinguishing attribute.
+ *
+ * This can return a number of attributes that is one greater than the number
+ * of key attributes actually found in the first right tuple.  This indicates
+ * that the caller must use the leftmost heap TID as a unique-ifier in its new
+ * high key tuple.
+ */
+int
+_bt_leave_natts(Relation rel, Page leftpage, OffsetNumber lastleftoffnum,
+				IndexTuple firstright)
+{
+	int			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+	int			leavenatts;
+	ScanKey		skey;
+
+	skey = _bt_mkscankey(rel, firstright);
+
+	/*
+	 * Even test nkeyatts (untruncated) case, since caller cares about whether
+	 * or not it can avoid appending a heap TID as a unique-ifier
+	 */
+	for (leavenatts = 1; leavenatts <= nkeyatts; leavenatts++)
+	{
+		if (_bt_compare(rel, leavenatts, skey, NULL, leftpage, lastleftoffnum) > 0)
+			break;
+	}
+
+	/* Can't leak memory here */
+	_bt_freeskey(skey);
+
+	return leavenatts;
 }
 
 /*
@@ -1027,7 +1101,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	 * Use the manufactured insertion scan key to descend the tree and
 	 * position ourselves on the target leaf page.
 	 */
-	stack = _bt_search(rel, keysCount, scankeys, nextkey, &buf, BT_READ,
+	stack = _bt_search(rel, keysCount, scankeys, NULL, nextkey, &buf, BT_READ,
 					   scan->xs_snapshot);
 
 	/* don't need to keep the stack around... */
@@ -1057,7 +1131,7 @@ _bt_first(IndexScanDesc scan, ScanDirection dir)
 	_bt_initialize_more_data(so, dir);
 
 	/* position to the precise item on the page */
-	offnum = _bt_binsrch(rel, buf, keysCount, scankeys, nextkey);
+	offnum = _bt_binsrch(rel, buf, keysCount, scankeys, NULL, nextkey);
 
 	/*
 	 * If nextkey = false, we are positioned at the first item >= scan key, or
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index e012df5..07324d4 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -805,8 +805,6 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	OffsetNumber last_off;
 	Size		pgspc;
 	Size		itupsz;
-	int			indnatts = IndexRelationGetNumberOfAttributes(wstate->index);
-	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(wstate->index);
 
 	/*
 	 * This is a handy place to check for cancel interrupts during the btree
@@ -889,17 +887,17 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		ItemIdSetUnused(ii);	/* redundant */
 		((PageHeader) opage)->pd_lower -= sizeof(ItemIdData);
 
-		if (indnkeyatts != indnatts && P_ISLEAF(opageop))
+		if (P_ISLEAF(opageop))
 		{
-			IndexTuple	truncated;
-			Size		truncsz;
+			OffsetNumber	lastleftoffnum = OffsetNumberPrev(last_off);
+			IndexTuple		truncated;
+			Size			truncsz;
 
 			/*
-			 * Truncate any non-key attributes from high key on leaf level
-			 * (i.e. truncate on leaf level if we're building an INCLUDE
-			 * index).  This is only done at the leaf level because downlinks
-			 * in internal pages are either negative infinity items, or get
-			 * their contents from copying from one level down.  See also:
+			 * Truncate away any unneeded attributes from high key on leaf
+			 * level.  This is only done at the leaf level because downlinks in
+			 * internal pages are either negative infinity items, or get their
+			 * contents from copying from one level down.  See also:
 			 * _bt_split().
 			 *
 			 * Since the truncated tuple is probably smaller than the
@@ -913,8 +911,12 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 			 * only shift the line pointer array back and forth, and overwrite
 			 * the latter portion of the space occupied by the original tuple.
 			 * This is fairly cheap.
+			 *
+			 * TODO: Give a little weight to how large the final downlink will
+			 * be when deciding on a split point.
 			 */
-			truncated = _bt_nonkey_truncate(wstate->index, oitup);
+			truncated = _bt_suffix_truncate(wstate->index, opage,
+											lastleftoffnum, oitup);
 			truncsz = IndexTupleSize(truncated);
 			PageIndexTupleDelete(opage, P_HIKEY);
 			_bt_sortaddtup(opage, truncsz, truncated, P_HIKEY);
@@ -933,8 +935,9 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		if (state->btps_next == NULL)
 			state->btps_next = _bt_pagestate(wstate, state->btps_level + 1);
 
-		Assert(BTreeTupleGetNAtts(state->btps_minkey, wstate->index) ==
-			   IndexRelationGetNumberOfKeyAttributes(wstate->index) ||
+		Assert((BTreeTupleGetNAtts(state->btps_minkey, wstate->index) <=
+				IndexRelationGetNumberOfKeyAttributes(wstate->index) &&
+				BTreeTupleGetNAtts(state->btps_minkey, wstate->index) > 0) ||
 			   P_LEFTMOST(opageop));
 		Assert(BTreeTupleGetNAtts(state->btps_minkey, wstate->index) == 0 ||
 			   !P_LEFTMOST(opageop));
@@ -979,7 +982,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 	 * the first item for a page is copied from the prior page in the code
 	 * above.  Since the minimum key for an entire level is only used as a
 	 * minus infinity downlink, and never as a high key, there is no need to
-	 * truncate away non-key attributes at this point.
+	 * truncate away suffix attributes at this point.
 	 */
 	if (last_off == P_HIKEY)
 	{
@@ -1038,8 +1041,9 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 		}
 		else
 		{
-			Assert(BTreeTupleGetNAtts(s->btps_minkey, wstate->index) ==
-				   IndexRelationGetNumberOfKeyAttributes(wstate->index) ||
+			Assert((BTreeTupleGetNAtts(s->btps_minkey, wstate->index) <=
+					IndexRelationGetNumberOfKeyAttributes(wstate->index) &&
+					BTreeTupleGetNAtts(s->btps_minkey, wstate->index) > 0) ||
 				   P_LEFTMOST(opaque));
 			Assert(BTreeTupleGetNAtts(s->btps_minkey, wstate->index) == 0 ||
 				   !P_LEFTMOST(opaque));
@@ -1136,6 +1140,8 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 			}
 			else if (itup != NULL)
 			{
+				int32		compare = 0;
+
 				for (i = 1; i <= keysz; i++)
 				{
 					SortSupport entry;
@@ -1143,7 +1149,6 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 								attrDatum2;
 					bool		isNull1,
 								isNull2;
-					int32		compare;
 
 					entry = sortKeys + i - 1;
 					attrDatum1 = index_getattr(itup, i, tupdes, &isNull1);
@@ -1160,6 +1165,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 					else if (compare < 0)
 						break;
 				}
+
+				/*
+				 * If key values are equal, we sort on ItemPointer.  This is
+				 * required for btree indexes, since heap TID is treated as an
+				 * implicit last key attribute in order to ensure that all keys
+				 * in the index are physically unique.
+				 */
+				if (compare == 0)
+				{
+					compare = ItemPointerCompare(&itup->t_tid, &itup2->t_tid);
+					Assert(compare != 0);
+					if (compare > 0)
+						load1 = false;
+				}
 			}
 			else
 				load1 = false;
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index acb9443..b9f9883 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -56,27 +56,34 @@ static bool _bt_check_rowcompare(ScanKey skey,
  *		Build an insertion scan key that contains comparison data from itup
  *		as well as comparator routines appropriate to the key datatypes.
  *
- *		The result is intended for use with _bt_compare().
+ *		The result is intended for use with _bt_compare().  If itup has
+ *		undergone suffix truncation of key attributes, caller had better
+ *		pass BTreeTupleGetNAtts(itup, rel) as keysz to routines like
+ *		_bt_search() and _bt_compare() when using returned scan key.  This
+ *		allows truncated attributes to participate in comparisons (truncated
+ *		attributes have implicit negative infinity values).  Note that
+ *		_bt_compare() never treats a scan key as containing negative
+ *		infinity attributes.
  */
 ScanKey
 _bt_mkscankey(Relation rel, IndexTuple itup)
 {
 	ScanKey		skey;
 	TupleDesc	itupdesc;
+	int			tupnatts;
 	int			indnatts PG_USED_FOR_ASSERTS_ONLY;
 	int			indnkeyatts;
 	int16	   *indoption;
 	int			i;
 
 	itupdesc = RelationGetDescr(rel);
+	tupnatts = BTreeTupleGetNAtts(itup, rel);
 	indnatts = IndexRelationGetNumberOfAttributes(rel);
 	indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	indoption = rel->rd_indoption;
 
-	Assert(indnkeyatts > 0);
-	Assert(indnkeyatts <= indnatts);
-	Assert(BTreeTupleGetNAtts(itup, rel) == indnatts ||
-		   BTreeTupleGetNAtts(itup, rel) == indnkeyatts);
+	Assert(tupnatts > 0);
+	Assert(tupnatts <= indnatts);
 
 	/*
 	 * We'll execute search using scan key constructed on key columns. Non-key
@@ -96,7 +103,21 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 		 * comparison can be needed.
 		 */
 		procinfo = index_getprocinfo(rel, i + 1, BTORDER_PROC);
-		arg = index_getattr(itup, i + 1, itupdesc, &null);
+
+		/*
+		 * Truncated key attributes may not be represented in index tuple
+		 * due to suffix truncation.  Keys built from truncated attributes
+		 * are defensively represented as NULL values, though they should
+		 * still not be allowed to participate in comparisons (caller must
+		 * be sure to pass a sane keysz to _bt_compare()).
+		 */
+		if (i < tupnatts)
+			arg = index_getattr(itup, i + 1, itupdesc, &null);
+		else
+		{
+			arg = (Datum) 0;
+			null = true;
+		}
 		flags = (null ? SK_ISNULL : 0) | (indoption[i] << SK_BT_INDOPTION_SHIFT);
 		ScanKeyEntryInitializeWithInfo(&skey[i],
 									   flags,
@@ -2083,38 +2104,116 @@ btproperty(Oid index_oid, int attno,
 }
 
 /*
- *	_bt_nonkey_truncate() -- create tuple without non-key suffix attributes.
+ *	_bt_suffix_truncate() -- create tuple without unneeded suffix attributes.
  *
  * Returns truncated index tuple allocated in caller's memory context, with key
- * attributes copied from caller's itup argument.  Currently, suffix truncation
- * is only performed to create pivot tuples in INCLUDE indexes, but some day it
- * could be generalized to remove suffix attributes after the first
- * distinguishing key attribute.
+ * attributes copied from caller's itup argument.  If rel is an INCLUDE index,
+ * non-key attributes are always truncated away, since they're not part of the
+ * key space, and are not used in pivot tuples.  More aggressive suffix
+ * truncation can take place when it's clear that the returned tuple does not
+ * need one or more suffix key attributes.  This is possible when there are
+ * attributes after an already distinct pair of attributes.
  *
- * Truncated tuple is guaranteed to be no larger than the original, which is
- * important for staying under the 1/3 of a page restriction on tuple size.
+ * Truncated tuple is guaranteed to be no larger than the original plus space
+ * for an extra heap TID tie-breaker attribute, which is important for staying
+ * under the 1/3 of a page restriction on tuple size.
  *
  * Note that returned tuple's t_tid offset will hold the number of attributes
  * present, so the original item pointer offset is not represented.  Caller
- * should only change truncated tuple's downlink.
+ * should only change truncated tuple's downlink.  Note also that truncated key
+ * attributes are treated as containing "minus infinity" values by
+ * _bt_compare().
  */
 IndexTuple
-_bt_nonkey_truncate(Relation rel, IndexTuple itup)
+_bt_suffix_truncate(Relation rel, Page leftpage, OffsetNumber lastleftoffnum,
+					IndexTuple firstright)
 {
-	int			nkeyattrs = IndexRelationGetNumberOfKeyAttributes(rel);
-	IndexTuple	truncated;
+	TupleDesc		itupdesc = RelationGetDescr(rel);
+	int16			natts = IndexRelationGetNumberOfAttributes(rel);
+	int16			nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
+	int				leavenatts;
+	IndexTuple		pivot;
+	ItemId			lastleftitem;
+	IndexTuple		lastlefttuple;
+	Size			newsize;
+
 
 	/*
-	 * We should only ever truncate leaf index tuples, which must have both
-	 * key and non-key attributes.  It's never okay to truncate a second time.
+	 * We should only ever truncate leaf index tuples, which must have non-key
+	 * attributes in the case of INCLUDE indexes.  It's never okay to truncate
+	 * a second time.
 	 */
-	Assert(BTreeTupleGetNAtts(itup, rel) ==
-		   IndexRelationGetNumberOfAttributes(rel));
+	Assert(BTreeTupleGetNAtts(firstright, rel) == natts);
 
-	truncated = index_truncate_tuple(RelationGetDescr(rel), itup, nkeyattrs);
-	BTreeTupleSetNAtts(truncated, nkeyattrs);
+	/* Determine how many attributes must be left behind */
+	leavenatts = _bt_leave_natts(rel, leftpage, lastleftoffnum, firstright);
 
-	return truncated;
+	if (leavenatts <= natts)
+	{
+		IndexTuple		tidpivot;
+
+		/*
+		 * Truncate away non-key attributes and/or key attributes.  Do a
+		 * straight copy in the case where the only attribute to be "truncated
+		 * away" is the implicit heap TID key attribute (i.e. the case where we
+		 * can at least avoid adding an explicit heap TID attribute to new
+		 * pivot).
+		 */
+		if (leavenatts < natts)
+			pivot = index_truncate_tuple(itupdesc, firstright, leavenatts);
+		else
+			pivot = CopyIndexTuple(firstright);
+
+		/*
+		 * If there is a distinguishing key attribute within leavenatts, there
+		 * is no need to add an explicit heap TID attribute to new pivot.
+		 */
+		if (leavenatts <= nkeyatts)
+		{
+			BTreeTupleSetNAtts(pivot, leavenatts);
+			return pivot;
+		}
+
+		/*
+		 * Only non-key attributes could be truncated away.  They are not
+		 * considered part of the key space, so it's still necessary to add a
+		 * heap TID attribute to the new pivot tuple.  Create enlarged copy of
+		 * truncated right tuple copy, to fit heap TID.
+		 */
+		newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
+		tidpivot = palloc0(newsize);
+		memcpy(tidpivot, pivot, IndexTupleSize(pivot));
+		pfree(pivot);
+		pivot = tidpivot;
+	}
+	else
+	{
+		/*
+		 * No truncation was possible. Create enlarged copy of first right
+		 * tuple, to fit heap TID
+		 */
+		newsize = IndexTupleSize(firstright) + MAXALIGN(sizeof(ItemPointerData));
+		pivot = palloc0(newsize);
+		memcpy(pivot, firstright, IndexTupleSize(firstright));
+	}
+
+	/*
+	 * We must use heap TID as a unique-ifier in new pivot tuple, since no user
+	 * key attributes could be truncated away.  The heap TID must comes from
+	 * the last tuple on the left page, since new downlinks must be a strict
+	 * lower bound on new right page.
+	 */
+	pivot->t_info &= ~INDEX_SIZE_MASK;
+	pivot->t_info |= newsize;
+	/* Copy last left item's heap TID into new pivot tuple */
+	lastleftitem = PageGetItemId(leftpage, lastleftoffnum);
+	lastlefttuple = (IndexTuple) PageGetItem(leftpage, lastleftitem);
+	memcpy((char *) pivot + newsize - MAXALIGN(sizeof(ItemPointerData)),
+		   &lastlefttuple->t_tid, sizeof(ItemPointerData));
+	/* Tuple has all key attributes */
+	BTreeTupleSetNAtts(pivot, nkeyatts);
+	BTreeTupleSetHeapTID(pivot);
+	return pivot;
 }
 
 /*
@@ -2137,6 +2236,7 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 	int16		nkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 	BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
 	IndexTuple	itup;
+	int			tupnatts;
 
 	/*
 	 * We cannot reliably test a deleted or half-deleted page, since they have
@@ -2156,6 +2256,7 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 					 "BT_N_KEYS_OFFSET_MASK can't fit INDEX_MAX_KEYS");
 
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
+	tupnatts = BTreeTupleGetNAtts(itup, rel);
 
 	if (P_ISLEAF(opaque))
 	{
@@ -2165,7 +2266,7 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 			 * Leaf tuples that are not the page high key (non-pivot tuples)
 			 * should never be truncated
 			 */
-			return BTreeTupleGetNAtts(itup, rel) == natts;
+			return tupnatts == natts;
 		}
 		else
 		{
@@ -2176,7 +2277,7 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 			Assert(!P_RIGHTMOST(opaque));
 
 			/* Page high key tuple contains only key attributes */
-			return BTreeTupleGetNAtts(itup, rel) == nkeyatts;
+			return tupnatts > 0 && tupnatts <= nkeyatts;
 		}
 	}
 	else						/* !P_ISLEAF(opaque) */
@@ -2209,7 +2310,7 @@ _bt_check_natts(Relation rel, Page page, OffsetNumber offnum)
 			 * Tuple contains only key attributes despite on is it page high
 			 * key or not
 			 */
-			return BTreeTupleGetNAtts(itup, rel) == nkeyatts;
+			return tupnatts > 0 && tupnatts <= nkeyatts;
 		}
 
 	}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 67a94cb..f1d286c 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -252,6 +252,9 @@ btree_xlog_split(bool onleft, bool lhighkey, XLogReaderState *record)
 	 * When the high key isn't present is the wal record, then we assume it to
 	 * be equal to the first key on the right page.  It must be from the leaf
 	 * level.
+	 *
+	 * FIXME:  We current always log the high key.  Is it worth trying to
+	 * salvage case where logging isn't strictly necessary, and can be avoided?
 	 */
 	if (!lhighkey)
 	{
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 9fb33b9..6e8a216 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -4057,9 +4057,10 @@ comparetup_index_btree(const SortTuple *a, const SortTuple *b,
 	}
 
 	/*
-	 * If key values are equal, we sort on ItemPointer.  This does not affect
-	 * validity of the finished index, but it may be useful to have index
-	 * scans in physical order.
+	 * If key values are equal, we sort on ItemPointer.  This is required
+	 * for btree indexes, since heap TID is treated as an implicit last
+	 * key attribute in order to ensure that all keys in the index are
+	 * physically unique.
 	 */
 	{
 		BlockNumber blk1 = ItemPointerGetBlockNumber(&tuple1->t_tid);
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 3b9461c..0776324 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -122,11 +122,21 @@ typedef struct BTMetaPageData
  *
  * We actually need to be able to fit three items on every page,
  * so restrict any one item to 1/3 the per-page available space.
+ *
+ * There are rare cases where _bt_suffix_truncate() will need to enlarge
+ * a heap index tuple to make space for a tie-breaker heap TID
+ * attribute, which we account for here.
  */
-#define BTMaxItemSize(page) \
+#define BTMaxItemSizeOld(page) \
 	MAXALIGN_DOWN((PageGetPageSize(page) - \
 				   MAXALIGN(SizeOfPageHeaderData + 3*sizeof(ItemIdData)) - \
 				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
+#define BTMaxItemSize(page) \
+	MAXALIGN_DOWN((PageGetPageSize(page) - \
+				   MAXALIGN(SizeOfPageHeaderData + \
+							3*sizeof(ItemIdData)  + \
+							3*MAXALIGN(sizeof(ItemPointerData))) - \
+				   MAXALIGN(sizeof(BTPageOpaqueData))) / 3)
 
 /*
  * The leaf-page fillfactor defaults to 90% but is user-adjustable.
@@ -204,12 +214,10 @@ typedef struct BTMetaPageData
  * real offset (downlinks only need to store a block number).  The offset
  * field only stores the number of attributes when the INDEX_ALT_TID_MASK
  * bit is set (we never assume that pivot tuples must explicitly store the
- * number of attributes, and currently do not bother storing the number of
- * attributes unless indnkeyatts actually differs from indnatts).
- * INDEX_ALT_TID_MASK is only used for pivot tuples at present, though it's
- * possible that it will be used within non-pivot tuples in the future.  Do
- * not assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot
- * tuple.
+ * number of attributes).  INDEX_ALT_TID_MASK is only used for pivot tuples
+ * at present, though it's possible that it will be used within non-pivot
+ * tuples in the future.  Do not assume that a tuple with INDEX_ALT_TID_MASK
+ * set must be a pivot tuple.
  *
  * The 12 least significant offset bits are used to represent the number of
  * attributes in INDEX_ALT_TID_MASK tuples, leaving 4 bits that are reserved
@@ -219,6 +227,8 @@ typedef struct BTMetaPageData
 #define INDEX_ALT_TID_MASK			INDEX_AM_RESERVED_BIT
 #define BT_RESERVED_OFFSET_MASK		0xF000
 #define BT_N_KEYS_OFFSET_MASK		0x0FFF
+/* Reserved to indicate if heap TID is represented in pivot tuple */
+#define BT_HEAP_TID_ATTR			0x1000
 
 /* Get/set downlink block number */
 #define BTreeInnerTupleGetDownLink(itup) \
@@ -241,14 +251,15 @@ typedef struct BTMetaPageData
 	} while(0)
 
 /*
- * Get/set number of attributes within B-tree index tuple. Asserts should be
- * removed when BT_RESERVED_OFFSET_MASK bits will be used.
+ * Get/set number of attributes within B-tree index tuple.
+ *
+ * Note that this does not include an implicit tie-breaker heap-TID
+ * attribute, if any.
  */
 #define BTreeTupleGetNAtts(itup, rel)	\
 	( \
 		(itup)->t_info & INDEX_ALT_TID_MASK ? \
 		( \
-			AssertMacro((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_RESERVED_OFFSET_MASK) == 0), \
 			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
 		) \
 		: \
@@ -257,11 +268,33 @@ typedef struct BTMetaPageData
 #define BTreeTupleSetNAtts(itup, n) \
 	do { \
 		(itup)->t_info |= INDEX_ALT_TID_MASK; \
-		Assert(((n) & BT_RESERVED_OFFSET_MASK) == 0); \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
 	} while(0)
 
 /*
+ * Get/set implicit tie-breaker heap-TID attribute, if any.
+ *
+ * Assumes that any tuple without INDEX_ALT_TID_MASK set has t_tid that
+ * points to heap.
+ */
+#define BTreeTupleGetHeapTID(itup) \
+	( \
+	  (itup)->t_info & INDEX_ALT_TID_MASK && \
+	  (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_HEAP_TID_ATTR) != 0 ? \
+	  ( \
+		(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
+					   MAXALIGN(sizeof(ItemPointerData))) \
+	  ) \
+	  : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &(itup)->t_tid \
+	)
+#define BTreeTupleSetHeapTID(itup) \
+	do { \
+		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+								   ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
+	} while(0)
+
+/*
  *	Operator strategy numbers for B-tree have been moved to access/stratnum.h,
  *	because many places need to use them in ScanKeyInit() calls.
  *
@@ -564,15 +597,17 @@ extern int	_bt_pagedel(Relation rel, Buffer buf);
  * prototypes for functions in nbtsearch.c
  */
 extern BTStack _bt_search(Relation rel,
-		   int keysz, ScanKey scankey, bool nextkey,
+		   int keysz, ScanKey scankey, ItemPointer scantid, bool nextkey,
 		   Buffer *bufP, int access, Snapshot snapshot);
 extern Buffer _bt_moveright(Relation rel, Buffer buf, int keysz,
-			  ScanKey scankey, bool nextkey, bool forupdate, BTStack stack,
-			  int access, Snapshot snapshot);
+			  ScanKey scankey, ItemPointer scantid, bool nextkey,
+			  bool forupdate, BTStack stack, int access, Snapshot snapshot);
 extern OffsetNumber _bt_binsrch(Relation rel, Buffer buf, int keysz,
-			ScanKey scankey, bool nextkey);
+			ScanKey scankey, ItemPointer scantid, bool nextkey);
 extern int32 _bt_compare(Relation rel, int keysz, ScanKey scankey,
-			Page page, OffsetNumber offnum);
+			ItemPointer scantid, Page page, OffsetNumber offnum);
+extern int _bt_leave_natts(Relation rel, Page leftpage,
+						   OffsetNumber lastleftoffnum, IndexTuple firstright);
 extern bool _bt_first(IndexScanDesc scan, ScanDirection dir);
 extern bool _bt_next(IndexScanDesc scan, ScanDirection dir);
 extern Buffer _bt_get_endpoint(Relation rel, uint32 level, bool rightmost,
@@ -605,7 +640,9 @@ extern bytea *btoptions(Datum reloptions, bool validate);
 extern bool btproperty(Oid index_oid, int attno,
 		   IndexAMProperty prop, const char *propname,
 		   bool *res, bool *isnull);
-extern IndexTuple _bt_nonkey_truncate(Relation rel, IndexTuple itup);
+extern IndexTuple _bt_suffix_truncate(Relation rel, Page leftpage,
+									  OffsetNumber lastleftoffnum,
+									  IndexTuple firstright);
 extern bool _bt_check_natts(Relation rel, Page page, OffsetNumber offnum);
 
 /*
diff --git a/src/test/modules/test_extensions/expected/test_extensions.out b/src/test/modules/test_extensions/expected/test_extensions.out
index 28d86c4..29b4ec9 100644
--- a/src/test/modules/test_extensions/expected/test_extensions.out
+++ b/src/test/modules/test_extensions/expected/test_extensions.out
@@ -30,10 +30,10 @@ NOTICE:  installing required extension "test_ext_cyclic2"
 ERROR:  cyclic dependency detected between extensions "test_ext_cyclic1" and "test_ext_cyclic2"
 DROP SCHEMA test_ext CASCADE;
 NOTICE:  drop cascades to 5 other objects
-DETAIL:  drop cascades to extension test_ext3
-drop cascades to extension test_ext5
-drop cascades to extension test_ext2
+DETAIL:  drop cascades to extension test_ext5
 drop cascades to extension test_ext4
+drop cascades to extension test_ext3
+drop cascades to extension test_ext2
 drop cascades to extension test_ext1
 CREATE EXTENSION test_ext6;
 DROP EXTENSION test_ext6;
diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
index f85e913..4f0089a 100644
--- a/src/test/regress/expected/aggregates.out
+++ b/src/test/regress/expected/aggregates.out
@@ -942,9 +942,9 @@ select distinct min(f1), max(f1) from minmaxtest;
 
 drop table minmaxtest cascade;
 NOTICE:  drop cascades to 3 other objects
-DETAIL:  drop cascades to table minmaxtest1
+DETAIL:  drop cascades to table minmaxtest3
 drop cascades to table minmaxtest2
-drop cascades to table minmaxtest3
+drop cascades to table minmaxtest1
 -- check for correct detection of nested-aggregate errors
 select max(min(unique1)) from tenk1;
 ERROR:  aggregate function calls cannot be nested
diff --git a/src/test/regress/expected/alter_table.out b/src/test/regress/expected/alter_table.out
index 702bf9f..c23826f 100644
--- a/src/test/regress/expected/alter_table.out
+++ b/src/test/regress/expected/alter_table.out
@@ -2672,19 +2672,19 @@ select alter2.plus1(41);
 -- clean up
 drop schema alter2 cascade;
 NOTICE:  drop cascades to 13 other objects
-DETAIL:  drop cascades to table alter2.t1
-drop cascades to view alter2.v1
-drop cascades to function alter2.plus1(integer)
-drop cascades to type alter2.posint
-drop cascades to operator family alter2.ctype_hash_ops for access method hash
+DETAIL:  drop cascades to text search template alter2.tmpl
+drop cascades to text search dictionary alter2.dict
+drop cascades to text search parser alter2.prs
+drop cascades to text search configuration alter2.cfg
+drop cascades to conversion alter2.ascii_to_utf8
 drop cascades to type alter2.ctype
 drop cascades to function alter2.same(alter2.ctype,alter2.ctype)
 drop cascades to operator alter2.=(alter2.ctype,alter2.ctype)
-drop cascades to conversion alter2.ascii_to_utf8
-drop cascades to text search parser alter2.prs
-drop cascades to text search configuration alter2.cfg
-drop cascades to text search template alter2.tmpl
-drop cascades to text search dictionary alter2.dict
+drop cascades to operator family alter2.ctype_hash_ops for access method hash
+drop cascades to type alter2.posint
+drop cascades to function alter2.plus1(integer)
+drop cascades to table alter2.t1
+drop cascades to view alter2.v1
 --
 -- composite types
 --
diff --git a/src/test/regress/expected/collate.out b/src/test/regress/expected/collate.out
index fcbe3a5..95c7335 100644
--- a/src/test/regress/expected/collate.out
+++ b/src/test/regress/expected/collate.out
@@ -668,4 +668,4 @@ SELECT collation for ((SELECT b FROM collate_test1 LIMIT 1));
 --
 \set VERBOSITY terse
 DROP SCHEMA collate_tests CASCADE;
-NOTICE:  drop cascades to 17 other objects
+NOTICE:  drop cascades to 20 other objects
diff --git a/src/test/regress/expected/create_type.out b/src/test/regress/expected/create_type.out
index 2f7d5f9..8309756 100644
--- a/src/test/regress/expected/create_type.out
+++ b/src/test/regress/expected/create_type.out
@@ -161,13 +161,13 @@ DROP FUNCTION base_fn_out(opaque); -- error
 ERROR:  function base_fn_out(opaque) does not exist
 DROP TYPE base_type; -- error
 ERROR:  cannot drop type base_type because other objects depend on it
-DETAIL:  function base_fn_out(base_type) depends on type base_type
-function base_fn_in(cstring) depends on type base_type
+DETAIL:  function base_fn_in(cstring) depends on type base_type
+function base_fn_out(base_type) depends on type base_type
 HINT:  Use DROP ... CASCADE to drop the dependent objects too.
 DROP TYPE base_type CASCADE;
 NOTICE:  drop cascades to 2 other objects
-DETAIL:  drop cascades to function base_fn_out(base_type)
-drop cascades to function base_fn_in(cstring)
+DETAIL:  drop cascades to function base_fn_in(cstring)
+drop cascades to function base_fn_out(base_type)
 -- Check usage of typmod with a user-defined type
 -- (we have borrowed numeric's typmod functions)
 CREATE TEMP TABLE mytab (foo widget(42,13,7));     -- should fail
diff --git a/src/test/regress/expected/create_view.out b/src/test/regress/expected/create_view.out
index 141fc6d..8abcd7b 100644
--- a/src/test/regress/expected/create_view.out
+++ b/src/test/regress/expected/create_view.out
@@ -1711,4 +1711,4 @@ select pg_get_ruledef(oid, true) from pg_rewrite
 DROP SCHEMA temp_view_test CASCADE;
 NOTICE:  drop cascades to 27 other objects
 DROP SCHEMA testviewschm2 CASCADE;
-NOTICE:  drop cascades to 62 other objects
+NOTICE:  drop cascades to 63 other objects
diff --git a/src/test/regress/expected/dependency.out b/src/test/regress/expected/dependency.out
index 8e50f8f..8d31110 100644
--- a/src/test/regress/expected/dependency.out
+++ b/src/test/regress/expected/dependency.out
@@ -128,9 +128,9 @@ FROM pg_type JOIN pg_class c ON typrelid = c.oid WHERE typname = 'deptest_t';
 -- doesn't work: grant still exists
 DROP USER regress_dep_user1;
 ERROR:  role "regress_dep_user1" cannot be dropped because some objects depend on it
-DETAIL:  owner of default privileges on new relations belonging to role regress_dep_user1 in schema deptest
+DETAIL:  privileges for table deptest1
 privileges for database regression
-privileges for table deptest1
+owner of default privileges on new relations belonging to role regress_dep_user1 in schema deptest
 DROP OWNED BY regress_dep_user1;
 DROP USER regress_dep_user1;
 \set VERBOSITY terse
diff --git a/src/test/regress/expected/domain.out b/src/test/regress/expected/domain.out
index 0b5a904..49fb047 100644
--- a/src/test/regress/expected/domain.out
+++ b/src/test/regress/expected/domain.out
@@ -645,8 +645,8 @@ alter domain dnotnulltest drop not null;
 update domnotnull set col1 = null;
 drop domain dnotnulltest cascade;
 NOTICE:  drop cascades to 2 other objects
-DETAIL:  drop cascades to column col1 of table domnotnull
-drop cascades to column col2 of table domnotnull
+DETAIL:  drop cascades to column col2 of table domnotnull
+drop cascades to column col1 of table domnotnull
 -- Test ALTER DOMAIN .. DEFAULT ..
 create table domdeftest (col1 ddef1);
 insert into domdeftest default values;
diff --git a/src/test/regress/expected/event_trigger.out b/src/test/regress/expected/event_trigger.out
index 008e859..17514e6 100644
--- a/src/test/regress/expected/event_trigger.out
+++ b/src/test/regress/expected/event_trigger.out
@@ -152,9 +152,9 @@ ERROR:  event trigger "regress_event_trigger" does not exist
 -- should fail, regress_evt_user owns some objects
 drop role regress_evt_user;
 ERROR:  role "regress_evt_user" cannot be dropped because some objects depend on it
-DETAIL:  owner of event trigger regress_event_trigger3
+DETAIL:  owner of user mapping for regress_evt_user on server useless_server
 owner of default privileges on new relations belonging to role regress_evt_user
-owner of user mapping for regress_evt_user on server useless_server
+owner of event trigger regress_event_trigger3
 -- cleanup before next test
 -- these are all OK; the second one should emit a NOTICE
 drop event trigger if exists regress_event_trigger2;
@@ -241,14 +241,13 @@ CREATE EVENT TRIGGER regress_event_trigger_drop_objects ON sql_drop
 ALTER TABLE schema_one.table_one DROP COLUMN a;
 DROP SCHEMA schema_one, schema_two CASCADE;
 NOTICE:  drop cascades to 7 other objects
-DETAIL:  drop cascades to table schema_two.table_two
-drop cascades to table schema_two.table_three
-drop cascades to function schema_two.add(integer,integer)
+DETAIL:  drop cascades to function schema_two.add(integer,integer)
 drop cascades to function schema_two.newton(integer)
-drop cascades to table schema_one.table_one
-drop cascades to table schema_one."table two"
+drop cascades to table schema_two.table_three
+drop cascades to table schema_two.table_two
 drop cascades to table schema_one.table_three
-NOTICE:  table "schema_two_table_two" does not exist, skipping
+drop cascades to table schema_one."table two"
+drop cascades to table schema_one.table_one
 NOTICE:  table "audit_tbls_schema_two_table_three" does not exist, skipping
 ERROR:  object audit_tbls.schema_two_table_three of type table cannot be dropped
 CONTEXT:  PL/pgSQL function undroppable() line 14 at RAISE
@@ -257,61 +256,61 @@ PL/pgSQL function test_evtrig_dropped_objects() line 8 at EXECUTE
 DELETE FROM undroppable_objs WHERE object_identity = 'audit_tbls.schema_two_table_three';
 DROP SCHEMA schema_one, schema_two CASCADE;
 NOTICE:  drop cascades to 7 other objects
-DETAIL:  drop cascades to table schema_two.table_two
-drop cascades to table schema_two.table_three
-drop cascades to function schema_two.add(integer,integer)
+DETAIL:  drop cascades to function schema_two.add(integer,integer)
 drop cascades to function schema_two.newton(integer)
-drop cascades to table schema_one.table_one
-drop cascades to table schema_one."table two"
+drop cascades to table schema_two.table_three
+drop cascades to table schema_two.table_two
 drop cascades to table schema_one.table_three
-NOTICE:  table "schema_two_table_two" does not exist, skipping
+drop cascades to table schema_one."table two"
+drop cascades to table schema_one.table_one
 NOTICE:  table "audit_tbls_schema_two_table_three" does not exist, skipping
-NOTICE:  table "schema_one_table_one" does not exist, skipping
-NOTICE:  table "schema_one_table two" does not exist, skipping
+NOTICE:  table "schema_two_table_two" does not exist, skipping
 NOTICE:  table "schema_one_table_three" does not exist, skipping
+NOTICE:  table "schema_one_table two" does not exist, skipping
+NOTICE:  table "schema_one_table_one" does not exist, skipping
 ERROR:  object schema_one.table_three of type table cannot be dropped
 CONTEXT:  PL/pgSQL function undroppable() line 14 at RAISE
 DELETE FROM undroppable_objs WHERE object_identity = 'schema_one.table_three';
 DROP SCHEMA schema_one, schema_two CASCADE;
 NOTICE:  drop cascades to 7 other objects
-DETAIL:  drop cascades to table schema_two.table_two
-drop cascades to table schema_two.table_three
-drop cascades to function schema_two.add(integer,integer)
+DETAIL:  drop cascades to function schema_two.add(integer,integer)
 drop cascades to function schema_two.newton(integer)
-drop cascades to table schema_one.table_one
-drop cascades to table schema_one."table two"
+drop cascades to table schema_two.table_three
+drop cascades to table schema_two.table_two
 drop cascades to table schema_one.table_three
-NOTICE:  table "schema_two_table_two" does not exist, skipping
+drop cascades to table schema_one."table two"
+drop cascades to table schema_one.table_one
 NOTICE:  table "audit_tbls_schema_two_table_three" does not exist, skipping
-NOTICE:  table "schema_one_table_one" does not exist, skipping
-NOTICE:  table "schema_one_table two" does not exist, skipping
+NOTICE:  table "schema_two_table_two" does not exist, skipping
 NOTICE:  table "schema_one_table_three" does not exist, skipping
+NOTICE:  table "schema_one_table two" does not exist, skipping
+NOTICE:  table "schema_one_table_one" does not exist, skipping
 SELECT * FROM dropped_objects WHERE schema IS NULL OR schema <> 'pg_toast';
      type     |   schema   |               object                
 --------------+------------+-------------------------------------
  table column | schema_one | schema_one.table_one.a
  schema       |            | schema_two
- table        | schema_two | schema_two.table_two
- type         | schema_two | schema_two.table_two
- type         | schema_two | schema_two.table_two[]
+ function     | schema_two | schema_two.add(integer,integer)
+ aggregate    | schema_two | schema_two.newton(integer)
  table        | audit_tbls | audit_tbls.schema_two_table_three
  type         | audit_tbls | audit_tbls.schema_two_table_three
  type         | audit_tbls | audit_tbls.schema_two_table_three[]
  table        | schema_two | schema_two.table_three
  type         | schema_two | schema_two.table_three
  type         | schema_two | schema_two.table_three[]
- function     | schema_two | schema_two.add(integer,integer)
- aggregate    | schema_two | schema_two.newton(integer)
+ table        | schema_two | schema_two.table_two
+ type         | schema_two | schema_two.table_two
+ type         | schema_two | schema_two.table_two[]
  schema       |            | schema_one
- table        | schema_one | schema_one.table_one
- type         | schema_one | schema_one.table_one
- type         | schema_one | schema_one.table_one[]
- table        | schema_one | schema_one."table two"
- type         | schema_one | schema_one."table two"
- type         | schema_one | schema_one."table two"[]
  table        | schema_one | schema_one.table_three
  type         | schema_one | schema_one.table_three
  type         | schema_one | schema_one.table_three[]
+ table        | schema_one | schema_one."table two"
+ type         | schema_one | schema_one."table two"
+ type         | schema_one | schema_one."table two"[]
+ table        | schema_one | schema_one.table_one
+ type         | schema_one | schema_one.table_one
+ type         | schema_one | schema_one.table_one[]
 (23 rows)
 
 DROP OWNED BY regress_evt_user;
@@ -360,13 +359,13 @@ DROP INDEX evttrig.one_idx;
 NOTICE:  NORMAL: orig=t normal=f istemp=f type=index identity=evttrig.one_idx name={evttrig,one_idx} args={}
 DROP SCHEMA evttrig CASCADE;
 NOTICE:  drop cascades to 2 other objects
-DETAIL:  drop cascades to table evttrig.one
-drop cascades to table evttrig.two
+DETAIL:  drop cascades to table evttrig.two
+drop cascades to table evttrig.one
 NOTICE:  NORMAL: orig=t normal=f istemp=f type=schema identity=evttrig name={evttrig} args={}
+NOTICE:  NORMAL: orig=f normal=t istemp=f type=table identity=evttrig.two name={evttrig,two} args={}
 NOTICE:  NORMAL: orig=f normal=t istemp=f type=table identity=evttrig.one name={evttrig,one} args={}
 NOTICE:  NORMAL: orig=f normal=t istemp=f type=sequence identity=evttrig.one_col_a_seq name={evttrig,one_col_a_seq} args={}
 NOTICE:  NORMAL: orig=f normal=t istemp=f type=default value identity=for evttrig.one.col_a name={evttrig,one,col_a} args={}
-NOTICE:  NORMAL: orig=f normal=t istemp=f type=table identity=evttrig.two name={evttrig,two} args={}
 DROP TABLE a_temp_tbl;
 NOTICE:  NORMAL: orig=t normal=f istemp=t type=table identity=pg_temp.a_temp_tbl name={pg_temp,a_temp_tbl} args={}
 DROP EVENT TRIGGER regress_event_trigger_report_dropped;
diff --git a/src/test/regress/expected/foreign_data.out b/src/test/regress/expected/foreign_data.out
index 339a43f..b5c7801 100644
--- a/src/test/regress/expected/foreign_data.out
+++ b/src/test/regress/expected/foreign_data.out
@@ -441,8 +441,8 @@ ALTER SERVER s1 OWNER TO regress_test_indirect;
 RESET ROLE;
 DROP ROLE regress_test_indirect;                            -- ERROR
 ERROR:  role "regress_test_indirect" cannot be dropped because some objects depend on it
-DETAIL:  owner of server s1
-privileges for foreign-data wrapper foo
+DETAIL:  privileges for foreign-data wrapper foo
+owner of server s1
 \des+
                                                                                  List of foreign servers
  Name |           Owner           | Foreign-data wrapper |                   Access privileges                   |  Type  | Version |             FDW options              | Description 
@@ -2049,9 +2049,9 @@ DROP TABLE fd_pt2;
 DROP SCHEMA foreign_schema CASCADE;
 DROP ROLE regress_test_role;                                -- ERROR
 ERROR:  role "regress_test_role" cannot be dropped because some objects depend on it
-DETAIL:  privileges for server s4
+DETAIL:  owner of user mapping for regress_test_role on server s6
 privileges for foreign-data wrapper foo
-owner of user mapping for regress_test_role on server s6
+privileges for server s4
 DROP SERVER t1 CASCADE;
 NOTICE:  drop cascades to user mapping for public on server t1
 DROP USER MAPPING FOR regress_test_role SERVER s6;
diff --git a/src/test/regress/expected/foreign_key.out b/src/test/regress/expected/foreign_key.out
index b90c492..368cc45 100644
--- a/src/test/regress/expected/foreign_key.out
+++ b/src/test/regress/expected/foreign_key.out
@@ -255,7 +255,7 @@ SELECT * FROM FKTABLE;
 -- this should fail for lack of CASCADE
 DROP TABLE PKTABLE;
 ERROR:  cannot drop table pktable because other objects depend on it
-DETAIL:  constraint constrname2 on table fktable depends on table pktable
+DETAIL:  constraint constrname2 on table fktable depends on index pktable_pkey
 HINT:  Use DROP ... CASCADE to drop the dependent objects too.
 DROP TABLE PKTABLE CASCADE;
 NOTICE:  drop cascades to constraint constrname2 on table fktable
diff --git a/src/test/regress/expected/indexing.out b/src/test/regress/expected/indexing.out
index 2c2bf44..b3bddc0 100644
--- a/src/test/regress/expected/indexing.out
+++ b/src/test/regress/expected/indexing.out
@@ -101,8 +101,8 @@ create table idxpart (a int) partition by range (a);
 create index on idxpart (a);
 create table idxpart1 partition of idxpart for values from (0) to (10);
 drop index idxpart1_a_idx;	-- no way
-ERROR:  cannot drop index idxpart1_a_idx because index idxpart_a_idx requires it
-HINT:  You can drop index idxpart_a_idx instead.
+ERROR:  cannot drop index idxpart1_a_idx because column a of table idxpart1 requires it
+HINT:  You can drop column a of table idxpart1 instead.
 drop index idxpart_a_idx;	-- both indexes go away
 select relname, relkind from pg_class
   where relname like 'idxpart%' order by relname;
@@ -931,11 +931,11 @@ select indrelid::regclass, indexrelid::regclass, inhparent::regclass, indisvalid
 (3 rows)
 
 drop index idxpart0_pkey;								-- fail
-ERROR:  cannot drop index idxpart0_pkey because index idxpart_pkey requires it
-HINT:  You can drop index idxpart_pkey instead.
+ERROR:  cannot drop index idxpart0_pkey because constraint idxpart0_pkey on table idxpart0 requires it
+HINT:  You can drop constraint idxpart0_pkey on table idxpart0 instead.
 drop index idxpart1_pkey;								-- fail
-ERROR:  cannot drop index idxpart1_pkey because index idxpart_pkey requires it
-HINT:  You can drop index idxpart_pkey instead.
+ERROR:  cannot drop index idxpart1_pkey because constraint idxpart1_pkey on table idxpart1 requires it
+HINT:  You can drop constraint idxpart1_pkey on table idxpart1 instead.
 alter table idxpart0 drop constraint idxpart0_pkey;		-- fail
 ERROR:  cannot drop inherited constraint "idxpart0_pkey" of relation "idxpart0"
 alter table idxpart1 drop constraint idxpart1_pkey;		-- fail
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index b2b912e..3922723 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -952,8 +952,8 @@ Inherits: c1,
 
 drop table p1 cascade;
 NOTICE:  drop cascades to 3 other objects
-DETAIL:  drop cascades to table c1
-drop cascades to table c2
+DETAIL:  drop cascades to table c2
+drop cascades to table c1
 drop cascades to table c3
 drop table p2 cascade;
 create table pp1 (f1 int);
@@ -1098,9 +1098,9 @@ SELECT a.attrelid::regclass, a.attname, a.attinhcount, e.expected
 
 DROP TABLE inht1, inhs1 CASCADE;
 NOTICE:  drop cascades to 4 other objects
-DETAIL:  drop cascades to table inht2
+DETAIL:  drop cascades to table inht3
+drop cascades to table inht2
 drop cascades to table inhts
-drop cascades to table inht3
 drop cascades to table inht4
 -- Test non-inheritable indices [UNIQUE, EXCLUDE] constraints
 CREATE TABLE test_constraints (id int, val1 varchar, val2 int, UNIQUE(val1, val2));
@@ -1352,8 +1352,8 @@ select * from patest0 join (select f1 from int4_tbl limit 1) ss on id = f1;
 
 drop table patest0 cascade;
 NOTICE:  drop cascades to 2 other objects
-DETAIL:  drop cascades to table patest1
-drop cascades to table patest2
+DETAIL:  drop cascades to table patest2
+drop cascades to table patest1
 --
 -- Test merge-append plans for inheritance trees
 --
@@ -1499,9 +1499,9 @@ reset enable_seqscan;
 reset enable_parallel_append;
 drop table matest0 cascade;
 NOTICE:  drop cascades to 3 other objects
-DETAIL:  drop cascades to table matest1
+DETAIL:  drop cascades to table matest3
 drop cascades to table matest2
-drop cascades to table matest3
+drop cascades to table matest1
 --
 -- Check that use of an index with an extraneous column doesn't produce
 -- a plan with extraneous sorting
diff --git a/src/test/regress/expected/matview.out b/src/test/regress/expected/matview.out
index 08cd4be..dc78784 100644
--- a/src/test/regress/expected/matview.out
+++ b/src/test/regress/expected/matview.out
@@ -310,15 +310,15 @@ SELECT type, m.totamt AS mtot, v.totamt AS vtot FROM mvtest_tm m LEFT JOIN mvtes
 -- make sure that dependencies are reported properly when they block the drop
 DROP TABLE mvtest_t;
 ERROR:  cannot drop table mvtest_t because other objects depend on it
-DETAIL:  view mvtest_tv depends on table mvtest_t
+DETAIL:  materialized view mvtest_tm depends on table mvtest_t
+materialized view mvtest_tmm depends on materialized view mvtest_tm
+view mvtest_tv depends on table mvtest_t
 view mvtest_tvv depends on view mvtest_tv
 materialized view mvtest_tvvm depends on view mvtest_tvv
 view mvtest_tvvmv depends on materialized view mvtest_tvvm
 materialized view mvtest_bb depends on view mvtest_tvvmv
 materialized view mvtest_mvschema.mvtest_tvm depends on view mvtest_tv
 materialized view mvtest_tvmm depends on materialized view mvtest_mvschema.mvtest_tvm
-materialized view mvtest_tm depends on table mvtest_t
-materialized view mvtest_tmm depends on materialized view mvtest_tm
 HINT:  Use DROP ... CASCADE to drop the dependent objects too.
 -- make sure dependencies are dropped and reported
 -- and make sure that transactional behavior is correct on rollback
@@ -326,15 +326,15 @@ HINT:  Use DROP ... CASCADE to drop the dependent objects too.
 BEGIN;
 DROP TABLE mvtest_t CASCADE;
 NOTICE:  drop cascades to 9 other objects
-DETAIL:  drop cascades to view mvtest_tv
+DETAIL:  drop cascades to materialized view mvtest_tm
+drop cascades to materialized view mvtest_tmm
+drop cascades to view mvtest_tv
 drop cascades to view mvtest_tvv
 drop cascades to materialized view mvtest_tvvm
 drop cascades to view mvtest_tvvmv
 drop cascades to materialized view mvtest_bb
 drop cascades to materialized view mvtest_mvschema.mvtest_tvm
 drop cascades to materialized view mvtest_tvmm
-drop cascades to materialized view mvtest_tm
-drop cascades to materialized view mvtest_tmm
 ROLLBACK;
 -- some additional tests not using base tables
 CREATE VIEW mvtest_vt1 AS SELECT 1 moo;
@@ -484,10 +484,10 @@ SELECT * FROM mvtest_mv_v_4;
 
 DROP TABLE mvtest_v CASCADE;
 NOTICE:  drop cascades to 4 other objects
-DETAIL:  drop cascades to materialized view mvtest_mv_v
-drop cascades to materialized view mvtest_mv_v_2
+DETAIL:  drop cascades to materialized view mvtest_mv_v_4
 drop cascades to materialized view mvtest_mv_v_3
-drop cascades to materialized view mvtest_mv_v_4
+drop cascades to materialized view mvtest_mv_v_2
+drop cascades to materialized view mvtest_mv_v
 -- Check that unknown literals are converted to "text" in CREATE MATVIEW,
 -- so that we don't end up with unknown-type columns.
 CREATE MATERIALIZED VIEW mv_unspecified_types AS
diff --git a/src/test/regress/expected/rowsecurity.out b/src/test/regress/expected/rowsecurity.out
index f1ae40d..8d99213 100644
--- a/src/test/regress/expected/rowsecurity.out
+++ b/src/test/regress/expected/rowsecurity.out
@@ -3502,8 +3502,8 @@ SELECT refclassid::regclass, deptype
 SAVEPOINT q;
 DROP ROLE regress_rls_eve; --fails due to dependency on POLICY p
 ERROR:  role "regress_rls_eve" cannot be dropped because some objects depend on it
-DETAIL:  target of policy p on table tbl1
-privileges for table tbl1
+DETAIL:  privileges for table tbl1
+target of policy p on table tbl1
 ROLLBACK TO q;
 ALTER POLICY p ON tbl1 TO regress_rls_frank USING (true);
 SAVEPOINT q;
diff --git a/src/test/regress/expected/select_into.out b/src/test/regress/expected/select_into.out
index 942f975..30e10b1 100644
--- a/src/test/regress/expected/select_into.out
+++ b/src/test/regress/expected/select_into.out
@@ -46,9 +46,9 @@ CREATE TABLE selinto_schema.tmp3 (a,b,c)
 RESET SESSION AUTHORIZATION;
 DROP SCHEMA selinto_schema CASCADE;
 NOTICE:  drop cascades to 3 other objects
-DETAIL:  drop cascades to table selinto_schema.tmp1
+DETAIL:  drop cascades to table selinto_schema.tmp3
 drop cascades to table selinto_schema.tmp2
-drop cascades to table selinto_schema.tmp3
+drop cascades to table selinto_schema.tmp1
 DROP USER regress_selinto_user;
 -- Tests for WITH NO DATA and column name consistency
 CREATE TABLE ctas_base (i int, j int);
diff --git a/src/test/regress/expected/triggers.out b/src/test/regress/expected/triggers.out
index bf271d5..ac55ac5 100644
--- a/src/test/regress/expected/triggers.out
+++ b/src/test/regress/expected/triggers.out
@@ -558,9 +558,9 @@ LINE 2: FOR EACH STATEMENT WHEN (OLD.* IS DISTINCT FROM NEW.*)
 -- check dependency restrictions
 ALTER TABLE main_table DROP COLUMN b;
 ERROR:  cannot drop column b of table main_table because other objects depend on it
-DETAIL:  trigger after_upd_b_row_trig on table main_table depends on column b of table main_table
+DETAIL:  trigger after_upd_b_stmt_trig on table main_table depends on column b of table main_table
 trigger after_upd_a_b_row_trig on table main_table depends on column b of table main_table
-trigger after_upd_b_stmt_trig on table main_table depends on column b of table main_table
+trigger after_upd_b_row_trig on table main_table depends on column b of table main_table
 HINT:  Use DROP ... CASCADE to drop the dependent objects too.
 -- this should succeed, but we'll roll it back to keep the triggers around
 begin;
@@ -1886,14 +1886,14 @@ select tgrelid::regclass, tgname, tgfoid::regproc from pg_trigger
 (4 rows)
 
 drop trigger trg1 on trigpart1;	-- fail
-ERROR:  cannot drop trigger trg1 on table trigpart1 because trigger trg1 on table trigpart requires it
-HINT:  You can drop trigger trg1 on table trigpart instead.
+ERROR:  cannot drop trigger trg1 on table trigpart1 because table trigpart1 requires it
+HINT:  You can drop table trigpart1 instead.
 drop trigger trg1 on trigpart2;	-- fail
-ERROR:  cannot drop trigger trg1 on table trigpart2 because trigger trg1 on table trigpart requires it
-HINT:  You can drop trigger trg1 on table trigpart instead.
+ERROR:  cannot drop trigger trg1 on table trigpart2 because table trigpart2 requires it
+HINT:  You can drop table trigpart2 instead.
 drop trigger trg1 on trigpart3;	-- fail
-ERROR:  cannot drop trigger trg1 on table trigpart3 because trigger trg1 on table trigpart requires it
-HINT:  You can drop trigger trg1 on table trigpart instead.
+ERROR:  cannot drop trigger trg1 on table trigpart3 because table trigpart3 requires it
+HINT:  You can drop table trigpart3 instead.
 drop table trigpart2;			-- ok, trigger should be gone in that partition
 select tgrelid::regclass, tgname, tgfoid::regproc from pg_trigger
   where tgrelid::regclass::text like 'trigpart%' order by tgrelid::regclass::text;
diff --git a/src/test/regress/expected/truncate.out b/src/test/regress/expected/truncate.out
index 735d0e8..e81fa8e 100644
--- a/src/test/regress/expected/truncate.out
+++ b/src/test/regress/expected/truncate.out
@@ -278,9 +278,9 @@ SELECT * FROM trunc_faa;
 ROLLBACK;
 DROP TABLE trunc_f CASCADE;
 NOTICE:  drop cascades to 3 other objects
-DETAIL:  drop cascades to table trunc_fa
+DETAIL:  drop cascades to table trunc_fb
+drop cascades to table trunc_fa
 drop cascades to table trunc_faa
-drop cascades to table trunc_fb
 -- Test ON TRUNCATE triggers
 CREATE TABLE trunc_trigger_test (f1 int, f2 text, f3 text);
 CREATE TABLE trunc_trigger_log (tgop text, tglevel text, tgwhen text,
diff --git a/src/test/regress/expected/typed_table.out b/src/test/regress/expected/typed_table.out
index 2e47ecb..5834bec 100644
--- a/src/test/regress/expected/typed_table.out
+++ b/src/test/regress/expected/typed_table.out
@@ -77,17 +77,17 @@ CREATE TABLE persons4 OF person_type (
 ERROR:  column "name" specified more than once
 DROP TYPE person_type RESTRICT;
 ERROR:  cannot drop type person_type because other objects depend on it
-DETAIL:  table persons depends on type person_type
-function get_all_persons() depends on type person_type
+DETAIL:  table persons3 depends on type person_type
 table persons2 depends on type person_type
-table persons3 depends on type person_type
+function get_all_persons() depends on type person_type
+table persons depends on type person_type
 HINT:  Use DROP ... CASCADE to drop the dependent objects too.
 DROP TYPE person_type CASCADE;
 NOTICE:  drop cascades to 4 other objects
-DETAIL:  drop cascades to table persons
-drop cascades to function get_all_persons()
+DETAIL:  drop cascades to table persons3
 drop cascades to table persons2
-drop cascades to table persons3
+drop cascades to function get_all_persons()
+drop cascades to table persons
 CREATE TABLE persons5 OF stuff; -- only CREATE TYPE AS types may be used
 ERROR:  type stuff is not a composite type
 DROP TABLE stuff;
diff --git a/src/test/regress/expected/updatable_views.out b/src/test/regress/expected/updatable_views.out
index b34bab4..1a7b3b4 100644
--- a/src/test/regress/expected/updatable_views.out
+++ b/src/test/regress/expected/updatable_views.out
@@ -328,24 +328,10 @@ UPDATE ro_view20 SET b=upper(b);
 ERROR:  cannot update view "ro_view20"
 DETAIL:  Views that return set-returning functions are not automatically updatable.
 HINT:  To enable updating the view, provide an INSTEAD OF UPDATE trigger or an unconditional ON UPDATE DO INSTEAD rule.
+\set VERBOSITY terse
 DROP TABLE base_tbl CASCADE;
 NOTICE:  drop cascades to 16 other objects
-DETAIL:  drop cascades to view ro_view1
-drop cascades to view ro_view17
-drop cascades to view ro_view2
-drop cascades to view ro_view3
-drop cascades to view ro_view5
-drop cascades to view ro_view6
-drop cascades to view ro_view7
-drop cascades to view ro_view8
-drop cascades to view ro_view9
-drop cascades to view ro_view11
-drop cascades to view ro_view13
-drop cascades to view rw_view15
-drop cascades to view rw_view16
-drop cascades to view ro_view20
-drop cascades to view ro_view4
-drop cascades to view rw_view14
+\set VERBOSITY default
 DROP VIEW ro_view10, ro_view12, ro_view18;
 DROP SEQUENCE uv_seq CASCADE;
 NOTICE:  drop cascades to view ro_view19
@@ -1056,8 +1042,8 @@ SELECT * FROM base_tbl;
 RESET SESSION AUTHORIZATION;
 DROP TABLE base_tbl CASCADE;
 NOTICE:  drop cascades to 2 other objects
-DETAIL:  drop cascades to view rw_view1
-drop cascades to view rw_view2
+DETAIL:  drop cascades to view rw_view2
+drop cascades to view rw_view1
 -- nested-view permissions
 CREATE TABLE base_tbl(a int, b text, c float);
 INSERT INTO base_tbl VALUES (1, 'Row 1', 1.0);
@@ -1442,8 +1428,8 @@ SELECT events & 4 != 0 AS upd,
 DROP TABLE base_tbl CASCADE;
 NOTICE:  drop cascades to 3 other objects
 DETAIL:  drop cascades to view rw_view1
-drop cascades to view rw_view2
 drop cascades to view rw_view3
+drop cascades to view rw_view2
 -- inheritance tests
 CREATE TABLE base_tbl_parent (a int);
 CREATE TABLE base_tbl_child (CHECK (a > 0)) INHERITS (base_tbl_parent);
@@ -1542,8 +1528,8 @@ SELECT * FROM base_tbl_child ORDER BY a;
 
 DROP TABLE base_tbl_parent, base_tbl_child CASCADE;
 NOTICE:  drop cascades to 2 other objects
-DETAIL:  drop cascades to view rw_view1
-drop cascades to view rw_view2
+DETAIL:  drop cascades to view rw_view2
+drop cascades to view rw_view1
 -- simple WITH CHECK OPTION
 CREATE TABLE base_tbl (a int, b int DEFAULT 10);
 INSERT INTO base_tbl VALUES (1,2), (2,3), (1,-1);
diff --git a/src/test/regress/output/tablespace.source b/src/test/regress/output/tablespace.source
index 2443511..ddeaa72 100644
--- a/src/test/regress/output/tablespace.source
+++ b/src/test/regress/output/tablespace.source
@@ -242,10 +242,10 @@ NOTICE:  no matching relations in tablespace "regress_tblspace_renamed" found
 DROP TABLESPACE regress_tblspace_renamed;
 DROP SCHEMA testschema CASCADE;
 NOTICE:  drop cascades to 5 other objects
-DETAIL:  drop cascades to table testschema.foo
-drop cascades to table testschema.asselect
-drop cascades to table testschema.asexecute
+DETAIL:  drop cascades to table testschema.tablespace_acl
 drop cascades to table testschema.atable
-drop cascades to table testschema.tablespace_acl
+drop cascades to table testschema.asexecute
+drop cascades to table testschema.asselect
+drop cascades to table testschema.foo
 DROP ROLE regress_tablespace_user1;
 DROP ROLE regress_tablespace_user2;
diff --git a/src/test/regress/sql/updatable_views.sql b/src/test/regress/sql/updatable_views.sql
index a7786b2..deb7654 100644
--- a/src/test/regress/sql/updatable_views.sql
+++ b/src/test/regress/sql/updatable_views.sql
@@ -98,7 +98,9 @@ DELETE FROM ro_view18;
 UPDATE ro_view19 SET last_value=1000;
 UPDATE ro_view20 SET b=upper(b);
 
+\set VERBOSITY terse
 DROP TABLE base_tbl CASCADE;
+\set VERBOSITY default
 DROP VIEW ro_view10, ro_view12, ro_view18;
 DROP SEQUENCE uv_seq CASCADE;
 
-- 
2.7.4

>From 0eca6a610245a41e36c70dd18fb11df779c28a5e Mon Sep 17 00:00:00 2001
From: "Andrey V. Lepikhov" <a.lepik...@postgrespro.ru>
Date: Mon, 2 Jul 2018 16:27:53 +0500
Subject: [PATCH 4/4] Retail IndexTuple Deletion with TID sorting in leaf
 nbtree pages

---
 src/backend/access/nbtree/nbtree.c | 37 +++++++++++++++++++------------------
 1 file changed, 19 insertions(+), 18 deletions(-)

diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index c54aeac..041ce6f 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -887,15 +887,6 @@ btbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 	return stats;
 }
 
-static int
-tid_list_search(ItemPointer tid, ItemPointer tid_list, int ntid)
-{
-	for (int i = 0; i < ntid; i++)
-		if (ItemPointerEquals(tid, &(tid_list[i])))
-			return i;
-	return -1;
-}
-
 IndexTargetDeleteResult*
 bttargetdelete(IndexTargetDeleteInfo *info,
 			   IndexTargetDeleteResult *stats,
@@ -914,6 +905,7 @@ bttargetdelete(IndexTargetDeleteInfo *info,
 	int				ndeletable = 0;
 	OffsetNumber	deletable[MaxOffsetNumber];
 	IndexTuple		itup;
+	int				pos = 0;
 
 	if (stats == NULL)
 		stats = (IndexTargetDeleteResult *) palloc0(sizeof(IndexTargetDeleteResult));
@@ -922,12 +914,12 @@ bttargetdelete(IndexTargetDeleteInfo *info,
 	skey = _bt_mkscankey(irel, itup);
 
 	/* Descend the tree and position ourselves on the target leaf page. */
-	stack = _bt_search(irel, keysCount, skey, false, &buf, BT_READ, NULL);
+	stack = _bt_search(irel, keysCount, skey, NULL, false, &buf, BT_READ, NULL);
 	_bt_freestack(stack);
 
 	/* To prepare tuple entries search across index pages */
 	Assert(BufferIsValid(buf));
-	offnum = _bt_binsrch(irel, buf, keysCount, skey, false);
+	offnum = _bt_binsrch(irel, buf, keysCount, skey, NULL, false);
 	page = BufferGetPage(buf);
 	_bt_checkpage(irel, buf);
 	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -937,7 +929,6 @@ bttargetdelete(IndexTargetDeleteInfo *info,
 		int32		cmpval;
 		ItemId		itemid;
 		IndexTuple	itup;
-		int			pos;
 
 		/* Switch to the next page */
 		if (P_IGNORE(opaque) || (offnum > PageGetMaxOffsetNumber(page)))
@@ -978,7 +969,7 @@ bttargetdelete(IndexTargetDeleteInfo *info,
 		/*
 		 * This index entry satisfied to the scan key?
 		 */
-		cmpval = _bt_compare(irel, keysCount, skey, page, offnum);
+		cmpval = _bt_compare(irel, keysCount, skey, NULL, page, offnum);
 
 		if (cmpval != 0)
 			/* End of index entries, satisfied to the scan key */
@@ -990,13 +981,23 @@ bttargetdelete(IndexTargetDeleteInfo *info,
 		 */
 		itemid = PageGetItemId(page, offnum);
 		itup = (IndexTuple) PageGetItem(page, itemid);
-		pos = tid_list_search(&(itup->t_tid), info->dead_tuples, info->num_dead_tuples);
 
-		if ((pos >= 0) && (!info->found_dead_tuples[pos]))
+		/*
+		 * Search for next TID from presorted btree result comparable
+		 * to TID from presorted dead_tuples tid list
+		 */
+		while (pos < info->num_dead_tuples)
 		{
-			/* index entry for TID of dead tuple is found */
-			deletable[ndeletable++] = offnum;
-			info->found_dead_tuples[pos] = true;
+			int res = ItemPointerCompare(&(itup->t_tid), &info->dead_tuples[pos]);
+			if ((res == 0) && (!info->found_dead_tuples[pos]))
+			{
+				/* index entry for TID of dead tuple is found */
+				deletable[ndeletable++] = offnum;
+				info->found_dead_tuples[pos] = true;
+			}
+			else if (res < 0)
+				break;
+			pos++;
 		}
 
 		offnum = OffsetNumberNext(offnum);
-- 
2.7.4

Re: [WIP] [B-Tree] Retail IndexTuple deletion

Reply via email to