Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

Anastasia Lubennikova Mon, 16 Sep 2019 08:49:23 -0700

13.09.2019 4:04, Peter Geoghegan wrote:

On Wed, Sep 11, 2019 at 2:04 PM Peter Geoghegan <p...@bowt.ie> wrote:

I think that the new WAL record has to be created once per posting
list that is generated, not once per page that is deduplicated --
that's the only way that I can see that avoids a huge increase in
total WAL volume. Even if we assume that I am wrong about there being
value in making deduplication incremental, it is still necessary to
make the WAL-logging behave incrementally.

It would be good to hear your thoughts on this _bt_dedup_one_page()
WAL volume/"write amplification" issue.

Attached is v14 based on v12 (v13 changes are not merged).

In this version, I fixed the bug you mentioned and also fixed nbtinsert,
so that it doesn't save newposting in xlog record anymore.

I tested patch with nbtree_wal_test, and found out that the real issue is

not the dedup WAL records themselves, but the full page writes that theytrigger.Here are test results (config is standard, except fsync=off to speeduptests):


'FPW on' and 'FPW off' are tests on v14.
NO_IMAGE is the test on v14 with REGBUF_NO_IMAGE in bt_dedup_one_page().

+-------------------+-----------+-----------+----------------+-----------+

|        ---        |   FPW on  |  FPW off  | FORCE_NO_IMAGE |   master  |

+-------------------+-----------+-----------+----------------+-----------+

| time              | 09:12 min | 06:56 min | 06:24 min      | 08:10 min |

| nbtree_wal_volume | 8083 MB   | 2128 MB   | 2327 MB        | 2439 MB   |

| index_size        | 169 MB    | 169 MB    | 169 MB         | 1118 MB   |

+-------------------+-----------+-----------+----------------+-----------+

With random insertions into btree it's highly possible thatdeduplication will often bethe first write after checkpoint, and thus will trigger FPW, even ifonly a few tuples were compressed.That's why there is no significant difference with log_newpage_buffer()approach.

And that's why "lazy" deduplication doesn't help to decrease amount of WAL.

Also, since the index is packed way better than before, it probablybenefits less of wal_compression.

One possible "fix" to decrease WAL amplification is to addREGBUF_NO_IMAGE flag to XLogRegisterBuffer in bt_dedup_one_page().As you can see from test result, it really eliminates the problem ofinadequate WAL amount.

However, I doubt that it is a crash-safe idea.

Another, and more realistic approach is to make deduplication lessintensive:if freed space is less than some threshold, fall back to not changingpage at all and not generating xlog record.

Probably that was the reason, why patch became faster after I addedBT_COMPRESS_THRESHOLD in early versions,not because deduplication itself is cpu bound or something, but becauseWAL load decreased.


So I propose to develop this idea. The question is how to choose threshold.
I wouldn't like to introduce new user settings.  Any ideas?

I also noticed that the number of checkpoints differ between tests:
select checkpoints_req from pg_stat_bgwriter ;

+-----------------+---------+---------+----------------+--------+

|       ---       |  FPW on | FPW off | FORCE_NO_IMAGE | master |

+-----------------+---------+---------+----------------+--------+

| checkpoints_req |      16 |       7 |              8 |     10 |

+-----------------+---------+---------+----------------+--------+

And I struggle to explain the reason of this.
Do you understand what can cause the difference?

--
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/contrib/amcheck/verify_nbtree.c b/contrib/amcheck/verify_nbtree.c
index 05e7d67..399743d 100644
--- a/contrib/amcheck/verify_nbtree.c
+++ b/contrib/amcheck/verify_nbtree.c
@@ -924,6 +924,7 @@ bt_target_page_check(BtreeCheckState *state)
 		size_t		tupsize;
 		BTScanInsert skey;
 		bool		lowersizelimit;
+		ItemPointer	scantid;
 
 		CHECK_FOR_INTERRUPTS();
 
@@ -994,29 +995,73 @@ bt_target_page_check(BtreeCheckState *state)
 
 		/*
 		 * Readonly callers may optionally verify that non-pivot tuples can
-		 * each be found by an independent search that starts from the root
+		 * each be found by an independent search that starts from the root.
+		 * Note that we deliberately don't do individual searches for each
+		 * "logical" posting list tuple, since the posting list itself is
+		 * validated by other checks.
 		 */
 		if (state->rootdescend && P_ISLEAF(topaque) &&
 			!bt_rootdescend(state, itup))
 		{
 			char	   *itid,
 					   *htid;
+			ItemPointer tid = BTreeTupleGetHeapTID(itup);
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumber(&(itup->t_tid)),
-							ItemPointerGetOffsetNumber(&(itup->t_tid)));
+							ItemPointerGetBlockNumber(tid),
+							ItemPointerGetOffsetNumber(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
 					 errmsg("could not find tuple using search from root page in index \"%s\"",
 							RelationGetRelationName(state->rel)),
-					 errdetail_internal("Index tid=%s points to heap tid=%s page lsn=%X/%X.",
+					 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
 										itid, htid,
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
 
+		/*
+		 * If tuple is actually a posting list, make sure posting list TIDs
+		 * are in order.
+		 */
+		if (BTreeTupleIsPosting(itup))
+		{
+			ItemPointerData last;
+			ItemPointer		current;
+
+			ItemPointerCopy(BTreeTupleGetHeapTID(itup), &last);
+
+			for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+			{
+
+				current = BTreeTupleGetPostingN(itup, i);
+
+				if (ItemPointerCompare(current, &last) <= 0)
+				{
+					char	   *itid,
+							   *htid;
+
+					itid = psprintf("(%u,%u)", state->targetblock, offset);
+					htid = psprintf("(%u,%u)",
+									ItemPointerGetBlockNumberNoCheck(current),
+									ItemPointerGetOffsetNumberNoCheck(current));
+
+					ereport(ERROR,
+							(errcode(ERRCODE_INDEX_CORRUPTED),
+							 errmsg("posting list heap TIDs out of order in index \"%s\"",
+									RelationGetRelationName(state->rel)),
+							 errdetail_internal("Index tid=%s min heap tid=%s page lsn=%X/%X.",
+												itid, htid,
+												(uint32) (state->targetlsn >> 32),
+												(uint32) state->targetlsn)));
+				}
+
+				ItemPointerCopy(current, &last);
+			}
+		}
+
 		/* Build insertion scankey for current page offset */
 		skey = bt_mkscankey_pivotsearch(state->rel, itup);
 
@@ -1074,12 +1119,33 @@ bt_target_page_check(BtreeCheckState *state)
 		{
 			IndexTuple	norm;
 
-			norm = bt_normalize_tuple(state, itup);
-			bloom_add_element(state->filter, (unsigned char *) norm,
-							  IndexTupleSize(norm));
-			/* Be tidy */
-			if (norm != itup)
-				pfree(norm);
+			if (BTreeTupleIsPosting(itup))
+			{
+				IndexTuple	onetup;
+
+				/* Fingerprint all elements of posting tuple one by one */
+				for (int i = 0; i < BTreeTupleGetNPosting(itup); i++)
+				{
+					onetup = BTreeGetNthTupleOfPosting(itup, i);
+
+					norm = bt_normalize_tuple(state, onetup);
+					bloom_add_element(state->filter, (unsigned char *) norm,
+									  IndexTupleSize(norm));
+					/* Be tidy */
+					if (norm != onetup)
+						pfree(norm);
+					pfree(onetup);
+				}
+			}
+			else
+			{
+				norm = bt_normalize_tuple(state, itup);
+				bloom_add_element(state->filter, (unsigned char *) norm,
+								  IndexTupleSize(norm));
+				/* Be tidy */
+				if (norm != itup)
+					pfree(norm);
+			}
 		}
 
 		/*
@@ -1087,7 +1153,8 @@ bt_target_page_check(BtreeCheckState *state)
 		 *
 		 * If there is a high key (if this is not the rightmost page on its
 		 * entire level), check that high key actually is upper bound on all
-		 * page items.
+		 * page items.  If this is a posting list tuple, we'll need to set
+		 * scantid to be highest TID in posting list.
 		 *
 		 * We prefer to check all items against high key rather than checking
 		 * just the last and trusting that the operator class obeys the
@@ -1127,6 +1194,9 @@ bt_target_page_check(BtreeCheckState *state)
 		 * tuple. (See also: "Notes About Data Representation" in the nbtree
 		 * README.)
 		 */
+		scantid = skey->scantid;
+		if (state->heapkeyspace && !BTreeTupleIsPivot(itup))
+			skey->scantid = BTreeTupleGetMaxTID(itup);
 		if (!P_RIGHTMOST(topaque) &&
 			!(P_ISLEAF(topaque) ? invariant_leq_offset(state, skey, P_HIKEY) :
 			  invariant_l_offset(state, skey, P_HIKEY)))
@@ -1150,6 +1220,7 @@ bt_target_page_check(BtreeCheckState *state)
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
 		}
+		skey->scantid = scantid;
 
 		/*
 		 * * Item order check *
@@ -1164,11 +1235,13 @@ bt_target_page_check(BtreeCheckState *state)
 					   *htid,
 					   *nitid,
 					   *nhtid;
+			ItemPointer tid;
 
 			itid = psprintf("(%u,%u)", state->targetblock, offset);
+			tid = BTreeTupleGetHeapTID(itup);
 			htid = psprintf("(%u,%u)",
-							ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							ItemPointerGetBlockNumberNoCheck(tid),
+							ItemPointerGetOffsetNumberNoCheck(tid));
 			nitid = psprintf("(%u,%u)", state->targetblock,
 							 OffsetNumberNext(offset));
 
@@ -1177,9 +1250,11 @@ bt_target_page_check(BtreeCheckState *state)
 										  state->target,
 										  OffsetNumberNext(offset));
 			itup = (IndexTuple) PageGetItem(state->target, itemid);
+
+			tid = BTreeTupleGetHeapTID(itup);
 			nhtid = psprintf("(%u,%u)",
-							 ItemPointerGetBlockNumberNoCheck(&(itup->t_tid)),
-							 ItemPointerGetOffsetNumberNoCheck(&(itup->t_tid)));
+							 ItemPointerGetBlockNumberNoCheck(tid),
+							 ItemPointerGetOffsetNumberNoCheck(tid));
 
 			ereport(ERROR,
 					(errcode(ERRCODE_INDEX_CORRUPTED),
@@ -1189,10 +1264,10 @@ bt_target_page_check(BtreeCheckState *state)
 										"higher index tid=%s (points to %s tid=%s) "
 										"page lsn=%X/%X.",
 										itid,
-										P_ISLEAF(topaque) ? "heap" : "index",
+										P_ISLEAF(topaque) ? "min heap" : "index",
 										htid,
 										nitid,
-										P_ISLEAF(topaque) ? "heap" : "index",
+										P_ISLEAF(topaque) ? "min heap" : "index",
 										nhtid,
 										(uint32) (state->targetlsn >> 32),
 										(uint32) state->targetlsn)));
@@ -1953,10 +2028,10 @@ bt_tuple_present_callback(Relation index, HeapTuple htup, Datum *values,
  * verification.  In particular, it won't try to normalize opclass-equal
  * datums with potentially distinct representations (e.g., btree/numeric_ops
  * index datums will not get their display scale normalized-away here).
- * Normalization may need to be expanded to handle more cases in the future,
- * though.  For example, it's possible that non-pivot tuples could in the
- * future have alternative logically equivalent representations due to using
- * the INDEX_ALT_TID_MASK bit to implement intelligent deduplication.
+ * Caller does normalization for non-pivot tuples that have a posting list,
+ * since dummy CREATE INDEX callback code generates new tuples with the same
+ * normalized representation.  Deduplication is performed opportunistically,
+ * and in general there is no guarantee about how or when it will be applied.
  */
 static IndexTuple
 bt_normalize_tuple(BtreeCheckState *state, IndexTuple itup)
@@ -2087,6 +2162,7 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		insertstate.itup = itup;
 		insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
 		insertstate.itup_key = key;
+		insertstate.in_posting_offset = 0;
 		insertstate.bounds_valid = false;
 		insertstate.buf = lbuf;
 
@@ -2094,7 +2170,9 @@ bt_rootdescend(BtreeCheckState *state, IndexTuple itup)
 		offnum = _bt_binsrch_insert(state->rel, &insertstate);
 		/* Compare first >= matching item on leaf page, if any */
 		page = BufferGetPage(lbuf);
+		/* Should match on first heap TID when tuple has a posting list */
 		if (offnum <= PageGetMaxOffsetNumber(page) &&
+			insertstate.in_posting_offset <= 0 &&
 			_bt_compare(state->rel, key, page, offnum) == 0)
 			exists = true;
 		_bt_relbuf(state->rel, lbuf);
@@ -2560,14 +2638,18 @@ static inline ItemPointer
 BTreeTupleGetHeapTIDCareful(BtreeCheckState *state, IndexTuple itup,
 							bool nonpivot)
 {
-	ItemPointer result = BTreeTupleGetHeapTID(itup);
+	ItemPointer result;
 	BlockNumber targetblock = state->targetblock;
 
-	if (result == NULL && nonpivot)
+	/* Shouldn't be called with heapkeyspace index */
+	Assert(state->heapkeyspace);
+	if (BTreeTupleIsPivot(itup) == nonpivot)
 		ereport(ERROR,
 				(errcode(ERRCODE_INDEX_CORRUPTED),
 				 errmsg("block %u or its right sibling block or child block in index \"%s\" contains non-pivot tuple that lacks a heap TID",
 						targetblock, RelationGetRelationName(state->rel))));
 
+	result = BTreeTupleGetHeapTID(itup);
+
 	return result;
 }
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 6db203e..50ec9ef 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -432,7 +432,10 @@ because we allow LP_DEAD to be set with only a share lock (it's exactly
 like a hint bit for a heap tuple), but physically removing tuples requires
 exclusive lock.  In the current code we try to remove LP_DEAD tuples when
 we are otherwise faced with having to split a page to do an insertion (and
-hence have exclusive lock on it already).
+hence have exclusive lock on it already).  Deduplication can also prevent
+a page split, but removing LP_DEAD tuples is the preferred approach.
+(Note that posting list tuples can only have their LP_DEAD bit set when
+every "logical" tuple represented within the posting list is known dead.)
 
 This leaves the index in a state where it has no entry for a dead tuple
 that still exists in the heap.  This is not a problem for the current
@@ -710,6 +713,77 @@ the fallback strategy assumes that duplicates are mostly inserted in
 ascending heap TID order.  The page is split in a way that leaves the left
 half of the page mostly full, and the right half of the page mostly empty.
 
+Notes about deduplication
+-------------------------
+
+We deduplicate non-pivot tuples in non-unique indexes to reduce storage
+overhead, and to avoid or at least delay page splits.  Deduplication alters
+the physical representation of tuples without changing the logical contents
+of the index, and without adding overhead to read queries.  Non-pivot
+tuples are folded together into a single physical tuple with a posting list
+(a simple array of heap TIDs with the standard item pointer format).
+Deduplication is always applied lazily, at the point where it would
+otherwise be necessary to perform a page split.  It occurs only when
+LP_DEAD items have been removed, as our last line of defense against
+splitting a leaf page.  We can set the LP_DEAD bit with posting list
+tuples, though only when all table tuples are known dead. (Bitmap scans
+cannot perform LP_DEAD bit setting, and are the common case with indexes
+that contain lots of duplicates, so this downside is considered
+acceptable.)
+
+Large groups of logical duplicates tend to appear together on the same leaf
+page due to the special duplicate logic used when choosing a split point.
+This facilitates lazy/dynamic deduplication.  Deduplication can reliably
+deduplicate a large localized group of duplicates before it can span
+multiple leaf pages.  Posting list tuples are subject to the same 1/3 of a
+page restriction as any other tuple.
+
+Lazy deduplication allows the page space accounting used during page splits
+to have absolutely minimal special case logic for posting lists.  A posting
+list can be thought of as extra payload that suffix truncation will
+reliably truncate away as needed during page splits, just like non-key
+columns from an INCLUDE index tuple.  An incoming tuple (which might cause
+a page split) can always be thought of as a non-posting-list tuple that
+must be inserted alongside existing items, without needing to consider
+deduplication.  Most of the time, that's what actually happens: incoming
+tuples are either not duplicates, or are duplicates with a heap TID that
+doesn't overlap with any existing posting list tuple (lazy deduplication
+avoids rewriting posting lists repeatedly when heap TIDs are inserted
+slightly out of order by concurrent inserters).  When the incoming tuple
+really does overlap with an existing posting list, a posting list split is
+performed.  Posting list splits work in a way that more or less preserves
+the illusion that all incoming tuples do not need to be merged with any
+existing posting list tuple.
+
+Posting list splits work by "overriding" the details of the incoming tuple.
+The heap TID of the incoming tuple is altered to make it match the
+rightmost heap TID from the existing/originally overlapping posting list.
+The offset number that the new/incoming tuple is to be inserted at is
+incremented so that it will be inserted to the right of the existing
+posting list.  The insertion (or page split) operation that completes the
+insert does one extra step: an in-place update of the posting list.  The
+update changes the posting list such that the "true" heap TID from the
+original incoming tuple is now contained in the posting list.  We make
+space in the posting list by removing the heap TID that became the new
+item.  The size of the posting list won't change, and so the page split
+space accounting does not need to care about posting lists.  Also, overall
+space utilization is improved by keeping existing posting lists large.
+
+The representation of posting lists is identical to the posting lists used
+by GIN, so it would be straightforward to apply GIN's varbyte encoding
+compression scheme to individual posting lists.  Posting list compression
+would break the assumptions made by posting list splits about page space
+accounting, though, so it's not clear how compression could be integrated
+with nbtree.  Besides, posting list compression does not offer a compelling
+trade-off for nbtree, since in general nbtree is optimized for consistent
+performance with many concurrent readers and writers.  A major goal of
+nbtree's lazy approach to deduplication is to limit the performance impact
+of deduplication with random updates.  Even concurrent append-only inserts
+of the same key value will tend to have inserts of individual index tuples
+in an order that doesn't quite match heap TID order.  In general, delaying
+deduplication avoids many unnecessary posting list splits, and minimizes
+page level fragmentation.
+
 Notes About Data Representation
 -------------------------------
 
diff --git a/src/backend/access/nbtree/nbtinsert.c b/src/backend/access/nbtree/nbtinsert.c
index b84bf1c..605865e 100644
--- a/src/backend/access/nbtree/nbtinsert.c
+++ b/src/backend/access/nbtree/nbtinsert.c
@@ -47,21 +47,26 @@ static void _bt_insertonpg(Relation rel, BTScanInsert itup_key,
 						   BTStack stack,
 						   IndexTuple itup,
 						   OffsetNumber newitemoff,
+						   int in_posting_offset,
 						   bool split_only_page);
 static Buffer _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf,
 						Buffer cbuf, OffsetNumber newitemoff, Size newitemsz,
-						IndexTuple newitem);
+						IndexTuple newitem, IndexTuple original_newitem, IndexTuple nposting,
+						OffsetNumber in_posting_offset);
 static void _bt_insert_parent(Relation rel, Buffer buf, Buffer rbuf,
 							  BTStack stack, bool is_root, bool is_only);
 static bool _bt_pgaddtup(Page page, Size itemsize, IndexTuple itup,
 						 OffsetNumber itup_off);
 static void _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel);
+static void _bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel,
+							   Size itemsz);
 
 /*
  *	_bt_doinsert() -- Handle insertion of a single index tuple in the tree.
  *
  *		This routine is called by the public interface routine, btinsert.
- *		By here, itup is filled in, including the TID.
+ *		By here, itup is filled in, including the TID.  Caller should be
+ *		prepared for us to scribble on 'itup'.
  *
  *		If checkUnique is UNIQUE_CHECK_NO or UNIQUE_CHECK_PARTIAL, this
  *		will allow duplicates.  Otherwise (UNIQUE_CHECK_YES or
@@ -123,6 +128,7 @@ _bt_doinsert(Relation rel, IndexTuple itup,
 	/* PageAddItem will MAXALIGN(), but be consistent */
 	insertstate.itemsz = MAXALIGN(IndexTupleSize(itup));
 	insertstate.itup_key = itup_key;
+	insertstate.in_posting_offset = 0;
 	insertstate.bounds_valid = false;
 	insertstate.buf = InvalidBuffer;
 
@@ -300,7 +306,7 @@ top:
 		newitemoff = _bt_findinsertloc(rel, &insertstate, checkingunique,
 									   stack, heapRel);
 		_bt_insertonpg(rel, itup_key, insertstate.buf, InvalidBuffer, stack,
-					   itup, newitemoff, false);
+					   itup, newitemoff, insertstate.in_posting_offset, false);
 	}
 	else
 	{
@@ -435,6 +441,7 @@ _bt_check_unique(Relation rel, BTInsertState insertstate, Relation heapRel,
 
 				/* okay, we gotta fetch the heap tuple ... */
 				curitup = (IndexTuple) PageGetItem(page, curitemid);
+				Assert(!BTreeTupleIsPosting(curitup));
 				htid = curitup->t_tid;
 
 				/*
@@ -689,6 +696,7 @@ _bt_findinsertloc(Relation rel,
 	BTScanInsert itup_key = insertstate->itup_key;
 	Page		page = BufferGetPage(insertstate->buf);
 	BTPageOpaque lpageop;
+	OffsetNumber location;
 
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
 
@@ -751,13 +759,23 @@ _bt_findinsertloc(Relation rel,
 
 		/*
 		 * If the target page is full, see if we can obtain enough space by
-		 * erasing LP_DEAD items
+		 * erasing LP_DEAD items.  If that doesn't work out, and if the index
+		 * isn't a unique index, try deduplication.
 		 */
-		if (PageGetFreeSpace(page) < insertstate->itemsz &&
-			P_HAS_GARBAGE(lpageop))
+		if (PageGetFreeSpace(page) < insertstate->itemsz)
 		{
-			_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
-			insertstate->bounds_valid = false;
+			if (P_HAS_GARBAGE(lpageop))
+			{
+				_bt_vacuum_one_page(rel, insertstate->buf, heapRel);
+				insertstate->bounds_valid = false;
+			}
+
+			if (!checkingunique && PageGetFreeSpace(page) < insertstate->itemsz)
+			{
+				_bt_dedup_one_page(rel, insertstate->buf, heapRel,
+								   insertstate->itemsz);
+				insertstate->bounds_valid = false;	/* paranoia */
+			}
 		}
 	}
 	else
@@ -839,7 +857,31 @@ _bt_findinsertloc(Relation rel,
 	Assert(P_RIGHTMOST(lpageop) ||
 		   _bt_compare(rel, itup_key, page, P_HIKEY) <= 0);
 
-	return _bt_binsrch_insert(rel, insertstate);
+	location = _bt_binsrch_insert(rel, insertstate);
+
+	/*
+	 * Insertion is not prepared for the case where an LP_DEAD posting list
+	 * tuple must be split.  In the unlikely event that this happens, call
+	 * _bt_dedup_one_page() to force it to kill all LP_DEAD items.
+	 */
+	if (unlikely(insertstate->in_posting_offset == -1))
+	{
+		_bt_dedup_one_page(rel, insertstate->buf, heapRel, 0);
+		Assert(!P_HAS_GARBAGE(lpageop));
+
+		/* Must reset insertstate ahead of new _bt_binsrch_insert() call */
+		insertstate->bounds_valid = false;
+		insertstate->in_posting_offset = 0;
+		location = _bt_binsrch_insert(rel, insertstate);
+
+		/*
+		 * Might still have to split some other posting list now, but that
+		 * should never be LP_DEAD
+		 */
+		Assert(insertstate->in_posting_offset >= 0);
+	}
+
+	return location;
 }
 
 /*
@@ -900,15 +942,65 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
 	insertstate->bounds_valid = false;
 }
 
+/*
+ * If the new tuple 'itup' is a duplicate with a heap TID that falls inside
+ * the range of an existing posting list tuple 'oposting', generate new
+ * posting tuple to replace original one and update new tuple so that
+ * it's heap TID contains the rightmost heap TID of original posting tuple.
+ */
+IndexTuple
+_bt_form_newposting(IndexTuple itup, IndexTuple oposting,
+				   OffsetNumber in_posting_offset)
+{
+	int			nipd;
+	char	   *replacepos;
+	char	   *rightpos;
+	Size		nbytes;
+	IndexTuple nposting;
+
+	Assert(BTreeTupleIsPosting(oposting));
+	nipd = BTreeTupleGetNPosting(oposting);
+	Assert(in_posting_offset < nipd);
+
+	nposting = CopyIndexTuple(oposting);
+	replacepos = (char *) BTreeTupleGetPostingN(nposting, in_posting_offset);
+	rightpos = replacepos + sizeof(ItemPointerData);
+	nbytes = (nipd - in_posting_offset - 1) * sizeof(ItemPointerData);
+
+	/*
+	 * Move item pointers in posting list to make a gap for the new item's
+	 * heap TID (shift TIDs one place to the right, losing original
+	 * rightmost TID).
+	 */
+	memmove(rightpos, replacepos, nbytes);
+
+	/*
+	 * Fill the gap with the TID of the new item.
+	 */
+	ItemPointerCopy(&itup->t_tid, (ItemPointer) replacepos);
+
+	/*
+	 * Copy original (not new original) posting list's last TID into new
+	 * item
+	 */
+	ItemPointerCopy(BTreeTupleGetPostingN(oposting, nipd - 1), &itup->t_tid);
+	Assert(ItemPointerCompare(BTreeTupleGetMaxTID(nposting),
+								BTreeTupleGetHeapTID(itup)) < 0);
+
+	return nposting;
+}
+
 /*----------
  *	_bt_insertonpg() -- Insert a tuple on a particular page in the index.
  *
  *		This recursive procedure does the following things:
  *
+ *			+  if necessary, splits an existing posting list on page.
+ *			   This is only needed when 'in_posting_offset' is non-zero.
  *			+  if necessary, splits the target page, using 'itup_key' for
  *			   suffix truncation on leaf pages (caller passes NULL for
  *			   non-leaf pages).
- *			+  inserts the tuple.
+ *			+  inserts the new tuple (could be from split posting list).
  *			+  if the page was split, pops the parent stack, and finds the
  *			   right place to insert the new child pointer (by walking
  *			   right using information stored in the parent stack).
@@ -918,7 +1010,8 @@ _bt_stepright(Relation rel, BTInsertState insertstate, BTStack stack)
  *
  *		On entry, we must have the correct buffer in which to do the
  *		insertion, and the buffer must be pinned and write-locked.  On return,
- *		we will have dropped both the pin and the lock on the buffer.
+ *		we will have dropped both the pin and the lock on the buffer.  Caller
+ *		should be prepared for us to scribble on 'itup'.
  *
  *		This routine only performs retail tuple insertions.  'itup' should
  *		always be either a non-highkey leaf item, or a downlink (new high
@@ -936,11 +1029,15 @@ _bt_insertonpg(Relation rel,
 			   BTStack stack,
 			   IndexTuple itup,
 			   OffsetNumber newitemoff,
+			   int in_posting_offset,
 			   bool split_only_page)
 {
 	Page		page;
 	BTPageOpaque lpageop;
 	Size		itemsz;
+	IndexTuple	nposting = NULL;
+	IndexTuple	oposting;
+	IndexTuple	original_itup = NULL;
 
 	page = BufferGetPage(buf);
 	lpageop = (BTPageOpaque) PageGetSpecialPointer(page);
@@ -954,6 +1051,8 @@ _bt_insertonpg(Relation rel,
 	Assert(P_ISLEAF(lpageop) ||
 		   BTreeTupleGetNAtts(itup, rel) <=
 		   IndexRelationGetNumberOfKeyAttributes(rel));
+	/* retail insertions of posting list tuples are disallowed */
+	Assert(!BTreeTupleIsPosting(itup));
 
 	/* The caller should've finished any incomplete splits already. */
 	if (P_INCOMPLETE_SPLIT(lpageop))
@@ -965,6 +1064,47 @@ _bt_insertonpg(Relation rel,
 								 * need to be consistent */
 
 	/*
+	 * Do we need to split an existing posting list item?
+	 */
+	if (in_posting_offset != 0)
+	{
+		ItemId		itemid = PageGetItemId(page, newitemoff);
+
+		/*
+		 * The new tuple is a duplicate with a heap TID that falls inside the
+		 * range of an existing posting list tuple, so split posting list.
+		 *
+		 * Posting list splits always replace some existing TID in the posting
+		 * list with the new item's heap TID (based on a posting list offset
+		 * from caller) by removing rightmost heap TID from posting list.  The
+		 * new item's heap TID is swapped with that rightmost heap TID, almost
+		 * as if the tuple inserted never overlapped with a posting list in
+		 * the first place.  This allows the insertion and page split code to
+		 * have minimal special case handling of posting lists.
+		 *
+		 * The only extra handling required is to overwrite the original
+		 * posting list with nposting, which is guaranteed to be the same size
+		 * as the original, keeping the page space accounting simple.  This
+		 * takes place in either the page insert or page split critical
+		 * section.
+		 */
+		Assert(P_ISLEAF(lpageop));
+		Assert(!ItemIdIsDead(itemid));
+		Assert(in_posting_offset > 0);
+		oposting = (IndexTuple) PageGetItem(page, itemid);
+
+		/* save a copy of itup with unchanged TID to write it into xlog record */
+		original_itup = CopyIndexTuple(itup);
+
+		nposting = _bt_form_newposting(itup, oposting, in_posting_offset);
+
+		Assert(BTreeTupleGetNPosting(nposting) == BTreeTupleGetNPosting(oposting));
+
+		/* Alter new item offset, since effective new item changed */
+		newitemoff = OffsetNumberNext(newitemoff);
+	}
+
+	/*
 	 * Do we need to split the page to fit the item on it?
 	 *
 	 * Note: PageGetFreeSpace() subtracts sizeof(ItemIdData) from its result,
@@ -996,7 +1136,8 @@ _bt_insertonpg(Relation rel,
 				 BlockNumberIsValid(RelationGetTargetBlock(rel))));
 
 		/* split the buffer into left and right halves */
-		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup);
+		rbuf = _bt_split(rel, itup_key, buf, cbuf, newitemoff, itemsz, itup,
+						 original_itup, nposting, in_posting_offset);
 		PredicateLockPageSplit(rel,
 							   BufferGetBlockNumber(buf),
 							   BufferGetBlockNumber(rbuf));
@@ -1075,6 +1216,18 @@ _bt_insertonpg(Relation rel,
 			elog(PANIC, "failed to add new item to block %u in index \"%s\"",
 				 itup_blkno, RelationGetRelationName(rel));
 
+		if (nposting)
+		{
+			/*
+			 * Handle a posting list split by performing an in-place update of
+			 * the existing posting list
+			 */
+			Assert(P_ISLEAF(lpageop));
+			Assert(MAXALIGN(IndexTupleSize(oposting)) ==
+				   MAXALIGN(IndexTupleSize(nposting)));
+			memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+		}
+
 		MarkBufferDirty(buf);
 
 		if (BufferIsValid(metabuf))
@@ -1116,6 +1269,7 @@ _bt_insertonpg(Relation rel,
 			XLogRecPtr	recptr;
 
 			xlrec.offnum = itup_off;
+			xlrec.in_posting_offset = in_posting_offset;
 
 			XLogBeginInsert();
 			XLogRegisterData((char *) &xlrec, SizeOfBtreeInsert);
@@ -1152,7 +1306,10 @@ _bt_insertonpg(Relation rel,
 			}
 
 			XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
-			XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
+			if (original_itup)
+				XLogRegisterBufData(0, (char *) original_itup, IndexTupleSize(original_itup));
+			else
+				XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));
 
 			recptr = XLogInsert(RM_BTREE_ID, xlinfo);
 
@@ -1194,6 +1351,13 @@ _bt_insertonpg(Relation rel,
 			_bt_getrootheight(rel) >= BTREE_FASTPATH_MIN_LEVEL)
 			RelationSetTargetBlock(rel, cachedBlock);
 	}
+
+	/* be tidy */
+	if (nposting)
+		pfree(nposting);
+	if (original_itup)
+		pfree(original_itup);
+
 }
 
 /*
@@ -1211,10 +1375,17 @@ _bt_insertonpg(Relation rel,
  *
  *		Returns the new right sibling of buf, pinned and write-locked.
  *		The pin and lock on buf are maintained.
+ *
+ *		nposting is a replacement posting for the posting list at the
+ *		offset immediately before the new item's offset.  This is needed
+ *		when caller performed "posting list split", and corresponds to the
+ *		same step for retail insertions that don't split the page.
  */
 static Buffer
 _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
-		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem)
+		  OffsetNumber newitemoff, Size newitemsz, IndexTuple newitem,
+		  IndexTuple original_newitem,
+		  IndexTuple nposting, OffsetNumber in_posting_offset)
 {
 	Buffer		rbuf;
 	Page		origpage;
@@ -1236,6 +1407,7 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	OffsetNumber firstright;
 	OffsetNumber maxoff;
 	OffsetNumber i;
+	OffsetNumber replacepostingoff = InvalidOffsetNumber;
 	bool		newitemonleft,
 				isleaf;
 	IndexTuple	lefthikey;
@@ -1243,6 +1415,13 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	int			indnkeyatts = IndexRelationGetNumberOfKeyAttributes(rel);
 
 	/*
+	 * Determine offset number of posting list that will be updated in place
+	 * as part of split that follows a posting list split
+	 */
+	if (nposting != NULL)
+		replacepostingoff = OffsetNumberPrev(newitemoff);
+
+	/*
 	 * origpage is the original page to be split.  leftpage is a temporary
 	 * buffer that receives the left-sibling data, which will be copied back
 	 * into origpage on success.  rightpage is the new page that will receive
@@ -1273,6 +1452,13 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 	 * newitemoff == firstright.  In all other cases it's clear which side of
 	 * the split every tuple goes on from context.  newitemonleft is usually
 	 * (but not always) redundant information.
+	 *
+	 * Note: In theory, the split point choice logic should operate against a
+	 * version of the page that already replaced the posting list at offset
+	 * replacepostingoff with nposting where applicable.  We don't bother with
+	 * that, though.  Both versions of the posting list must be the same size
+	 * and have the same key values, so this omission can't affect the split
+	 * point chosen in practice.
 	 */
 	firstright = _bt_findsplitloc(rel, origpage, newitemoff, newitemsz,
 								  newitem, &newitemonleft);
@@ -1340,6 +1526,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemid = PageGetItemId(origpage, firstright);
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
+		/* Behave as if origpage posting list has already been swapped */
+		if (firstright == replacepostingoff)
+			item = nposting;
 	}
 
 	/*
@@ -1373,6 +1562,9 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 			Assert(lastleftoff >= P_FIRSTDATAKEY(oopaque));
 			itemid = PageGetItemId(origpage, lastleftoff);
 			lastleft = (IndexTuple) PageGetItem(origpage, itemid);
+			/* Behave as if origpage posting list has already been swapped */
+			if (lastleftoff == replacepostingoff)
+				lastleft = nposting;
 		}
 
 		Assert(lastleft != item);
@@ -1480,8 +1672,23 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		itemsz = ItemIdGetLength(itemid);
 		item = (IndexTuple) PageGetItem(origpage, itemid);
 
+		/*
+		 * did caller pass new replacement posting list tuple due to posting
+		 * list split?
+		 */
+		if (i == replacepostingoff)
+		{
+			/*
+			 * swap origpage posting list with post-posting-list-split version
+			 * from caller
+			 */
+			Assert(isleaf);
+			Assert(itemsz == MAXALIGN(IndexTupleSize(nposting)));
+			item = nposting;
+		}
+
 		/* does new item belong before this one? */
-		if (i == newitemoff)
+		else if (i == newitemoff)
 		{
 			if (newitemonleft)
 			{
@@ -1653,6 +1860,17 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		xlrec.firstright = firstright;
 		xlrec.newitemoff = newitemoff;
 
+		/*
+		 * If replacing posting item was put on the right page,
+		 * we don't need to explicitly WAL log it because it's included
+		 * with all the other items on the right page.
+		 * Otherwise, save in_posting_offset and newitem to construct
+		 * replacing tuple.
+		 */
+		xlrec.in_posting_offset = InvalidOffsetNumber;
+		if (replacepostingoff < firstright)
+			xlrec.in_posting_offset = in_posting_offset;
+
 		XLogBeginInsert();
 		XLogRegisterData((char *) &xlrec, SizeOfBtreeSplit);
 
@@ -1672,9 +1890,23 @@ _bt_split(Relation rel, BTScanInsert itup_key, Buffer buf, Buffer cbuf,
 		 * is not stored if XLogInsert decides it needs a full-page image of
 		 * the left page.  We store the offset anyway, though, to support
 		 * archive compression of these records.
+		 *
+		 * Also save newitem in case posting split was required
+		 * to construct new posting.
 		 */
-		if (newitemonleft)
-			XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+		if (newitemonleft || xlrec.in_posting_offset)
+		{
+			if (xlrec.in_posting_offset)
+			{
+				Assert(original_newitem != NULL);
+				Assert(ItemPointerCompare(&original_newitem->t_tid, &newitem->t_tid) != 0);
+
+				XLogRegisterBufData(0, (char *) original_newitem,
+									MAXALIGN(IndexTupleSize(original_newitem)));
+			}
+			else
+				XLogRegisterBufData(0, (char *) newitem, MAXALIGN(newitemsz));
+		}
 
 		/* Log the left page's new high key */
 		itemid = PageGetItemId(origpage, P_HIKEY);
@@ -1834,7 +2066,7 @@ _bt_insert_parent(Relation rel,
 
 		/* Recursively insert into the parent */
 		_bt_insertonpg(rel, NULL, pbuf, buf, stack->bts_parent,
-					   new_item, stack->bts_offset + 1,
+					   new_item, stack->bts_offset + 1, 0,
 					   is_only);
 
 		/* be tidy */
@@ -2304,6 +2536,277 @@ _bt_vacuum_one_page(Relation rel, Buffer buffer, Relation heapRel)
 	 * Note: if we didn't find any LP_DEAD items, then the page's
 	 * BTP_HAS_GARBAGE hint bit is falsely set.  We do not bother expending a
 	 * separate write to clear it, however.  We will clear it when we split
-	 * the page.
+	 * the page (or when deduplication runs).
+	 */
+}
+
+/*
+ * Try to deduplicate items to free some space.  If we don't proceed with
+ * deduplication, buffer will contain old state of the page.
+ *
+ * 'itemsz' is the size of the inserter caller's incoming/new tuple, not
+ * including line pointer overhead.  This is the amount of space we'll need to
+ * free in order to let caller avoid splitting the page.
+ *
+ * This function should be called after LP_DEAD items were removed by
+ * _bt_vacuum_one_page() to prevent a page split.  (It's possible that we'll
+ * have to kill additional LP_DEAD items, but that should be rare.)
+ */
+static void
+_bt_dedup_one_page(Relation rel, Buffer buffer, Relation heapRel, Size itemsz)
+{
+	OffsetNumber offnum,
+				minoff,
+				maxoff;
+	Page		page = BufferGetPage(buffer);
+	Page		newpage;
+	BTPageOpaque oopaque,
+				nopaque;
+	bool		deduplicate = false;
+	BTDedupState *dedupState = NULL;
+	int			natts = IndexRelationGetNumberOfAttributes(rel);
+	OffsetNumber deletable[MaxOffsetNumber];
+	int			ndeletable = 0;
+
+	/*
+	 * Don't use deduplication for indexes with INCLUDEd columns and unique
+	 * indexes
+	 */
+	deduplicate = (IndexRelationGetNumberOfKeyAttributes(rel) ==
+				   IndexRelationGetNumberOfAttributes(rel) &&
+				   !rel->rd_index->indisunique);
+	if (!deduplicate)
+		return;
+
+	oopaque = (BTPageOpaque) PageGetSpecialPointer(page);
+	/* init deduplication state needed to build posting tuples */
+	dedupState = (BTDedupState *) palloc0(sizeof(BTDedupState));
+	dedupState->ipd = NULL;
+	dedupState->ntuples = 0;
+	dedupState->itupprev = NULL;
+	dedupState->maxitemsize = BTMaxItemSize(page);
+	dedupState->maxpostingsize = 0;
+
+	minoff = P_FIRSTDATAKEY(oopaque);
+	maxoff = PageGetMaxOffsetNumber(page);
+
+	/*
+	 * Delete dead tuples if any. We cannot simply skip them in the cycle
+	 * below, because it's necessary to generate special Xlog record
+	 * containing such tuples to compute latestRemovedXid on a standby server
+	 * later.
+	 *
+	 * This should not affect performance, since it only can happen in a rare
+	 * situation when BTP_HAS_GARBAGE flag was not set and _bt_vacuum_one_page
+	 * was not called, or _bt_vacuum_one_page didn't remove all dead items.
 	 */
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid = PageGetItemId(page, offnum);
+
+		if (ItemIdIsDead(itemid))
+			deletable[ndeletable++] = offnum;
+	}
+
+	if (ndeletable > 0)
+	{
+		/*
+		 * Skip duplication in rare cases where there were LP_DEAD items
+		 * encountered here when that frees sufficient space for caller to
+		 * avoid a page split
+		 */
+		_bt_delitems_delete(rel, buffer, deletable, ndeletable, heapRel);
+		if (PageGetFreeSpace(page) >= itemsz)
+		{
+			pfree(dedupState);
+			return;
+		}
+
+		/* Continue with deduplication */
+		minoff = P_FIRSTDATAKEY(oopaque);
+		maxoff = PageGetMaxOffsetNumber(page);
+	}
+
+	/*
+	 * Scan over all items to see which ones can be deduplicated
+	 */
+	newpage = PageGetTempPageCopySpecial(page);
+	nopaque = (BTPageOpaque) PageGetSpecialPointer(newpage);
+
+	/* Make sure that new page won't have garbage flag set */
+	nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+
+	/* Copy High Key if any */
+	if (!P_RIGHTMOST(oopaque))
+	{
+		ItemId		hitemid = PageGetItemId(page, P_HIKEY);
+		Size		hitemsz = ItemIdGetLength(hitemid);
+		IndexTuple	hitem = (IndexTuple) PageGetItem(page, hitemid);
+
+		if (PageAddItem(newpage, (Item) hitem, hitemsz, P_HIKEY,
+						false, false) == InvalidOffsetNumber)
+			elog(ERROR, "failed to add highkey during deduplication");
+	}
+
+	/*
+	 * Iterate over tuples on the page, try to deduplicate them into posting
+	 * lists and insert into new page.
+	 */
+	for (offnum = minoff;
+		 offnum <= maxoff;
+		 offnum = OffsetNumberNext(offnum))
+	{
+		ItemId		itemid = PageGetItemId(page, offnum);
+		IndexTuple	itup = (IndexTuple) PageGetItem(page, itemid);
+
+		Assert(!ItemIdIsDead(itemid));
+
+		if (dedupState->itupprev == NULL)
+		{
+			/* Just set up base/first item in first iteration */
+			Assert(offnum == minoff);
+			dedupState->itupprev = CopyIndexTuple(itup);
+			dedupState->itupprev_off = offnum;
+			continue;
+		}
+
+		if (deduplicate &&
+			_bt_keep_natts_fast(rel, dedupState->itupprev, itup) > natts)
+		{
+			int			itup_ntuples;
+			Size		projpostingsz;
+
+			/*
+			 * Tuples are equal.
+			 *
+			 * If posting list does not exceed tuple size limit then append
+			 * the tuple to the pending posting list.  Otherwise, insert it on
+			 * page and continue with this tuple as new pending posting list.
+			 */
+			itup_ntuples = BTreeTupleIsPosting(itup) ?
+				BTreeTupleGetNPosting(itup) : 1;
+
+			/*
+			 * Project size of new posting list that would result from merging
+			 * current tup with pending posting list (could just be prev item
+			 * that's "pending").
+			 *
+			 * This accounting looks odd, but it's correct because ...
+			 */
+			projpostingsz = MAXALIGN(IndexTupleSize(dedupState->itupprev) +
+									 (dedupState->ntuples + itup_ntuples + 1) *
+									 sizeof(ItemPointerData));
+
+			if (projpostingsz <= dedupState->maxitemsize)
+				_bt_stash_item_tid(dedupState, itup, offnum);
+			else
+				_bt_dedup_insert(newpage, dedupState);
+		}
+		else
+		{
+			/*
+			 * Tuples are not equal, or we're done deduplicating this page.
+			 *
+			 * Insert pending posting list on page.  This could just be a
+			 * regular tuple.
+			 */
+			_bt_dedup_insert(newpage, dedupState);
+		}
+
+		pfree(dedupState->itupprev);
+		dedupState->itupprev = CopyIndexTuple(itup);
+		dedupState->itupprev_off = offnum;
+
+		Assert(IndexTupleSize(dedupState->itupprev) <= dedupState->maxitemsize);
+	}
+
+	/* Handle the last item */
+	_bt_dedup_insert(newpage, dedupState);
+
+	/*
+	 * If no items suitable for deduplication were found, newpage must be
+	 * exactly the same as the original page, so just return from function.
+	 */
+	if (dedupState->n_intervals == 0)
+	{
+		pfree(dedupState);
+		return;
+	}
+
+	START_CRIT_SECTION();
+
+	PageRestoreTempPage(newpage, page);
+	MarkBufferDirty(buffer);
+
+	/* Log full page write */
+	if (RelationNeedsWAL(rel))
+	{
+		XLogRecPtr	recptr;
+		xl_btree_dedup xlrec_dedup;
+
+		xlrec_dedup.n_intervals =  dedupState->n_intervals;
+
+		XLogBeginInsert();
+		XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+		XLogRegisterData((char *) &xlrec_dedup, SizeOfBtreeDedup);
+
+		/* only save non-empthy part of the array */
+		if (dedupState->n_intervals > 0)
+			XLogRegisterData((char *) dedupState->dedup_intervals,
+							 dedupState->n_intervals * sizeof(dedupInterval));
+
+		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_DEDUP_PAGE);
+
+		PageSetLSN(page, recptr);
+	}
+
+	END_CRIT_SECTION();
+
+	/* be tidy */
+	pfree(dedupState);
+}
+
+/*
+ * Add new posting tuple item to the page based on itupprev and saved list of
+ * heap TIDs.
+ */
+void
+_bt_dedup_insert(Page page, BTDedupState *dedupState)
+{
+	IndexTuple	to_insert;
+	OffsetNumber offnum = PageGetMaxOffsetNumber(page);
+
+	if (dedupState->ntuples == 0)
+	{
+		/*
+		 * Use original itupprev, which may or may not be a posting list
+		 * already from some earlier dedup attempt
+		 */
+		to_insert = dedupState->itupprev;
+	}
+	else
+	{
+		IndexTuple	postingtuple;
+
+		/* form a tuple with a posting list */
+		postingtuple = BTreeFormPostingTuple(dedupState->itupprev,
+											 dedupState->ipd,
+											 dedupState->ntuples);
+		to_insert = postingtuple;
+		pfree(dedupState->ipd);
+	}
+
+	Assert(IndexTupleSize(dedupState->itupprev) <= dedupState->maxitemsize);
+	/* Add the new item into the page */
+	offnum = OffsetNumberNext(offnum);
+
+	if (PageAddItem(page, (Item) to_insert, IndexTupleSize(to_insert),
+					offnum, false, false) == InvalidOffsetNumber)
+		elog(ERROR, "deduplication failed to add tuple to page");
+
+	if (dedupState->ntuples > 0)
+		pfree(to_insert);
+	dedupState->ntuples = 0;
 }
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 268f869..5314bbe 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -24,6 +24,7 @@
 
 #include "access/nbtree.h"
 #include "access/nbtxlog.h"
+#include "access/tableam.h"
 #include "access/transam.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -42,6 +43,11 @@ static bool _bt_lock_branch_parent(Relation rel, BlockNumber child,
 								   BlockNumber *target, BlockNumber *rightsib);
 static void _bt_log_reuse_page(Relation rel, BlockNumber blkno,
 							   TransactionId latestRemovedXid);
+static TransactionId _bt_compute_xid_horizon_for_tuples(Relation rel,
+														Relation heapRel,
+														Buffer buf,
+														OffsetNumber *itemnos,
+														int nitems);
 
 /*
  *	_bt_initmetapage() -- Fill a page buffer with a correct metapage image
@@ -983,14 +989,52 @@ _bt_page_recyclable(Page page)
 void
 _bt_delitems_vacuum(Relation rel, Buffer buf,
 					OffsetNumber *itemnos, int nitems,
+					OffsetNumber *remainingoffset,
+					IndexTuple *remaining, int nremaining,
 					BlockNumber lastBlockVacuumed)
 {
 	Page		page = BufferGetPage(buf);
 	BTPageOpaque opaque;
+	Size		itemsz;
+	Size		remaining_sz = 0;
+	char	   *remaining_buf = NULL;
+
+	/* XLOG stuff, buffer for remainings */
+	if (nremaining && RelationNeedsWAL(rel))
+	{
+		Size		offset = 0;
+
+		for (int i = 0; i < nremaining; i++)
+			remaining_sz += MAXALIGN(IndexTupleSize(remaining[i]));
+
+		remaining_buf = palloc0(remaining_sz);
+		for (int i = 0; i < nremaining; i++)
+		{
+			itemsz = IndexTupleSize(remaining[i]);
+			memcpy(remaining_buf + offset, (char *) remaining[i], itemsz);
+			offset += MAXALIGN(itemsz);
+		}
+		Assert(offset == remaining_sz);
+	}
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
 
+	/* Handle posting tuples here */
+	for (int i = 0; i < nremaining; i++)
+	{
+		/* At first, delete the old tuple. */
+		PageIndexTupleDelete(page, remainingoffset[i]);
+
+		itemsz = IndexTupleSize(remaining[i]);
+		itemsz = MAXALIGN(itemsz);
+
+		/* Add tuple with remaining ItemPointers to the page. */
+		if (PageAddItem(page, (Item) remaining[i], itemsz, remainingoffset[i],
+						false, false) == InvalidOffsetNumber)
+			elog(ERROR, "failed to rewrite posting list item in index while doing vacuum");
+	}
+
 	/* Fix the page */
 	if (nitems > 0)
 		PageIndexMultiDelete(page, itemnos, nitems);
@@ -1020,6 +1064,8 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		xl_btree_vacuum xlrec_vacuum;
 
 		xlrec_vacuum.lastBlockVacuumed = lastBlockVacuumed;
+		xlrec_vacuum.nremaining = nremaining;
+		xlrec_vacuum.ndeleted = nitems;
 
 		XLogBeginInsert();
 		XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
@@ -1033,6 +1079,19 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 		if (nitems > 0)
 			XLogRegisterBufData(0, (char *) itemnos, nitems * sizeof(OffsetNumber));
 
+		/*
+		 * Here we should save offnums and remaining tuples themselves. It's
+		 * important to restore them in correct order. At first, we must
+		 * handle remaining tuples and only after that other deleted items.
+		 */
+		if (nremaining > 0)
+		{
+			Assert(remaining_buf != NULL);
+			XLogRegisterBufData(0, (char *) remainingoffset,
+								nremaining * sizeof(OffsetNumber));
+			XLogRegisterBufData(0, remaining_buf, remaining_sz);
+		}
+
 		recptr = XLogInsert(RM_BTREE_ID, XLOG_BTREE_VACUUM);
 
 		PageSetLSN(page, recptr);
@@ -1042,6 +1101,91 @@ _bt_delitems_vacuum(Relation rel, Buffer buf,
 }
 
 /*
+ * Get the latestRemovedXid from the table entries pointed at by the index
+ * tuples being deleted.
+ *
+ * This is a version of index_compute_xid_horizon_for_tuples() specialized to
+ * nbtree, which can handle posting lists.
+ */
+static TransactionId
+_bt_compute_xid_horizon_for_tuples(Relation rel, Relation heapRel,
+								   Buffer buf, OffsetNumber *itemnos,
+								   int nitems)
+{
+	ItemPointerData *ttids;
+	TransactionId latestRemovedXid = InvalidTransactionId;
+	Page		page = BufferGetPage(buf);
+	int			arraynitems;
+	int			finalnitems;
+
+	/*
+	 * Initial size of array can fit everything when it turns out that are no
+	 * posting lists
+	 */
+	arraynitems = nitems;
+	ttids = (ItemPointerData *) palloc(sizeof(ItemPointerData) * arraynitems);
+
+	finalnitems = 0;
+	/* identify what the index tuples about to be deleted point to */
+	for (int i = 0; i < nitems; i++)
+	{
+		ItemId		itemid;
+		IndexTuple	itup;
+
+		itemid = PageGetItemId(page, itemnos[i]);
+		itup = (IndexTuple) PageGetItem(page, itemid);
+
+		Assert(ItemIdIsDead(itemid));
+
+		if (!BTreeTupleIsPosting(itup))
+		{
+			/* Make sure that we have space for additional heap TID */
+			if (finalnitems + 1 > arraynitems)
+			{
+				arraynitems = arraynitems * 2;
+				ttids = (ItemPointerData *)
+					repalloc(ttids, sizeof(ItemPointerData) * arraynitems);
+			}
+
+			Assert(ItemPointerIsValid(&itup->t_tid));
+			ItemPointerCopy(&itup->t_tid, &ttids[finalnitems]);
+			finalnitems++;
+		}
+		else
+		{
+			int			nposting = BTreeTupleGetNPosting(itup);
+
+			/* Make sure that we have space for additional heap TIDs */
+			if (finalnitems + nposting > arraynitems)
+			{
+				arraynitems = Max(arraynitems * 2, finalnitems + nposting);
+				ttids = (ItemPointerData *)
+					repalloc(ttids, sizeof(ItemPointerData) * arraynitems);
+			}
+
+			for (int j = 0; j < nposting; j++)
+			{
+				ItemPointer htid = BTreeTupleGetPostingN(itup, j);
+
+				Assert(ItemPointerIsValid(htid));
+				ItemPointerCopy(htid, &ttids[finalnitems]);
+				finalnitems++;
+			}
+		}
+	}
+
+	Assert(finalnitems >= nitems);
+
+	/* determine the actual xid horizon */
+	latestRemovedXid =
+		table_compute_xid_horizon_for_tuples(heapRel, ttids, finalnitems);
+
+	pfree(ttids);
+
+	return latestRemovedXid;
+}
+
+/*
  * Delete item(s) from a btree page during single-page cleanup.
  *
  * As above, must only be used on leaf pages.
@@ -1067,8 +1211,8 @@ _bt_delitems_delete(Relation rel, Buffer buf,
 
 	if (XLogStandbyInfoActive() && RelationNeedsWAL(rel))
 		latestRemovedXid =
-			index_compute_xid_horizon_for_tuples(rel, heapRel, buf,
-												 itemnos, nitems);
+			_bt_compute_xid_horizon_for_tuples(rel, heapRel, buf,
+											   itemnos, nitems);
 
 	/* No ereport(ERROR) until changes are logged */
 	START_CRIT_SECTION();
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 4cfd528..6759531 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -97,6 +97,8 @@ static void btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 						 BTCycleId cycleid, TransactionId *oldestBtpoXact);
 static void btvacuumpage(BTVacState *vstate, BlockNumber blkno,
 						 BlockNumber orig_blkno);
+static ItemPointer btreevacuumPosting(BTVacState *vstate, IndexTuple itup,
+									  int *nremaining);
 
 
 /*
@@ -263,8 +265,8 @@ btgettuple(IndexScanDesc scan, ScanDirection dir)
 				 */
 				if (so->killedItems == NULL)
 					so->killedItems = (int *)
-						palloc(MaxIndexTuplesPerPage * sizeof(int));
-				if (so->numKilled < MaxIndexTuplesPerPage)
+						palloc(MaxPostingIndexTuplesPerPage * sizeof(int));
+				if (so->numKilled < MaxPostingIndexTuplesPerPage)
 					so->killedItems[so->numKilled++] = so->currPos.itemIndex;
 			}
 
@@ -1069,7 +1071,8 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
 								 RBM_NORMAL, info->strategy);
 		LockBufferForCleanup(buf);
 		_bt_checkpage(rel, buf);
-		_bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+		_bt_delitems_vacuum(rel, buf, NULL, 0, NULL, NULL, 0,
+							vstate.lastBlockVacuumed);
 		_bt_relbuf(rel, buf);
 	}
 
@@ -1193,6 +1196,9 @@ restart:
 		OffsetNumber offnum,
 					minoff,
 					maxoff;
+		IndexTuple	remaining[MaxOffsetNumber];
+		OffsetNumber remainingoffset[MaxOffsetNumber];
+		int			nremaining;
 
 		/*
 		 * Trade in the initial read lock for a super-exclusive write lock on
@@ -1229,6 +1235,7 @@ restart:
 		 * callback function.
 		 */
 		ndeletable = 0;
+		nremaining = 0;
 		minoff = P_FIRSTDATAKEY(opaque);
 		maxoff = PageGetMaxOffsetNumber(page);
 		if (callback)
@@ -1242,31 +1249,79 @@ restart:
 
 				itup = (IndexTuple) PageGetItem(page,
 												PageGetItemId(page, offnum));
-				htup = &(itup->t_tid);
 
-				/*
-				 * During Hot Standby we currently assume that
-				 * XLOG_BTREE_VACUUM records do not produce conflicts. That is
-				 * only true as long as the callback function depends only
-				 * upon whether the index tuple refers to heap tuples removed
-				 * in the initial heap scan. When vacuum starts it derives a
-				 * value of OldestXmin. Backends taking later snapshots could
-				 * have a RecentGlobalXmin with a later xid than the vacuum's
-				 * OldestXmin, so it is possible that row versions deleted
-				 * after OldestXmin could be marked as killed by other
-				 * backends. The callback function *could* look at the index
-				 * tuple state in isolation and decide to delete the index
-				 * tuple, though currently it does not. If it ever did, we
-				 * would need to reconsider whether XLOG_BTREE_VACUUM records
-				 * should cause conflicts. If they did cause conflicts they
-				 * would be fairly harsh conflicts, since we haven't yet
-				 * worked out a way to pass a useful value for
-				 * latestRemovedXid on the XLOG_BTREE_VACUUM records. This
-				 * applies to *any* type of index that marks index tuples as
-				 * killed.
-				 */
-				if (callback(htup, callback_state))
-					deletable[ndeletable++] = offnum;
+				if (BTreeTupleIsPosting(itup))
+				{
+					int			nnewipd = 0;
+					ItemPointer newipd = NULL;
+
+					newipd = btreevacuumPosting(vstate, itup, &nnewipd);
+
+					if (nnewipd == 0)
+					{
+						/*
+						 * All TIDs from posting list must be deleted, we can
+						 * delete whole tuple in a regular way.
+						 */
+						deletable[ndeletable++] = offnum;
+					}
+					else if (nnewipd == BTreeTupleGetNPosting(itup))
+					{
+						/*
+						 * All TIDs from posting tuple must remain. Do
+						 * nothing, just cleanup.
+						 */
+						pfree(newipd);
+					}
+					else if (nnewipd < BTreeTupleGetNPosting(itup))
+					{
+						/* Some TIDs from posting tuple must remain. */
+						Assert(nnewipd > 0);
+						Assert(newipd != NULL);
+
+						/*
+						 * Form new tuple that contains only remaining TIDs.
+						 * Remember this tuple and the offset of the old tuple
+						 * to update it in place.
+						 */
+						remainingoffset[nremaining] = offnum;
+						remaining[nremaining] =
+							BTreeFormPostingTuple(itup, newipd, nnewipd);
+						nremaining++;
+						pfree(newipd);
+
+						Assert(IndexTupleSize(itup) <= BTMaxItemSize(page));
+					}
+				}
+				else
+				{
+					htup = &(itup->t_tid);
+
+					/*
+					 * During Hot Standby we currently assume that
+					 * XLOG_BTREE_VACUUM records do not produce conflicts.
+					 * That is only true as long as the callback function
+					 * depends only upon whether the index tuple refers to
+					 * heap tuples removed in the initial heap scan. When
+					 * vacuum starts it derives a value of OldestXmin.
+					 * Backends taking later snapshots could have a
+					 * RecentGlobalXmin with a later xid than the vacuum's
+					 * OldestXmin, so it is possible that row versions deleted
+					 * after OldestXmin could be marked as killed by other
+					 * backends. The callback function *could* look at the
+					 * index tuple state in isolation and decide to delete the
+					 * index tuple, though currently it does not. If it ever
+					 * did, we would need to reconsider whether
+					 * XLOG_BTREE_VACUUM records should cause conflicts. If
+					 * they did cause conflicts they would be fairly harsh
+					 * conflicts, since we haven't yet worked out a way to
+					 * pass a useful value for latestRemovedXid on the
+					 * XLOG_BTREE_VACUUM records. This applies to *any* type
+					 * of index that marks index tuples as killed.
+					 */
+					if (callback(htup, callback_state))
+						deletable[ndeletable++] = offnum;
+				}
 			}
 		}
 
@@ -1274,7 +1329,7 @@ restart:
 		 * Apply any needed deletes.  We issue just one _bt_delitems_vacuum()
 		 * call per page, so as to minimize WAL traffic.
 		 */
-		if (ndeletable > 0)
+		if (ndeletable > 0 || nremaining > 0)
 		{
 			/*
 			 * Notice that the issued XLOG_BTREE_VACUUM WAL record includes
@@ -1291,6 +1346,7 @@ restart:
 			 * that.
 			 */
 			_bt_delitems_vacuum(rel, buf, deletable, ndeletable,
+								remainingoffset, remaining, nremaining,
 								vstate->lastBlockVacuumed);
 
 			/*
@@ -1376,6 +1432,41 @@ restart:
 }
 
 /*
+ * btreevacuumPosting() -- vacuums a posting tuple.
+ *
+ * Returns new palloc'd posting list with remaining items.
+ * Posting list size is returned via nremaining.
+ *
+ * If all items are dead,
+ * nremaining is 0 and resulting posting list is NULL.
+ */
+static ItemPointer
+btreevacuumPosting(BTVacState *vstate, IndexTuple itup, int *nremaining)
+{
+	int			remaining = 0;
+	int			nitem = BTreeTupleGetNPosting(itup);
+	ItemPointer tmpitems = NULL,
+				items = BTreeTupleGetPosting(itup);
+
+	/*
+	 * Check each tuple in the posting list, save alive tuples into tmpitems
+	 */
+	for (int i = 0; i < nitem; i++)
+	{
+		if (vstate->callback(items + i, vstate->callback_state))
+			continue;
+
+		if (tmpitems == NULL)
+			tmpitems = palloc(sizeof(ItemPointerData) * nitem);
+
+		tmpitems[remaining++] = items[i];
+	}
+
+	*nremaining = remaining;
+	return tmpitems;
+}
+
+/*
  *	btcanreturn() -- Check whether btree indexes support index-only scans.
  *
  * btrees always do, so this is trivial.
diff --git a/src/backend/access/nbtree/nbtsearch.c b/src/backend/access/nbtree/nbtsearch.c
index 8e51246..c78c8e6 100644
--- a/src/backend/access/nbtree/nbtsearch.c
+++ b/src/backend/access/nbtree/nbtsearch.c
@@ -26,10 +26,18 @@
 
 static void _bt_drop_lock_and_maybe_pin(IndexScanDesc scan, BTScanPos sp);
 static OffsetNumber _bt_binsrch(Relation rel, BTScanInsert key, Buffer buf);
+static int	_bt_binsrch_posting(BTScanInsert key, Page page,
+								OffsetNumber offnum);
 static bool _bt_readpage(IndexScanDesc scan, ScanDirection dir,
 						 OffsetNumber offnum);
 static void _bt_saveitem(BTScanOpaque so, int itemIndex,
 						 OffsetNumber offnum, IndexTuple itup);
+static void _bt_setuppostingitems(BTScanOpaque so, int itemIndex,
+								  OffsetNumber offnum, ItemPointer iptr,
+								  IndexTuple itup);
+static inline void _bt_savepostingitem(BTScanOpaque so, int itemIndex,
+									   OffsetNumber offnum, ItemPointer iptr,
+									   IndexTuple itup);
 static bool _bt_steppage(IndexScanDesc scan, ScanDirection dir);
 static bool _bt_readnextpage(IndexScanDesc scan, BlockNumber blkno, ScanDirection dir);
 static bool _bt_parallel_readpage(IndexScanDesc scan, BlockNumber blkno,
@@ -434,7 +442,10 @@ _bt_binsrch(Relation rel,
  * low) makes bounds invalid.
  *
  * Caller is responsible for invalidating bounds when it modifies the page
- * before calling here a second time.
+ * before calling here a second time, and for dealing with posting list
+ * tuple matches (callers can use insertstate's in_posting_offset field to
+ * determine which existing heap TID will need to be replaced by their
+ * scantid/new heap TID).
  */
 OffsetNumber
 _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
@@ -453,6 +464,7 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 
 	Assert(P_ISLEAF(opaque));
 	Assert(!key->nextkey);
+	Assert(insertstate->in_posting_offset == 0);
 
 	if (!insertstate->bounds_valid)
 	{
@@ -509,6 +521,17 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 			if (result != 0)
 				stricthigh = high;
 		}
+
+		/*
+		 * If tuple at offset located by binary search is a posting list whose
+		 * TID range overlaps with caller's scantid, perform posting list
+		 * binary search to set in_posting_offset for caller.  Caller must
+		 * split the posting list when in_posting_offset is set.  This should
+		 * happen infrequently.
+		 */
+		if (unlikely(result == 0 && key->scantid != NULL))
+			insertstate->in_posting_offset =
+				_bt_binsrch_posting(key, page, mid);
 	}
 
 	/*
@@ -529,6 +552,68 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
 }
 
 /*----------
+ *	_bt_binsrch_posting() -- posting list binary search.
+ *
+ * Returns offset into posting list where caller's scantid belongs.
+ *----------
+ */
+static int
+_bt_binsrch_posting(BTScanInsert key, Page page, OffsetNumber offnum)
+{
+	IndexTuple	itup;
+	ItemId		itemid;
+	int			low,
+				high,
+				mid,
+				res;
+
+	/*
+	 * If this isn't a posting tuple, then the index must be corrupt (if it is
+	 * an ordinary non-pivot tuple then there must be an existing tuple with a
+	 * heap TID that equals inserter's new heap TID/scantid).  Defensively
+	 * check that tuple is a posting list tuple whose posting list range
+	 * includes caller's scantid.
+	 *
+	 * (This is also needed because contrib/amcheck's rootdescend option needs
+	 * to be able to relocate a non-pivot tuple using _bt_binsrch_insert().)
+	 */
+	Assert(P_ISLEAF((BTPageOpaque) PageGetSpecialPointer(page)));
+	Assert(!key->nextkey);
+	Assert(key->scantid != NULL);
+	itemid = PageGetItemId(page, offnum);
+	itup = (IndexTuple) PageGetItem(page, itemid);
+	if (!BTreeTupleIsPosting(itup))
+		return 0;
+
+	/*
+	 * In the unlikely event that posting list tuple has LP_DEAD bit set,
+	 * signal to caller that it should kill the item and restart its binary
+	 * search.
+	 */
+	if (ItemIdIsDead(itemid))
+		return -1;
+
+	/* "high" is past end of posting list for loop invariant */
+	low = 0;
+	high = BTreeTupleGetNPosting(itup);
+	Assert(high >= 2);
+
+	while (high > low)
+	{
+		mid = low + ((high - low) / 2);
+		res = ItemPointerCompare(key->scantid,
+								 BTreeTupleGetPostingN(itup, mid));
+
+		if (res >= 1)
+			low = mid + 1;
+		else
+			high = mid;
+	}
+
+	return low;
+}
+
+/*----------
  *	_bt_compare() -- Compare insertion-type scankey to tuple on a page.
  *
  *	page/offnum: location of btree item to be compared to.
@@ -537,9 +622,18 @@ _bt_binsrch_insert(Relation rel, BTInsertState insertstate)
  *			<0 if scankey < tuple at offnum;
  *			 0 if scankey == tuple at offnum;
  *			>0 if scankey > tuple at offnum.
- *		NULLs in the keys are treated as sortable values.  Therefore
- *		"equality" does not necessarily mean that the item should be
- *		returned to the caller as a matching key!
+ *
+ * NULLs in the keys are treated as sortable values.  Therefore
+ * "equality" does not necessarily mean that the item should be returned
+ * to the caller as a matching key.  Similarly, an insertion scankey
+ * with its scantid set is treated as equal to a posting tuple whose TID
+ * range overlaps with their scantid.  There generally won't be a
+ * matching TID in the posting tuple, which caller must handle
+ * themselves (e.g., by splitting the posting list tuple).
+ *
+ * It is generally guaranteed that any possible scankey with scantid set
+ * will have zero or one tuples in the index that are considered equal
+ * here.
  *
  * CRUCIAL NOTE: on a non-leaf page, the first data key is assumed to be
  * "minus infinity": this routine will always claim it is less than the
@@ -563,6 +657,7 @@ _bt_compare(Relation rel,
 	ScanKey		scankey;
 	int			ncmpkey;
 	int			ntupatts;
+	int32		result;
 
 	Assert(_bt_check_natts(rel, key->heapkeyspace, page, offnum));
 	Assert(key->keysz <= IndexRelationGetNumberOfKeyAttributes(rel));
@@ -597,7 +692,6 @@ _bt_compare(Relation rel,
 	{
 		Datum		datum;
 		bool		isNull;
-		int32		result;
 
 		datum = index_getattr(itup, scankey->sk_attno, itupdesc, &isNull);
 
@@ -713,8 +807,24 @@ _bt_compare(Relation rel,
 	if (heapTid == NULL)
 		return 1;
 
+	/*
+	 * scankey must be treated as equal to a posting list tuple if its scantid
+	 * value falls within the range of the posting list.  In all other cases
+	 * there can only be a single heap TID value, which is compared directly
+	 * as a simple scalar value.
+	 */
 	Assert(ntupatts >= IndexRelationGetNumberOfKeyAttributes(rel));
-	return ItemPointerCompare(key->scantid, heapTid);
+	result = ItemPointerCompare(key->scantid, heapTid);
+	if (!BTreeTupleIsPosting(itup) || result <= 0)
+		return result;
+	else
+	{
+		result = ItemPointerCompare(key->scantid, BTreeTupleGetMaxTID(itup));
+		if (result > 0)
+			return 1;
+	}
+
+	return 0;
 }
 
 /*
@@ -1451,6 +1561,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 	/* initialize tuple workspace to empty */
 	so->currPos.nextTupleOffset = 0;
+	so->currPos.postingTupleOffset = 0;
 
 	/*
 	 * Now that the current page has been made consistent, the macro should be
@@ -1485,8 +1596,30 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (_bt_checkkeys(scan, itup, indnatts, dir, &continuescan))
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				_bt_saveitem(so, itemIndex, offnum, itup);
-				itemIndex++;
+				if (!BTreeTupleIsPosting(itup))
+				{
+					_bt_saveitem(so, itemIndex, offnum, itup);
+					itemIndex++;
+				}
+				else
+				{
+					/*
+					 * Setup state to return posting list, and save first
+					 * "logical" tuple
+					 */
+					_bt_setuppostingitems(so, itemIndex, offnum,
+										  BTreeTupleGetPostingN(itup, 0),
+										  itup);
+					itemIndex++;
+					/* Save additional posting list "logical" tuples */
+					for (int i = 1; i < BTreeTupleGetNPosting(itup); i++)
+					{
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i),
+											itup);
+						itemIndex++;
+					}
+				}
 			}
 			/* When !continuescan, there can't be any more matches, so stop */
 			if (!continuescan)
@@ -1519,7 +1652,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 		if (!continuescan)
 			so->currPos.moreRight = false;
 
-		Assert(itemIndex <= MaxIndexTuplesPerPage);
+		Assert(itemIndex <= MaxPostingIndexTuplesPerPage);
 		so->currPos.firstItem = 0;
 		so->currPos.lastItem = itemIndex - 1;
 		so->currPos.itemIndex = 0;
@@ -1527,7 +1660,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 	else
 	{
 		/* load items[] in descending order */
-		itemIndex = MaxIndexTuplesPerPage;
+		itemIndex = MaxPostingIndexTuplesPerPage;
 
 		offnum = Min(offnum, maxoff);
 
@@ -1569,8 +1702,37 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 			if (passes_quals && tuple_alive)
 			{
 				/* tuple passes all scan key conditions, so remember it */
-				itemIndex--;
-				_bt_saveitem(so, itemIndex, offnum, itup);
+				if (!BTreeTupleIsPosting(itup))
+				{
+					itemIndex--;
+					_bt_saveitem(so, itemIndex, offnum, itup);
+				}
+				else
+				{
+					int			i = BTreeTupleGetNPosting(itup) - 1;
+
+					/*
+					 * Setup state to return posting list, and save last
+					 * "logical" tuple from posting list (since it's the first
+					 * that will be returned to scan).
+					 */
+					itemIndex--;
+					_bt_setuppostingitems(so, itemIndex, offnum,
+										  BTreeTupleGetPostingN(itup, i--),
+										  itup);
+
+					/*
+					 * Return posting list "logical" tuples -- do this in
+					 * descending order, to match overall scan order
+					 */
+					for (; i >= 0; i--)
+					{
+						itemIndex--;
+						_bt_savepostingitem(so, itemIndex, offnum,
+											BTreeTupleGetPostingN(itup, i),
+											itup);
+					}
+				}
 			}
 			if (!continuescan)
 			{
@@ -1584,8 +1746,8 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum)
 
 		Assert(itemIndex >= 0);
 		so->currPos.firstItem = itemIndex;
-		so->currPos.lastItem = MaxIndexTuplesPerPage - 1;
-		so->currPos.itemIndex = MaxIndexTuplesPerPage - 1;
+		so->currPos.lastItem = MaxPostingIndexTuplesPerPage - 1;
+		so->currPos.itemIndex = MaxPostingIndexTuplesPerPage - 1;
 	}
 
 	return (so->currPos.firstItem <= so->currPos.lastItem);
@@ -1598,6 +1760,8 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 {
 	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
 
+	Assert(!BTreeTupleIsPosting(itup));
+
 	currItem->heapTid = itup->t_tid;
 	currItem->indexOffset = offnum;
 	if (so->currTuples)
@@ -1611,6 +1775,61 @@ _bt_saveitem(BTScanOpaque so, int itemIndex,
 }
 
 /*
+ * Setup state to save posting items from a single posting list tuple.  Saves
+ * the logical tuple that will be returned to scan first in passing.
+ *
+ * Saves an index item into so->currPos.items[itemIndex] for logical tuple
+ * that is returned to scan first.  Second or subsequent heap TID for posting
+ * list should be saved by calling _bt_savepostingitem().
+ */
+static void
+_bt_setuppostingitems(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					  ItemPointer iptr, IndexTuple itup)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *iptr;
+	currItem->indexOffset = offnum;
+
+	if (so->currTuples)
+	{
+		/* Save a truncated version of the IndexTuple */
+		Size		itupsz = BTreeTupleGetPostingOffset(itup);
+
+		itupsz = MAXALIGN(itupsz);
+		currItem->tupleOffset = so->currPos.nextTupleOffset;
+		memcpy(so->currTuples + so->currPos.nextTupleOffset, itup, itupsz);
+		so->currPos.nextTupleOffset += itupsz;
+		so->currPos.postingTupleOffset = currItem->tupleOffset;
+	}
+}
+
+/*
+ * Save an index item into so->currPos.items[itemIndex] for posting tuple.
+ *
+ * Assumes that _bt_setuppostingitems() has already been called for current
+ * posting list tuple.
+ */
+static inline void
+_bt_savepostingitem(BTScanOpaque so, int itemIndex, OffsetNumber offnum,
+					ItemPointer iptr, IndexTuple itup)
+{
+	BTScanPosItem *currItem = &so->currPos.items[itemIndex];
+
+	currItem->heapTid = *iptr;
+	currItem->indexOffset = offnum;
+
+	if (so->currTuples)
+	{
+		/*
+		 * Have index-only scans return the same truncated IndexTuple for
+		 * every logical tuple that originates from the same posting list
+		 */
+		currItem->tupleOffset = so->currPos.postingTupleOffset;
+	}
+}
+
+/*
  *	_bt_steppage() -- Step to next page containing valid data for scan
  *
  * On entry, if so->currPos.buf is valid the buffer is pinned but not locked;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index ab19692..4198770 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -288,6 +288,8 @@ static void _bt_sortaddtup(Page page, Size itemsize,
 static void _bt_buildadd(BTWriteState *wstate, BTPageState *state,
 						 IndexTuple itup);
 static void _bt_uppershutdown(BTWriteState *wstate, BTPageState *state);
+static void _bt_buildadd_posting(BTWriteState *wstate, BTPageState *state,
+								 BTDedupState *dedupState);
 static void _bt_load(BTWriteState *wstate,
 					 BTSpool *btspool, BTSpool *btspool2);
 static void _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent,
@@ -830,6 +832,8 @@ _bt_sortaddtup(Page page,
  * the high key is to be truncated, offset 1 is deleted, and we insert
  * the truncated high key at offset 1.
  *
+ * Note that itup may be a posting list tuple.
+ *
  * 'last' pointer indicates the last offset added to the page.
  *----------
  */
@@ -963,6 +967,11 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 			 * Overwrite the old item with new truncated high key directly.
 			 * oitup is already located at the physical beginning of tuple
 			 * space, so this should directly reuse the existing tuple space.
+			 *
+			 * If lastleft tuple was a posting tuple, we'll truncate its
+			 * posting list in _bt_truncate as well. Note that it is also
+			 * applicable only to leaf pages, since internal pages never
+			 * contain posting tuples.
 			 */
 			ii = PageGetItemId(opage, OffsetNumberPrev(last_off));
 			lastleft = (IndexTuple) PageGetItem(opage, ii);
@@ -1002,6 +1011,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		 * the minimum key for the new page.
 		 */
 		state->btps_minkey = CopyIndexTuple(oitup);
+		Assert(BTreeTupleIsPivot(state->btps_minkey));
 
 		/*
 		 * Set the sibling links for both pages.
@@ -1043,6 +1053,7 @@ _bt_buildadd(BTWriteState *wstate, BTPageState *state, IndexTuple itup)
 		Assert(state->btps_minkey == NULL);
 		state->btps_minkey = CopyIndexTuple(itup);
 		/* _bt_sortaddtup() will perform full truncation later */
+		BTreeTupleClearBtIsPosting(state->btps_minkey);
 		BTreeTupleSetNAtts(state->btps_minkey, 0);
 	}
 
@@ -1128,6 +1139,136 @@ _bt_uppershutdown(BTWriteState *wstate, BTPageState *state)
 }
 
 /*
+ * Add new tuple (posting or non-posting) to the page while building index.
+ */
+static void
+_bt_buildadd_posting(BTWriteState *wstate, BTPageState *state,
+					 BTDedupState *dedupState)
+{
+	IndexTuple	to_insert;
+
+	/* Return, if there is no tuple to insert */
+	if (state == NULL)
+		return;
+
+	if (dedupState->ntuples == 0)
+		to_insert = dedupState->itupprev;
+	else
+	{
+		IndexTuple	postingtuple;
+
+		/* form a tuple with a posting list */
+		postingtuple = BTreeFormPostingTuple(dedupState->itupprev,
+											 dedupState->ipd,
+											 dedupState->ntuples);
+		to_insert = postingtuple;
+		pfree(dedupState->ipd);
+	}
+
+	_bt_buildadd(wstate, state, to_insert);
+
+	if (dedupState->ntuples > 0)
+		pfree(to_insert);
+	dedupState->ntuples = 0;
+}
+
+/*
+ * Save item pointer(s) of itup to the posting list in dedupState.
+ *
+ * 'itup' is current tuple on page, which comes immediately after equal
+ * 'itupprev' tuple stashed in dedup state at the point we're called.
+ *
+ * Helper function for _bt_load() and _bt_dedup_one_page(), called when it
+ * becomes clear that pending itupprev item will be part of a new/pending
+ * posting list, or when a pending/new posting list will contain a new heap
+ * TID from itup.
+ *
+ * Note: caller is responsible for the BTMaxItemSize() check.
+ */
+void
+_bt_stash_item_tid(BTDedupState *dedupState, IndexTuple itup,
+				   OffsetNumber itup_offnum)
+{
+	int			nposting = 0;
+
+	if (dedupState->ntuples == 0)
+	{
+		dedupState->ipd = palloc0(dedupState->maxitemsize);
+
+		/*
+		 * itupprev hasn't had its posting list TIDs copied into ipd yet (must
+		 * have been first on page and/or in new posting list?).  Do so now.
+		 *
+		 * This is delayed because it wasn't initially clear whether or not
+		 * itupprev would be merged with the next tuple, or stay as-is.  By
+		 * now caller compared it against itup and found that it was equal, so
+		 * we can go ahead and add its TIDs.
+		 */
+		if (!BTreeTupleIsPosting(dedupState->itupprev))
+		{
+			memcpy(dedupState->ipd, dedupState->itupprev,
+				   sizeof(ItemPointerData));
+			dedupState->ntuples++;
+		}
+		else
+		{
+			/* if itupprev is posting, add all its TIDs to the posting list */
+			nposting = BTreeTupleGetNPosting(dedupState->itupprev);
+			memcpy(dedupState->ipd,
+				   BTreeTupleGetPosting(dedupState->itupprev),
+				   sizeof(ItemPointerData) * nposting);
+			dedupState->ntuples += nposting;
+		}
+
+		/* Save info about deduplicated items for future xlog record */
+		dedupState->n_intervals++;
+		/* Save offnum of the first item belongin to the group */
+		dedupState->dedup_intervals[dedupState->n_intervals - 1].from = dedupState->itupprev_off;
+		/*
+		 * Update the number of deduplicated items, belonging to this group.
+		 * Count each item just once, no matter if it was posting tuple or not
+		 */
+		dedupState->dedup_intervals[dedupState->n_intervals - 1].ntups++;
+	}
+
+	/*
+	 * Add current tup to ipd for pending posting list for new version of
+	 * page.
+	 */
+	if (!BTreeTupleIsPosting(itup))
+	{
+		memcpy(dedupState->ipd + dedupState->ntuples, itup,
+			   sizeof(ItemPointerData));
+		dedupState->ntuples++;
+	}
+	else
+	{
+		/*
+		 * if tuple is posting, add all its TIDs to the pending list that will
+		 * become new posting list later on
+		 */
+		nposting = BTreeTupleGetNPosting(itup);
+		memcpy(dedupState->ipd + dedupState->ntuples,
+			   BTreeTupleGetPosting(itup),
+			   sizeof(ItemPointerData) * nposting);
+		dedupState->ntuples += nposting;
+	}
+
+	/*
+	 * Update the number of deduplicated items, belonging to this group.
+	 * Count each item just once, no matter if it was posting tuple or not
+	 */
+	dedupState->dedup_intervals[dedupState->n_intervals - 1].ntups++;
+
+	/* TODO just a debug message. delete it in final version of the patch */
+	if (itup_offnum != InvalidOffsetNumber)
+		elog(DEBUG4, "_bt_stash_item_tid. N %d : from %u ntups %u",
+				dedupState->n_intervals,
+				dedupState->dedup_intervals[dedupState->n_intervals - 1].from,
+				dedupState->dedup_intervals[dedupState->n_intervals - 1].ntups);
+}
+
+/*
  * Read tuples in correct sort order from tuplesort, and load them into
  * btree leaves.
  */
@@ -1141,9 +1282,20 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	bool		load1;
 	TupleDesc	tupdes = RelationGetDescr(wstate->index);
 	int			i,
-				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index);
+				keysz = IndexRelationGetNumberOfKeyAttributes(wstate->index),
+				natts = IndexRelationGetNumberOfAttributes(wstate->index);
 	SortSupport sortKeys;
 	int64		tuples_done = 0;
+	bool		deduplicate = false;
+	BTDedupState *dedupState = NULL;
+
+	/*
+	 * Don't use deduplication for indexes with INCLUDEd columns and unique
+	 * indexes
+	 */
+	deduplicate = (IndexRelationGetNumberOfKeyAttributes(wstate->index) ==
+				   IndexRelationGetNumberOfAttributes(wstate->index) &&
+				   !wstate->index->rd_index->indisunique);
 
 	if (merge)
 	{
@@ -1257,19 +1409,88 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
 	}
 	else
 	{
-		/* merge is unnecessary */
-		while ((itup = tuplesort_getindextuple(btspool->sortstate,
-											   true)) != NULL)
+		if (!deduplicate)
 		{
-			/* When we see first tuple, create first index page */
-			if (state == NULL)
-				state = _bt_pagestate(wstate, 0);
+			/* merge is unnecessary */
+			while ((itup = tuplesort_getindextuple(btspool->sortstate,
+												   true)) != NULL)
+			{
+				/* When we see first tuple, create first index page */
+				if (state == NULL)
+					state = _bt_pagestate(wstate, 0);
 
-			_bt_buildadd(wstate, state, itup);
+				_bt_buildadd(wstate, state, itup);
 
-			/* Report progress */
-			pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
-										 ++tuples_done);
+				/* Report progress */
+				pgstat_progress_update_param(PROGRESS_CREATEIDX_TUPLES_DONE,
+											 ++tuples_done);
+			}
+		}
+		else
+		{
+			/* init deduplication state needed to build posting tuples */
+			dedupState = (BTDedupState *) palloc0(sizeof(BTDedupState));
+			dedupState->ipd = NULL;
+			dedupState->ntuples = 0;
+			dedupState->itupprev = NULL;
+			dedupState->maxitemsize = 0;
+			dedupState->maxpostingsize = 0;
+
+			while ((itup = tuplesort_getindextuple(btspool->sortstate,
+												   true)) != NULL)
+			{
+				/* When we see first tuple, create first index page */
+				if (state == NULL)
+				{
+					state = _bt_pagestate(wstate, 0);
+					dedupState->maxitemsize = BTMaxItemSize(state->btps_page);
+				}
+
+				if (dedupState->itupprev != NULL)
+				{
+					int			n_equal_atts = _bt_keep_natts_fast(wstate->index,
+																   dedupState->itupprev, itup);
+
+					if (n_equal_atts > natts)
+					{
+						/*
+						 * Tuples are equal. Create or update posting.
+						 *
+						 * Else If posting is too big, insert it on page and
+						 * continue.
+						 */
+						if ((dedupState->ntuples + 1) * sizeof(ItemPointerData) <
+							dedupState->maxpostingsize)
+							_bt_stash_item_tid(dedupState, itup, InvalidOffsetNumber);
+						else
+							_bt_buildadd_posting(wstate, state, dedupState);
+					}
+					else
+					{
+						/*
+						 * Tuples are not equal. Insert itupprev into index.
+						 * Save current tuple for the next iteration.
+						 */
+						_bt_buildadd_posting(wstate, state, dedupState);
+					}
+				}
+
+				/*
+				 * Save the tuple to compare it with the next one and maybe
+				 * unite them into a posting tuple.
+				 */
+				if (dedupState->itupprev)
+					pfree(dedupState->itupprev);
+				dedupState->itupprev = CopyIndexTuple(itup);
+
+				/* compute max size of posting list */
+				dedupState->maxpostingsize = dedupState->maxitemsize -
+					IndexInfoFindDataOffset(dedupState->itupprev->t_info) -
+					MAXALIGN(IndexTupleSize(dedupState->itupprev));
+			}
+
+			/* Handle the last item */
+			_bt_buildadd_posting(wstate, state, dedupState);
 		}
 	}
 
diff --git a/src/backend/access/nbtree/nbtsplitloc.c b/src/backend/access/nbtree/nbtsplitloc.c
index 1c1029b..54cecc8 100644
--- a/src/backend/access/nbtree/nbtsplitloc.c
+++ b/src/backend/access/nbtree/nbtsplitloc.c
@@ -183,6 +183,9 @@ _bt_findsplitloc(Relation rel,
 	state.minfirstrightsz = SIZE_MAX;
 	state.newitemoff = newitemoff;
 
+	/* newitem cannot be a posting list item */
+	Assert(!BTreeTupleIsPosting(newitem));
+
 	/*
 	 * maxsplits should never exceed maxoff because there will be at most as
 	 * many candidate split points as there are points _between_ tuples, once
@@ -459,17 +462,52 @@ _bt_recsplitloc(FindSplitData *state,
 	int16		leftfree,
 				rightfree;
 	Size		firstrightitemsz;
+	Size		postingsubhikey = 0;
 	bool		newitemisfirstonright;
 
 	/* Is the new item going to be the first item on the right page? */
 	newitemisfirstonright = (firstoldonright == state->newitemoff
 							 && !newitemonleft);
 
+	/*
+	 * FIXME: Accessing every single tuple like this adds cycles to cases that
+	 * cannot possibly benefit (i.e. cases where we know that there cannot be
+	 * posting lists).  Maybe we should add a way to not bother when we are
+	 * certain that this is the case.
+	 *
+	 * We could either have _bt_split() pass us a flag, or invent a page flag
+	 * that indicates that the page might have posting lists, as an
+	 * optimization.  There is no shortage of btpo_flags bits for stuff like
+	 * this.
+	 */
 	if (newitemisfirstonright)
+	{
 		firstrightitemsz = state->newitemsz;
+
+		/* Calculate posting list overhead, if any */
+		if (state->is_leaf && BTreeTupleIsPosting(state->newitem))
+			postingsubhikey = IndexTupleSize(state->newitem) -
+				BTreeTupleGetPostingOffset(state->newitem);
+	}
 	else
+	{
 		firstrightitemsz = firstoldonrightsz;
 
+		/* Calculate posting list overhead, if any */
+		if (state->is_leaf)
+		{
+			ItemId		itemid;
+			IndexTuple	newhighkey;
+
+			itemid = PageGetItemId(state->page, firstoldonright);
+			newhighkey = (IndexTuple) PageGetItem(state->page, itemid);
+
+			if (BTreeTupleIsPosting(newhighkey))
+				postingsubhikey = IndexTupleSize(newhighkey) -
+					BTreeTupleGetPostingOffset(newhighkey);
+		}
+	}
+
 	/* Account for all the old tuples */
 	leftfree = state->leftspace - olddataitemstoleft;
 	rightfree = state->rightspace -
@@ -492,9 +530,13 @@ _bt_recsplitloc(FindSplitData *state,
 	 * adding a heap TID to the left half's new high key when splitting at the
 	 * leaf level.  In practice the new high key will often be smaller and
 	 * will rarely be larger, but conservatively assume the worst case.
+	 * Truncation always truncates away any posting list that appears in the
+	 * first right tuple, though, so it's safe to subtract that overhead
+	 * (while still conservatively assuming that truncation might have to add
+	 * back a single heap TID using the pivot tuple heap TID representation).
 	 */
 	if (state->is_leaf)
-		leftfree -= (int16) (firstrightitemsz +
+		leftfree -= (int16) ((firstrightitemsz - postingsubhikey) +
 							 MAXALIGN(sizeof(ItemPointerData)));
 	else
 		leftfree -= (int16) firstrightitemsz;
@@ -691,7 +733,8 @@ _bt_afternewitemoff(FindSplitData *state, OffsetNumber maxoff,
 	itemid = PageGetItemId(state->page, OffsetNumberPrev(state->newitemoff));
 	tup = (IndexTuple) PageGetItem(state->page, itemid);
 	/* Do cheaper test first */
-	if (!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
+	if (BTreeTupleIsPosting(tup) ||
+		!_bt_adjacenthtid(&tup->t_tid, &state->newitem->t_tid))
 		return false;
 	/* Check same conditions as rightmost item case, too */
 	keepnatts = _bt_keep_natts_fast(state->rel, tup, state->newitem);
diff --git a/src/backend/access/nbtree/nbtutils.c b/src/backend/access/nbtree/nbtutils.c
index 4c7b2d0..e3d7f4f 100644
--- a/src/backend/access/nbtree/nbtutils.c
+++ b/src/backend/access/nbtree/nbtutils.c
@@ -97,8 +97,6 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	indoption = rel->rd_indoption;
 	tupnatts = itup ? BTreeTupleGetNAtts(itup, rel) : 0;
 
-	Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
-
 	/*
 	 * We'll execute search using scan key constructed on key columns.
 	 * Truncated attributes and non-key attributes are omitted from the final
@@ -110,9 +108,20 @@ _bt_mkscankey(Relation rel, IndexTuple itup)
 	key->anynullkeys = false;	/* initial assumption */
 	key->nextkey = false;
 	key->pivotsearch = false;
+	key->scantid = NULL;
 	key->keysz = Min(indnkeyatts, tupnatts);
-	key->scantid = key->heapkeyspace && itup ?
-		BTreeTupleGetHeapTID(itup) : NULL;
+
+	Assert(tupnatts <= IndexRelationGetNumberOfAttributes(rel));
+	Assert(!itup || !BTreeTupleIsPosting(itup) || key->heapkeyspace);
+
+	/*
+	 * When caller passes a tuple with a heap TID, use it to set scantid. Note
+	 * that this handles posting list tuples by setting scantid to the lowest
+	 * heap TID in the posting list.
+	 */
+	if (itup && key->heapkeyspace)
+		key->scantid = BTreeTupleGetHeapTID(itup);
+
 	skey = key->scankeys;
 	for (i = 0; i < indnkeyatts; i++)
 	{
@@ -1786,10 +1795,35 @@ _bt_killitems(IndexScanDesc scan)
 		{
 			ItemId		iid = PageGetItemId(page, offnum);
 			IndexTuple	ituple = (IndexTuple) PageGetItem(page, iid);
+			bool		killtuple = false;
 
-			if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+			if (BTreeTupleIsPosting(ituple))
 			{
-				/* found the item */
+				int			pi = i + 1;
+				int			nposting = BTreeTupleGetNPosting(ituple);
+				int			j;
+
+				for (j = 0; j < nposting; j++)
+				{
+					ItemPointer item = BTreeTupleGetPostingN(ituple, j);
+
+					if (!ItemPointerEquals(item, &kitem->heapTid))
+						break;	/* out of posting list loop */
+
+					/* Read-ahead to later kitems */
+					if (pi < numKilled)
+						kitem = &so->currPos.items[so->killedItems[pi++]];
+				}
+
+				if (j == nposting)
+					killtuple = true;
+			}
+			else if (ItemPointerEquals(&ituple->t_tid, &kitem->heapTid))
+				killtuple = true;
+
+			if (killtuple)
+			{
+				/* found the item/all posting list items */
 				ItemIdMarkDead(iid);
 				killedsomething = true;
 				break;			/* out of inner search loop */
@@ -2145,6 +2179,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 
 		pivot = index_truncate_tuple(itupdesc, firstright, keepnatts);
 
+		if (BTreeTupleIsPosting(firstright))
+		{
+			BTreeTupleClearBtIsPosting(pivot);
+			BTreeTupleSetNAtts(pivot, keepnatts);
+			if (keepnatts == natts)
+			{
+				/*
+				 * index_truncate_tuple() just returned a copy of the
+				 * original, so make sure that the size of the new pivot tuple
+				 * doesn't have posting list overhead
+				 */
+				pivot->t_info &= ~INDEX_SIZE_MASK;
+				pivot->t_info |= MAXALIGN(BTreeTupleGetPostingOffset(firstright));
+			}
+		}
+
+		Assert(!BTreeTupleIsPosting(pivot));
+
 		/*
 		 * If there is a distinguishing key attribute within new pivot tuple,
 		 * there is no need to add an explicit heap TID attribute
@@ -2161,6 +2213,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		 * attribute to the new pivot tuple.
 		 */
 		Assert(natts != nkeyatts);
+		Assert(!BTreeTupleIsPosting(lastleft) &&
+			   !BTreeTupleIsPosting(firstright));
 		newsize = IndexTupleSize(pivot) + MAXALIGN(sizeof(ItemPointerData));
 		tidpivot = palloc0(newsize);
 		memcpy(tidpivot, pivot, IndexTupleSize(pivot));
@@ -2168,6 +2222,24 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		pfree(pivot);
 		pivot = tidpivot;
 	}
+	else if (BTreeTupleIsPosting(firstright))
+	{
+		/*
+		 * No truncation was possible, since key attributes are all equal.  We
+		 * can always truncate away a posting list, though.
+		 *
+		 * It's necessary to add a heap TID attribute to the new pivot tuple.
+		 */
+		newsize = MAXALIGN(BTreeTupleGetPostingOffset(firstright)) +
+			MAXALIGN(sizeof(ItemPointerData));
+		pivot = palloc0(newsize);
+		memcpy(pivot, firstright, BTreeTupleGetPostingOffset(firstright));
+
+		pivot->t_info &= ~INDEX_SIZE_MASK;
+		pivot->t_info |= newsize;
+		BTreeTupleClearBtIsPosting(pivot);
+		BTreeTupleSetAltHeapTID(pivot);
+	}
 	else
 	{
 		/*
@@ -2175,7 +2247,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 		 * It's necessary to add a heap TID attribute to the new pivot tuple.
 		 */
 		Assert(natts == nkeyatts);
-		newsize = IndexTupleSize(firstright) + MAXALIGN(sizeof(ItemPointerData));
+		newsize = MAXALIGN(IndexTupleSize(firstright)) +
+			MAXALIGN(sizeof(ItemPointerData));
 		pivot = palloc0(newsize);
 		memcpy(pivot, firstright, IndexTupleSize(firstright));
 	}
@@ -2193,6 +2266,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * nbtree (e.g., there is no pg_attribute entry).
 	 */
 	Assert(itup_key->heapkeyspace);
+	Assert(!BTreeTupleIsPosting(pivot));
 	pivot->t_info &= ~INDEX_SIZE_MASK;
 	pivot->t_info |= newsize;
 
@@ -2205,7 +2279,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	pivotheaptid = (ItemPointer) ((char *) pivot + newsize -
 								  sizeof(ItemPointerData));
-	ItemPointerCopy(&lastleft->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetMaxTID(lastleft), pivotheaptid);
 
 	/*
 	 * Lehman and Yao require that the downlink to the right page, which is to
@@ -2216,9 +2290,12 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * tiebreaker.
 	 */
 #ifndef DEBUG_NO_TRUNCATE
-	Assert(ItemPointerCompare(&lastleft->t_tid, &firstright->t_tid) < 0);
-	Assert(ItemPointerCompare(pivotheaptid, &lastleft->t_tid) >= 0);
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(BTreeTupleGetMaxTID(lastleft),
+							  BTreeTupleGetHeapTID(firstright)) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(lastleft)) >= 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #else
 
 	/*
@@ -2231,7 +2308,7 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 * attribute values along with lastleft's heap TID value when lastleft's
 	 * TID happens to be greater than firstright's TID.
 	 */
-	ItemPointerCopy(&firstright->t_tid, pivotheaptid);
+	ItemPointerCopy(BTreeTupleGetHeapTID(firstright), pivotheaptid);
 
 	/*
 	 * Pivot heap TID should never be fully equal to firstright.  Note that
@@ -2240,7 +2317,8 @@ _bt_truncate(Relation rel, IndexTuple lastleft, IndexTuple firstright,
 	 */
 	ItemPointerSetOffsetNumber(pivotheaptid,
 							   OffsetNumberPrev(ItemPointerGetOffsetNumber(pivotheaptid)));
-	Assert(ItemPointerCompare(pivotheaptid, &firstright->t_tid) < 0);
+	Assert(ItemPointerCompare(pivotheaptid,
+							  BTreeTupleGetHeapTID(firstright)) < 0);
 #endif
 
 	BTreeTupleSetNAtts(pivot, nkeyatts);
@@ -2321,15 +2399,25 @@ _bt_keep_natts(Relation rel, IndexTuple lastleft, IndexTuple firstright,
  * The approach taken here usually provides the same answer as _bt_keep_natts
  * will (for the same pair of tuples from a heapkeyspace index), since the
  * majority of btree opclasses can never indicate that two datums are equal
- * unless they're bitwise equal (once detoasted).  Similarly, result may
- * differ from the _bt_keep_natts result when either tuple has TOASTed datums,
- * though this is barely possible in practice.
+ * unless they're bitwise equal after detoasting.
  *
  * These issues must be acceptable to callers, typically because they're only
  * concerned about making suffix truncation as effective as possible without
  * leaving excessive amounts of free space on either side of page split.
  * Callers can rely on the fact that attributes considered equal here are
  * definitely also equal according to _bt_keep_natts.
+ *
+ * When an index only uses opclasses where equality is "precise", this
+ * function is guaranteed to give the same result as _bt_keep_natts().  This
+ * makes it safe to use this function to determine whether or not two tuples
+ * can be folded together into a single posting tuple.  Posting list
+ * deduplication cannot be used with nondeterministic collations for this
+ * reason.
+ *
+ * FIXME: Actually invent the needed "equality-is-precise" opclass
+ * infrastructure.  See dedicated -hackers thread:
+ *
+ * https://postgr.es/m/cah2-wzn3ee49gmxb7v1vj3-ac8fwn-fr8pfwqebhe8ryrxt...@mail.gmail.com
  */
 int
 _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
@@ -2354,8 +2442,38 @@ _bt_keep_natts_fast(Relation rel, IndexTuple lastleft, IndexTuple firstright)
 		if (isNull1 != isNull2)
 			break;
 
+		/*
+		 * XXX: The ideal outcome from the point of view of the posting list
+		 * patch is that the definition of an opclass with "precise equality"
+		 * becomes: "equality operator function must give exactly the same
+		 * answer as datum_image_eq() would, provided that we aren't using a
+		 * nondeterministic collation". (Nondeterministic collations are
+		 * clearly not compatible with deduplication.)
+		 *
+		 * This will be a lot faster than actually using the authoritative
+		 * insertion scankey in some cases.  This approach also seems more
+		 * elegant, since suffix truncation gets to follow exactly the same
+		 * definition of "equal" as posting list deduplication -- there is a
+		 * subtle interplay between deduplication and suffix truncation, and
+		 * it would be nice to know for sure that they have exactly the same
+		 * idea about what equality is.
+		 *
+		 * This ideal outcome still avoids problems with TOAST.  We cannot
+		 * repeat bugs like the amcheck bug that was fixed in bugfix commit
+		 * eba775345d23d2c999bbb412ae658b6dab36e3e8.  datum_image_eq()
+		 * considers binary equality, though only _after_ each datum is
+		 * decompressed.
+		 *
+		 * If this ideal solution isn't possible, then we can fall back on
+		 * defining "precise equality" as: "type's output function must
+		 * produce identical textual output for any two datums that compare
+		 * equal when using a safe/equality-is-precise operator class (unless
+		 * using a nondeterministic collation)".  That would mean that we'd
+		 * have to make deduplication call _bt_keep_natts() instead (or some
+		 * other function that uses authoritative insertion scankey).
+		 */
 		if (!isNull1 &&
-			!datumIsEqual(datum1, datum2, att->attbyval, att->attlen))
+			!datum_image_eq(datum1, datum2, att->attbyval, att->attlen))
 			break;
 
 		keepnatts++;
@@ -2407,22 +2525,30 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum));
 	tupnatts = BTreeTupleGetNAtts(itup, rel);
 
+	/* !heapkeyspace indexes do not support deduplication */
+	if (!heapkeyspace && BTreeTupleIsPosting(itup))
+		return false;
+
+	/* INCLUDE indexes do not support deduplication */
+	if (natts != nkeyatts && BTreeTupleIsPosting(itup))
+		return false;
+
 	if (P_ISLEAF(opaque))
 	{
 		if (offnum >= P_FIRSTDATAKEY(opaque))
 		{
 			/*
-			 * Non-pivot tuples currently never use alternative heap TID
-			 * representation -- even those within heapkeyspace indexes
+			 * Non-pivot tuple should never be explicitly marked as a pivot
+			 * tuple
 			 */
-			if ((itup->t_info & INDEX_ALT_TID_MASK) != 0)
+			if (BTreeTupleIsPivot(itup))
 				return false;
 
 			/*
 			 * Leaf tuples that are not the page high key (non-pivot tuples)
 			 * should never be truncated.  (Note that tupnatts must have been
-			 * inferred, rather than coming from an explicit on-disk
-			 * representation.)
+			 * inferred, even with a posting list tuple, because only pivot
+			 * tuples store tupnatts directly.)
 			 */
 			return tupnatts == natts;
 		}
@@ -2466,12 +2592,12 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 			 * non-zero, or when there is no explicit representation and the
 			 * tuple is evidently not a pre-pg_upgrade tuple.
 			 *
-			 * Prior to v11, downlinks always had P_HIKEY as their offset. Use
-			 * that to decide if the tuple is a pre-v11 tuple.
+			 * Prior to v11, downlinks always had P_HIKEY as their offset.
+			 * Accept that as an alternative indication of a valid
+			 * !heapkeyspace negative infinity tuple.
 			 */
 			return tupnatts == 0 ||
-				((itup->t_info & INDEX_ALT_TID_MASK) == 0 &&
-				 ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY);
+				ItemPointerGetOffsetNumber(&(itup->t_tid)) == P_HIKEY;
 		}
 		else
 		{
@@ -2497,7 +2623,11 @@ _bt_check_natts(Relation rel, bool heapkeyspace, Page page, OffsetNumber offnum)
 	 * heapkeyspace index pivot tuples, regardless of whether or not there are
 	 * non-key attributes.
 	 */
-	if ((itup->t_info & INDEX_ALT_TID_MASK) == 0)
+	if (!BTreeTupleIsPivot(itup))
+		return false;
+
+	/* Pivot tuple should not use posting list representation (redundant) */
+	if (BTreeTupleIsPosting(itup))
 		return false;
 
 	/*
@@ -2567,11 +2697,87 @@ _bt_check_third_page(Relation rel, Relation heap, bool needheaptidspace,
 					BTMaxItemSizeNoHeapTid(page),
 					RelationGetRelationName(rel)),
 			 errdetail("Index row references tuple (%u,%u) in relation \"%s\".",
-					   ItemPointerGetBlockNumber(&newtup->t_tid),
-					   ItemPointerGetOffsetNumber(&newtup->t_tid),
+					   ItemPointerGetBlockNumber(BTreeTupleGetHeapTID(newtup)),
+					   ItemPointerGetOffsetNumber(BTreeTupleGetHeapTID(newtup)),
 					   RelationGetRelationName(heap)),
 			 errhint("Values larger than 1/3 of a buffer page cannot be indexed.\n"
 					 "Consider a function index of an MD5 hash of the value, "
 					 "or use full text indexing."),
 			 errtableconstraint(heap, RelationGetRelationName(rel))));
 }
+
+/*
+ * Given a basic tuple that contains key datum and posting list,
+ * build a posting tuple.
+ *
+ * Basic tuple can be a posting tuple, but we only use key part of it,
+ * all ItemPointers must be passed via ipd.
+ *
+ * If nipd == 1 fallback to building a non-posting tuple.
+ * It is necessary to avoid storage overhead after posting tuple was vacuumed.
+ */
+IndexTuple
+BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd, int nipd)
+{
+	uint32		keysize,
+				newsize = 0;
+	IndexTuple	itup;
+
+	/* We only need key part of the tuple */
+	if (BTreeTupleIsPosting(tuple))
+		keysize = BTreeTupleGetPostingOffset(tuple);
+	else
+		keysize = IndexTupleSize(tuple);
+
+	Assert(nipd > 0);
+
+	/* Add space needed for posting list */
+	if (nipd > 1)
+		newsize = SHORTALIGN(keysize) + sizeof(ItemPointerData) * nipd;
+	else
+		newsize = keysize;
+
+	newsize = MAXALIGN(newsize);
+	itup = palloc0(newsize);
+	memcpy(itup, tuple, keysize);
+	itup->t_info &= ~INDEX_SIZE_MASK;
+	itup->t_info |= newsize;
+
+	if (nipd > 1)
+	{
+		/* Form posting tuple, fill posting fields */
+
+		/* Set meta info about the posting list */
+		itup->t_info |= INDEX_ALT_TID_MASK;
+		BTreeSetPostingMeta(itup, nipd, SHORTALIGN(keysize));
+
+		/* sort the list to preserve TID order invariant */
+		qsort((void *) ipd, nipd, sizeof(ItemPointerData),
+			  (int (*) (const void *, const void *)) ItemPointerCompare);
+
+		/* Copy posting list into the posting tuple */
+		memcpy(BTreeTupleGetPosting(itup), ipd,
+			   sizeof(ItemPointerData) * nipd);
+	}
+	else
+	{
+		/* To finish building of a non-posting tuple, copy TID from ipd */
+		itup->t_info &= ~INDEX_ALT_TID_MASK;
+		ItemPointerCopy(ipd, &itup->t_tid);
+	}
+
+	return itup;
+}
+
+/*
+ * Opposite of BTreeFormPostingTuple.
+ * returns regular tuple that contains the key,
+ * the tid of the new tuple is the nth tid of original tuple's posting list
+ * result tuple palloc'd in a caller's context.
+ */
+IndexTuple
+BTreeGetNthTupleOfPosting(IndexTuple tuple, int n)
+{
+	Assert(BTreeTupleIsPosting(tuple));
+	return BTreeFormPostingTuple(tuple, BTreeTupleGetPostingN(tuple, n), 1);
+}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index dd5315c..98ce964 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -181,9 +181,35 @@ btree_xlog_insert(bool isleaf, bool ismeta, XLogReaderState *record)
 
 		page = BufferGetPage(buffer);
 
-		if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
-						false, false) == InvalidOffsetNumber)
-			elog(PANIC, "btree_xlog_insert: failed to add item");
+		if (xlrec->in_posting_offset != InvalidOffsetNumber)
+		{
+			/* oposting must be at offset before new item */
+			ItemId		itemid = PageGetItemId(page, OffsetNumberPrev(xlrec->offnum));
+			IndexTuple oposting = (IndexTuple) PageGetItem(page, itemid);
+			IndexTuple newitem = (IndexTuple) datapos;
+			IndexTuple nposting;
+
+			nposting = _bt_form_newposting(newitem, oposting,
+										   xlrec->in_posting_offset);
+			Assert(isleaf);
+
+			Assert(MAXALIGN(IndexTupleSize(oposting)) ==
+				   MAXALIGN(IndexTupleSize(nposting)));
+
+			/* replace existing posting */
+			memcpy(oposting, nposting, MAXALIGN(IndexTupleSize(nposting)));
+
+			/* insert new item */
+			if (PageAddItem(page, (Item) newitem, MAXALIGN(IndexTupleSize(newitem)),
+							xlrec->offnum, false, false) == InvalidOffsetNumber)
+				elog(PANIC, "btree_xlog_insert: failed to add item");
+		}
+		else
+		{
+			if (PageAddItem(page, (Item) datapos, datalen, xlrec->offnum,
+							false, false) == InvalidOffsetNumber)
+				elog(PANIC, "btree_xlog_insert: failed to add item");
+		}
 
 		PageSetLSN(page, lsn);
 		MarkBufferDirty(buffer);
@@ -265,20 +291,45 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 		BTPageOpaque lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
 		OffsetNumber off;
 		IndexTuple	newitem = NULL,
-					left_hikey = NULL;
+					left_hikey = NULL,
+					nposting = NULL;
 		Size		newitemsz = 0,
 					left_hikeysz = 0;
 		Page		newlpage;
-		OffsetNumber leftoff;
+		OffsetNumber leftoff,
+					 replacepostingoff = InvalidOffsetNumber;
 
 		datapos = XLogRecGetBlockData(record, 0, &datalen);
 
-		if (onleft)
+		if (onleft || xlrec->in_posting_offset)
 		{
 			newitem = (IndexTuple) datapos;
 			newitemsz = MAXALIGN(IndexTupleSize(newitem));
 			datapos += newitemsz;
 			datalen -= newitemsz;
+
+			/*
+			 * Repeat logic implemented in _bt_insertonpg():
+			 *
+			 * If the new tuple is a duplicate with a heap TID that falls
+			 * inside the range of an existing posting list tuple,
+			 * generate new posting tuple to replace original one
+			 * and update new tuple so that it's heap TID contains
+			 * the rightmost heap TID of original posting tuple.
+			 */
+			if (xlrec->in_posting_offset != 0)
+			{
+				ItemId		itemid = PageGetItemId(lpage, OffsetNumberPrev(xlrec->newitemoff));
+				IndexTuple oposting = (IndexTuple) PageGetItem(lpage, itemid);
+
+				nposting = _bt_form_newposting(newitem, oposting,
+											xlrec->in_posting_offset);
+
+				/* Alter new item offset, since effective new item changed */
+				replacepostingoff = OffsetNumberPrev(xlrec->newitemoff);
+
+				Assert(BTreeTupleGetNPosting(nposting) == BTreeTupleGetNPosting(oposting));
+			}
 		}
 
 		/* Extract left hikey and its size (assuming 16-bit alignment) */
@@ -304,6 +355,15 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 			Size		itemsz;
 			IndexTuple	item;
 
+			if (off == replacepostingoff)
+			{
+				if (PageAddItem(newlpage, (Item) nposting, MAXALIGN(IndexTupleSize(nposting)),
+								leftoff, false, false) == InvalidOffsetNumber)
+					elog(ERROR, "failed to add new item to left page after split");
+				leftoff = OffsetNumberNext(leftoff);
+				continue;
+			}
+
 			/* add the new item if it was inserted on left page */
 			if (onleft && off == xlrec->newitemoff)
 			{
@@ -380,14 +440,146 @@ btree_xlog_split(bool onleft, XLogReaderState *record)
 }
 
 static void
+btree_xlog_dedup(XLogReaderState *record)
+{
+	XLogRecPtr	lsn = record->EndRecPtr;
+	Buffer		buf;
+	Page		newpage;
+	xl_btree_dedup *xlrec = (xl_btree_dedup *) XLogRecGetData(record);
+
+	if (XLogReadBufferForRedo(record, 0, &buf) == BLK_NEEDS_REDO)
+	{
+		/*
+		 * Initialize a temporary empty page and copy all the items
+		 * to that in item number order.
+		 */
+		Page		page = (Page) BufferGetPage(buf);
+		BTPageOpaque opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+		BTPageOpaque nopaque;
+		OffsetNumber offnum, minoff, maxoff;
+		BTDedupState *dedupState = NULL;
+		char *data = ((char *) xlrec + SizeOfBtreeDedup);
+		dedupInterval dedup_intervals[MaxOffsetNumber];
+		int			 nth_interval = 0;
+		OffsetNumber n_dedup_tups = 0;
+
+		dedupState = (BTDedupState *) palloc0(sizeof(BTDedupState));
+		dedupState->ipd = NULL;
+		dedupState->ntuples = 0;
+		dedupState->itupprev = NULL;
+		dedupState->maxitemsize = BTMaxItemSize(page);
+		dedupState->maxpostingsize = 0;
+
+		memcpy(dedup_intervals, data,
+			   xlrec->n_intervals*sizeof(dedupInterval));
+
+		/* Scan over all items to see which ones can be deduplicated */
+		minoff = P_FIRSTDATAKEY(opaque);
+		maxoff = PageGetMaxOffsetNumber(page);
+		newpage = PageGetTempPageCopySpecial(page);
+		nopaque = (BTPageOpaque) PageGetSpecialPointer(newpage);
+
+		/* Make sure that new page won't have garbage flag set */
+		nopaque->btpo_flags &= ~BTP_HAS_GARBAGE;
+
+		/* Copy High Key if any */
+		if (!P_RIGHTMOST(opaque))
+		{
+			ItemId		itemid = PageGetItemId(page, P_HIKEY);
+			Size		itemsz = ItemIdGetLength(itemid);
+			IndexTuple	item = (IndexTuple) PageGetItem(page, itemid);
+
+			if (PageAddItem(newpage, (Item) item, itemsz, P_HIKEY,
+							false, false) == InvalidOffsetNumber)
+				elog(ERROR, "failed to add highkey during deduplication");
+		}
+
+		/*
+		* Iterate over tuples on the page to deduplicate them into posting
+		* lists and insert into new page
+		*/
+		for (offnum = minoff;
+			 offnum <= maxoff;
+			 offnum = OffsetNumberNext(offnum))
+		{
+			ItemId		itemId = PageGetItemId(page, offnum);
+			IndexTuple	itup = (IndexTuple) PageGetItem(page, itemId);
+
+			elog(DEBUG4, "btree_xlog_dedup. offnum %u, n_intervals %u, from %u ntups %u",
+						offnum,
+						nth_interval,
+						dedup_intervals[nth_interval].from,
+						dedup_intervals[nth_interval].ntups);
+
+			if (dedupState->itupprev == NULL)
+			{
+				/* Just set up base/first item in first iteration */
+				Assert(offnum == minoff);
+				dedupState->itupprev = CopyIndexTuple(itup);
+				dedupState->itupprev_off = offnum;
+				continue;
+			}
+
+			/*
+			 * Instead of comparing tuple's keys, which may be costly, use
+			 * information from xlog record. If current tuple belongs to the
+			 * group of deduplicated items, repeat logic of _bt_dedup_one_page
+			 * and stash it to form a posting list afterwards.
+			 */
+			if (dedupState->itupprev_off >= dedup_intervals[nth_interval].from
+				&& n_dedup_tups < dedup_intervals[nth_interval].ntups)
+			{
+				_bt_stash_item_tid(dedupState, itup, InvalidOffsetNumber);
+
+				elog(DEBUG4, "btree_xlog_dedup. stash offnum %u, nth_interval %u, from %u ntups %u",
+						offnum,
+						nth_interval,
+						dedup_intervals[nth_interval].from,
+						dedup_intervals[nth_interval].ntups);
+
+				/* count first tuple in the group */
+				if (dedupState->itupprev_off == dedup_intervals[nth_interval].from)
+					n_dedup_tups++;
+
+				/* count added tuple */
+				n_dedup_tups++;
+			}
+			else
+			{
+				_bt_dedup_insert(newpage, dedupState);
+
+				/* reset state */
+				if (n_dedup_tups > 0)
+					nth_interval++;
+				n_dedup_tups = 0;
+			}
+
+			pfree(dedupState->itupprev);
+			dedupState->itupprev = CopyIndexTuple(itup);
+			dedupState->itupprev_off = offnum;
+		}
+
+		/* Handle the last item */
+		_bt_dedup_insert(newpage, dedupState);
+
+		PageRestoreTempPage(newpage, page);
+		PageSetLSN(page, lsn);
+		MarkBufferDirty(buf);
+	}
+
+	if (BufferIsValid(buf))
+		UnlockReleaseBuffer(buf);
+}
+
+static void
 btree_xlog_vacuum(XLogReaderState *record)
 {
 	XLogRecPtr	lsn = record->EndRecPtr;
 	Buffer		buffer;
 	Page		page;
 	BTPageOpaque opaque;
-#ifdef UNUSED
 	xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
+#ifdef UNUSED
 
 	/*
 	 * This section of code is thought to be no longer needed, after analysis
@@ -478,14 +670,34 @@ btree_xlog_vacuum(XLogReaderState *record)
 
 		if (len > 0)
 		{
-			OffsetNumber *unused;
-			OffsetNumber *unend;
+			if (xlrec->nremaining)
+			{
+				OffsetNumber *remainingoffset;
+				IndexTuple	remaining;
+				Size		itemsz;
+
+				remainingoffset = (OffsetNumber *)
+					(ptr + xlrec->ndeleted * sizeof(OffsetNumber));
+				remaining = (IndexTuple) ((char *) remainingoffset +
+										  xlrec->nremaining * sizeof(OffsetNumber));
+
+				/* Handle posting tuples */
+				for (int i = 0; i < xlrec->nremaining; i++)
+				{
+					PageIndexTupleDelete(page, remainingoffset[i]);
 
-			unused = (OffsetNumber *) ptr;
-			unend = (OffsetNumber *) ((char *) ptr + len);
+					itemsz = MAXALIGN(IndexTupleSize(remaining));
 
-			if ((unend - unused) > 0)
-				PageIndexMultiDelete(page, unused, unend - unused);
+					if (PageAddItem(page, (Item) remaining, itemsz, remainingoffset[i],
+									false, false) == InvalidOffsetNumber)
+						elog(PANIC, "btree_xlog_vacuum: failed to add remaining item");
+
+					remaining = (IndexTuple) ((char *) remaining + itemsz);
+				}
+			}
+
+			if (xlrec->ndeleted)
+				PageIndexMultiDelete(page, (OffsetNumber *) ptr, xlrec->ndeleted);
 		}
 
 		/*
@@ -838,6 +1050,9 @@ btree_redo(XLogReaderState *record)
 		case XLOG_BTREE_SPLIT_R:
 			btree_xlog_split(false, record);
 			break;
+		case XLOG_BTREE_DEDUP_PAGE:
+			btree_xlog_dedup(record);
+			break;
 		case XLOG_BTREE_VACUUM:
 			btree_xlog_vacuum(record);
 			break;
diff --git a/src/backend/access/rmgrdesc/nbtdesc.c b/src/backend/access/rmgrdesc/nbtdesc.c
index a14eb79..802e27b 100644
--- a/src/backend/access/rmgrdesc/nbtdesc.c
+++ b/src/backend/access/rmgrdesc/nbtdesc.c
@@ -30,7 +30,8 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 				xl_btree_insert *xlrec = (xl_btree_insert *) rec;
 
-				appendStringInfo(buf, "off %u", xlrec->offnum);
+				appendStringInfo(buf, "off %u; in_posting_offset %u",
+								 xlrec->offnum, xlrec->in_posting_offset);
 				break;
 			}
 		case XLOG_BTREE_SPLIT_L:
@@ -38,16 +39,27 @@ btree_desc(StringInfo buf, XLogReaderState *record)
 			{
 				xl_btree_split *xlrec = (xl_btree_split *) rec;
 
+				/* FIXME: even master doesn't have newitemoff */
 				appendStringInfo(buf, "level %u, firstright %d",
 								 xlrec->level, xlrec->firstright);
 				break;
 			}
+		case XLOG_BTREE_DEDUP_PAGE:
+			{
+				xl_btree_dedup *xlrec = (xl_btree_dedup *) rec;
+
+				appendStringInfo(buf, "items were deduplicated to %d items",
+								 xlrec->n_intervals);
+				break;
+			}
 		case XLOG_BTREE_VACUUM:
 			{
 				xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;
 
-				appendStringInfo(buf, "lastBlockVacuumed %u",
-								 xlrec->lastBlockVacuumed);
+				appendStringInfo(buf, "lastBlockVacuumed %u; nremaining %u; ndeleted %u",
+								 xlrec->lastBlockVacuumed,
+								 xlrec->nremaining,
+								 xlrec->ndeleted);
 				break;
 			}
 		case XLOG_BTREE_DELETE:
@@ -131,6 +143,9 @@ btree_identify(uint8 info)
 		case XLOG_BTREE_SPLIT_R:
 			id = "SPLIT_R";
 			break;
+		case XLOG_BTREE_DEDUP_PAGE:
+			id = "DEDUPLICATE";
+			break;
 		case XLOG_BTREE_VACUUM:
 			id = "VACUUM";
 			break;
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 52eafe6..d1af18f 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -234,8 +234,7 @@ typedef struct BTMetaPageData
  *  t_tid | t_info | key values | INCLUDE columns, if any
  *
  * t_tid points to the heap TID, which is a tiebreaker key column as of
- * BTREE_VERSION 4.  Currently, the INDEX_ALT_TID_MASK status bit is never
- * set for non-pivot tuples.
+ * BTREE_VERSION 4.
  *
  * All other types of index tuples ("pivot" tuples) only have key columns,
  * since pivot tuples only exist to represent how the key space is
@@ -252,6 +251,38 @@ typedef struct BTMetaPageData
  * omitted rather than truncated, since its representation is different to
  * the non-pivot representation.)
  *
+ * Non-pivot posting tuple format:
+ *  t_tid | t_info | key values | INCLUDE columns, if any | posting_list[]
+ *
+ * In order to store duplicated keys more effectively, we use special format
+ * of tuples - posting tuples.  posting_list is an array of ItemPointerData.
+ *
+ * Deduplication never applies to unique indexes or indexes with INCLUDEd
+ * columns.
+ *
+ * To differ posting tuples we use INDEX_ALT_TID_MASK flag in t_info and
+ * BT_IS_POSTING flag in t_tid.
+ * These flags redefine the content of the posting tuple's tid:
+ * - t_tid.ip_blkid contains offset of the posting list.
+ * - t_tid offset field contains number of posting items this tuple contain
+ *
+ * The 12 least significant offset bits from t_tid are used to represent
+ * the number of posting items in posting tuples, leaving 4 status
+ * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
+ * future use.
+ * BT_N_POSTING_OFFSET_MASK is large enough to store any number of posting
+ * tuples, which is constrainted by BTMaxItemSize.
+
+ * If page contains so many duplicates, that they do not fit into one posting
+ * tuple (bounded by BTMaxItemSize and ), page may contain several posting
+ * tuples with the same key.
+ * Also page can contain both posting and non-posting tuples with the same key.
+ * Currently, posting tuples always contain at least two TIDs in the posting
+ * list.
+ *
+ * Posting tuples always have the same number of attributes as the index has
+ * generally.
+ *
  * Pivot tuple format:
  *
  *  t_tid | t_info | key values | [heap TID]
@@ -281,23 +312,145 @@ typedef struct BTMetaPageData
  * bits (BT_RESERVED_OFFSET_MASK bits), 3 of which that are reserved for
  * future use.  BT_N_KEYS_OFFSET_MASK should be large enough to store any
  * number of columns/attributes <= INDEX_MAX_KEYS.
+ * BT_IS_POSTING bit must be unset for pivot tuples, since we use it
+ * to distinct posting tuples from pivot tuples.
  *
  * Note well: The macros that deal with the number of attributes in tuples
- * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple,
- * and that a tuple without INDEX_ALT_TID_MASK set must be a non-pivot
- * tuple (or must have the same number of attributes as the index has
- * generally in the case of !heapkeyspace indexes).  They will need to be
- * updated if non-pivot tuples ever get taught to use INDEX_ALT_TID_MASK
- * for something else.
+ * assume that a tuple with INDEX_ALT_TID_MASK set must be a pivot tuple or
+ * non-pivot posting tuple, and that a tuple without INDEX_ALT_TID_MASK set
+ * must be a non-pivot tuple (or must have the same number of attributes as
+ * the index has generally in the case of !heapkeyspace indexes).
  */
 #define INDEX_ALT_TID_MASK			INDEX_AM_RESERVED_BIT
 
 /* Item pointer offset bits */
 #define BT_RESERVED_OFFSET_MASK		0xF000
 #define BT_N_KEYS_OFFSET_MASK		0x0FFF
+#define BT_N_POSTING_OFFSET_MASK	0x0FFF
 #define BT_HEAP_TID_ATTR			0x1000
+#define BT_IS_POSTING				0x2000
+
+/*
+ * MaxPostingIndexTuplesPerPage is an upper bound on the number of tuples
+ * that can fit on one btree leaf page.
+ *
+ * Btree leaf pages may contain posting tuples, which store duplicates
+ * in a more effective way, so MaxPostingIndexTuplesPerPage is larger then
+ * MaxIndexTuplesPerPage.
+ *
+ * Each leaf page must contain at least three items, so estimate it as
+ * if we have three posting tuples with minimal size keys.
+ */
+#define MaxPostingIndexTuplesPerPage \
+	((int) ((BLCKSZ - SizeOfPageHeaderData - \
+			3*((MAXALIGN(sizeof(IndexTupleData) + 1) + sizeof(ItemIdData))) )) / \
+			(sizeof(ItemPointerData)))
+
+/*
+ * Helper for BTDedupState.
+ * Each entry represents a group of 'ntups' consecutive items starting on
+ * 'from' offset that were deduplicated into a single posting tuple.
+ */
+typedef struct dedupInterval
+{
+	OffsetNumber from;
+	OffsetNumber ntups;
+} dedupInterval;
+
+/*
+ * Btree-private state needed to build posting tuples.
+ * ipd is a posting list - an array of ItemPointerData.
+ *
+ * Iterating over tuples during index build or applying deduplication to a
+ * single page, we remember a tuple in itupprev, then compare the next one
+ * with it.  If tuples are equal, save their TIDs in the posting list.
+ * ntuples contains the size of the posting list.
+ *
+ * Use maxitemsize and maxpostingsize to ensure that resulting posting tuple
+ * will satisfy BTMaxItemSize.
+ */
+typedef struct BTDedupState
+{
+	Size		maxitemsize;
+	Size		maxpostingsize;
+	IndexTuple	itupprev;
+
+	/*
+	 * array with info about deduplicated items on the page.
+	 *
+	 * It contains one entry for each group of consecutive items that
+	 * were deduplicated into a single posting tuple.
+	 *
+	 * This array is saved to xlog entry, which allows to replay
+	 * deduplication faster without actually comparing tuple's keys.
+	 */
+	dedupInterval dedup_intervals[MaxOffsetNumber];
+	/* current number of items in dedup_intervals array */
+	int			n_intervals;
+	/* temp state variable to keep a 'possible' start of dedup interval */
+	OffsetNumber itupprev_off;
+
+	int			ntuples;
+	ItemPointerData *ipd;
+} BTDedupState;
+
+/*
+ * N.B.: BTreeTupleIsPivot() should only be used in code that deals with
+ * heapkeyspace indexes specifically.  BTreeTupleIsPosting() works with all
+ * nbtree indexes, though.
+ */
+#define BTreeTupleIsPivot(itup)  \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0))\
+	)
+#define BTreeTupleIsPosting(itup)  \
+	( \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0))\
+	)
+
+#define BTreeTupleClearBtIsPosting(itup) \
+	do { \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+		ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & ~BT_IS_POSTING); \
+	} while(0)
+
+#define BTreeTupleGetNPosting(itup)	\
+	( \
+		AssertMacro(BTreeTupleIsPosting(itup)), \
+		ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_POSTING_OFFSET_MASK \
+	)
+#define BTreeTupleSetNPosting(itup, n) \
+	do { \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_POSTING_OFFSET_MASK); \
+		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		Assert(!((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) != 0)); \
+		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
+			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_IS_POSTING); \
+	} while(0)
 
-/* Get/set downlink block number */
+/*
+ * If tuple is posting, t_tid.ip_blkid contains offset of the posting list
+ */
+#define BTreeTupleGetPostingOffset(itup) \
+	( \
+		AssertMacro(BTreeTupleIsPosting(itup)), \
+		ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid)) \
+	)
+#define BTreeSetPostingMeta(itup, nposting, off) \
+	do { \
+		BTreeTupleSetNPosting(itup, nposting); \
+		Assert(BTreeTupleIsPosting(itup)); \
+		ItemPointerSetBlockNumber(&((itup)->t_tid), (off)); \
+	} while(0)
+
+#define BTreeTupleGetPosting(itup) \
+	(ItemPointer) ((char*) (itup) + BTreeTupleGetPostingOffset(itup))
+#define BTreeTupleGetPostingN(itup,n) \
+	(BTreeTupleGetPosting(itup) + (n))
+
+/* Get/set downlink block number  */
 #define BTreeInnerTupleGetDownLink(itup) \
 	ItemPointerGetBlockNumberNoCheck(&((itup)->t_tid))
 #define BTreeInnerTupleSetDownLink(itup, blkno) \
@@ -326,40 +479,73 @@ typedef struct BTMetaPageData
  */
 #define BTreeTupleGetNAtts(itup, rel)	\
 	( \
-		(itup)->t_info & INDEX_ALT_TID_MASK ? \
+		((itup)->t_info & INDEX_ALT_TID_MASK && \
+		((ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_IS_POSTING) == 0)) ? \
 		( \
 			ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_N_KEYS_OFFSET_MASK \
 		) \
 		: \
 		IndexRelationGetNumberOfAttributes(rel) \
 	)
-#define BTreeTupleSetNAtts(itup, n) \
-	do { \
-		(itup)->t_info |= INDEX_ALT_TID_MASK; \
-		ItemPointerSetOffsetNumber(&(itup)->t_tid, (n) & BT_N_KEYS_OFFSET_MASK); \
-	} while(0)
+
+static inline void
+BTreeTupleSetNAtts(IndexTuple itup, int n)
+{
+	Assert(!BTreeTupleIsPosting(itup));
+	itup->t_info |= INDEX_ALT_TID_MASK;
+	ItemPointerSetOffsetNumber(&itup->t_tid, n & BT_N_KEYS_OFFSET_MASK);
+}
 
 /*
- * Get tiebreaker heap TID attribute, if any.  Macro works with both pivot
- * and non-pivot tuples, despite differences in how heap TID is represented.
+ * Get tiebreaker heap TID attribute, if any.  Works with both pivot and
+ * non-pivot tuples, despite differences in how heap TID is represented.
+ *
+ * This returns the first/lowest heap TID in the case of a posting list tuple.
  */
-#define BTreeTupleGetHeapTID(itup) \
-	( \
-	  (itup)->t_info & INDEX_ALT_TID_MASK && \
-	  (ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) & BT_HEAP_TID_ATTR) != 0 ? \
-	  ( \
-		(ItemPointer) (((char *) (itup) + IndexTupleSize(itup)) - \
-					   sizeof(ItemPointerData)) \
-	  ) \
-	  : (itup)->t_info & INDEX_ALT_TID_MASK ? NULL : (ItemPointer) &((itup)->t_tid) \
-	)
+static inline ItemPointer
+BTreeTupleGetHeapTID(IndexTuple itup)
+{
+	if (BTreeTupleIsPivot(itup))
+	{
+		/* Pivot tuple heap TID representation? */
+		if ((ItemPointerGetOffsetNumberNoCheck(&itup->t_tid) &
+			 BT_HEAP_TID_ATTR) != 0)
+			return (ItemPointer) ((char *) itup + IndexTupleSize(itup) -
+								  sizeof(ItemPointerData));
+
+		/* Heap TID attribute was truncated */
+		return NULL;
+	}
+	else if (BTreeTupleIsPosting(itup))
+		return BTreeTupleGetPosting(itup);
+
+	return &(itup->t_tid);
+}
+
+/*
+ * Get maximum heap TID attribute, which could be the only TID in the case of
+ * a non-pivot tuple that does not have a posting list tuple.  Works with
+ * non-pivot tuples only.
+ */
+static inline ItemPointer
+BTreeTupleGetMaxTID(IndexTuple itup)
+{
+	Assert(!BTreeTupleIsPivot(itup));
+
+	if (BTreeTupleIsPosting(itup))
+		return (ItemPointer) (BTreeTupleGetPosting(itup) +
+							  (BTreeTupleGetNPosting(itup) - 1));
+
+	return &(itup->t_tid);
+}
+
 /*
  * Set the heap TID attribute for a tuple that uses the INDEX_ALT_TID_MASK
- * representation (currently limited to pivot tuples)
+ * representation
  */
 #define BTreeTupleSetAltHeapTID(itup) \
 	do { \
-		Assert((itup)->t_info & INDEX_ALT_TID_MASK); \
+		Assert(BTreeTupleIsPivot(itup)); \
 		ItemPointerSetOffsetNumber(&(itup)->t_tid, \
 								   ItemPointerGetOffsetNumberNoCheck(&(itup)->t_tid) | BT_HEAP_TID_ATTR); \
 	} while(0)
@@ -500,6 +686,13 @@ typedef struct BTInsertStateData
 	Buffer		buf;
 
 	/*
+	 * if _bt_binsrch_insert() found the location inside existing posting
+	 * list, save the position inside the list.  This will be -1 in rare cases
+	 * where the overlapping posting list is LP_DEAD.
+	 */
+	int			in_posting_offset;
+
+	/*
 	 * Cache of bounds within the current buffer.  Only used for insertions
 	 * where _bt_check_unique is called.  See _bt_binsrch_insert and
 	 * _bt_findinsertloc for details.
@@ -534,7 +727,9 @@ typedef BTInsertStateData *BTInsertState;
  * If we are doing an index-only scan, we save the entire IndexTuple for each
  * matched item, otherwise only its heap TID and offset.  The IndexTuples go
  * into a separate workspace array; each BTScanPosItem stores its tuple's
- * offset within that array.
+ * offset within that array.  Posting list tuples store a version of the
+ * tuple that does not include the posting list, allowing the same key to be
+ * returned for each logical tuple associated with the posting list.
  */
 
 typedef struct BTScanPosItem	/* what we remember about each match */
@@ -563,9 +758,13 @@ typedef struct BTScanPosData
 
 	/*
 	 * If we are doing an index-only scan, nextTupleOffset is the first free
-	 * location in the associated tuple storage workspace.
+	 * location in the associated tuple storage workspace.  Posting list
+	 * tuples need postingTupleOffset to store the current location of the
+	 * tuple that is returned multiple times (once per heap TID in posting
+	 * list).
 	 */
 	int			nextTupleOffset;
+	int			postingTupleOffset;
 
 	/*
 	 * The items array is always ordered in index order (ie, increasing
@@ -578,7 +777,7 @@ typedef struct BTScanPosData
 	int			lastItem;		/* last valid index in items[] */
 	int			itemIndex;		/* current index in items[] */
 
-	BTScanPosItem items[MaxIndexTuplesPerPage]; /* MUST BE LAST */
+	BTScanPosItem items[MaxPostingIndexTuplesPerPage];	/* MUST BE LAST */
 } BTScanPosData;
 
 typedef BTScanPosData *BTScanPos;
@@ -732,6 +931,9 @@ extern bool _bt_doinsert(Relation rel, IndexTuple itup,
 						 IndexUniqueCheck checkUnique, Relation heapRel);
 extern Buffer _bt_getstackbuf(Relation rel, BTStack stack, BlockNumber child);
 extern void _bt_finish_split(Relation rel, Buffer bbuf, BTStack stack);
+extern IndexTuple _bt_form_newposting(IndexTuple itup, IndexTuple oposting,
+				   OffsetNumber in_posting_offset);
+extern void _bt_dedup_insert(Page page, BTDedupState *dedupState);
 
 /*
  * prototypes for functions in nbtsplitloc.c
@@ -762,6 +964,8 @@ extern void _bt_delitems_delete(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems, Relation heapRel);
 extern void _bt_delitems_vacuum(Relation rel, Buffer buf,
 								OffsetNumber *itemnos, int nitems,
+								OffsetNumber *remainingoffset,
+								IndexTuple *remaining, int nremaining,
 								BlockNumber lastBlockVacuumed);
 extern int	_bt_pagedel(Relation rel, Buffer buf);
 
@@ -812,6 +1016,9 @@ extern bool _bt_check_natts(Relation rel, bool heapkeyspace, Page page,
 							OffsetNumber offnum);
 extern void _bt_check_third_page(Relation rel, Relation heap,
 								 bool needheaptidspace, Page page, IndexTuple newtup);
+extern IndexTuple BTreeFormPostingTuple(IndexTuple tuple, ItemPointerData *ipd,
+										int nipd);
+extern IndexTuple BTreeGetNthTupleOfPosting(IndexTuple tuple, int n);
 
 /*
  * prototypes for functions in nbtvalidate.c
@@ -824,5 +1031,7 @@ extern bool btvalidate(Oid opclassoid);
 extern IndexBuildResult *btbuild(Relation heap, Relation index,
 								 struct IndexInfo *indexInfo);
 extern void _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc);
+extern void _bt_stash_item_tid(BTDedupState *dedupState, IndexTuple itup,
+							   OffsetNumber itup_offnum);
 
 #endif							/* NBTREE_H */
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index afa614d..075baaf 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -28,7 +28,8 @@
 #define XLOG_BTREE_INSERT_META	0x20	/* same, plus update metapage */
 #define XLOG_BTREE_SPLIT_L		0x30	/* add index tuple with split */
 #define XLOG_BTREE_SPLIT_R		0x40	/* as above, new item on right */
-/* 0x50 and 0x60 are unused */
+#define XLOG_BTREE_DEDUP_PAGE	0x50	/* compactify tuples on the page */
+/* 0x60 is unused */
 #define XLOG_BTREE_DELETE		0x70	/* delete leaf index tuples for a page */
 #define XLOG_BTREE_UNLINK_PAGE	0x80	/* delete a half-dead page */
 #define XLOG_BTREE_UNLINK_PAGE_META 0x90	/* same, and update metapage */
@@ -61,16 +62,21 @@ typedef struct xl_btree_metadata
  * This data record is used for INSERT_LEAF, INSERT_UPPER, INSERT_META.
  * Note that INSERT_META implies it's not a leaf page.
  *
- * Backup Blk 0: original page (data contains the inserted tuple)
+ * Backup Blk 0: original page (data contains the inserted tuple);
+ *				 if in_posting_offset is valid, this is an insertion
+ *				 into existing posting tuple at offnum.
+ *				 redo must repeat logic of bt_insertonpg().
  * Backup Blk 1: child's left sibling, if INSERT_UPPER or INSERT_META
  * Backup Blk 2: xl_btree_metadata, if INSERT_META
+ *
  */
 typedef struct xl_btree_insert
 {
 	OffsetNumber offnum;
+	OffsetNumber in_posting_offset;
 } xl_btree_insert;
 
-#define SizeOfBtreeInsert	(offsetof(xl_btree_insert, offnum) + sizeof(OffsetNumber))
+#define SizeOfBtreeInsert	(offsetof(xl_btree_insert, in_posting_offset) + sizeof(OffsetNumber))
 
 /*
  * On insert with split, we save all the items going into the right sibling
@@ -96,6 +102,11 @@ typedef struct xl_btree_insert
  * An IndexTuple representing the high key of the left page must follow with
  * either variant.
  *
+ * In case, split included insertion into the middle of the posting tuple, and
+ * thus required posting tuple replacement, it also contains 'in_posting_offset',
+ * that is used to form replacing tuple and repean bt_insertonpg() logic.
+ * It is added to xlog only if replacing item remains on the left page.
+ *
  * Backup Blk 1: new right page
  *
  * The right page's data portion contains the right page's tuples in the form
@@ -113,9 +124,26 @@ typedef struct xl_btree_split
 	uint32		level;			/* tree level of page being split */
 	OffsetNumber firstright;	/* first item moved to right page */
 	OffsetNumber newitemoff;	/* new item's offset (if placed on left page) */
+	OffsetNumber in_posting_offset; /* offset inside posting tuple  */
 } xl_btree_split;
 
-#define SizeOfBtreeSplit	(offsetof(xl_btree_split, newitemoff) + sizeof(OffsetNumber))
+#define SizeOfBtreeSplit	(offsetof(xl_btree_split, in_posting_offset) + sizeof(OffsetNumber))
+
+/*
+ * When page is deduplicated, consecutive groups of tuples with equal keys
+ * are compactified into posting tuples.
+ * The WAL record keeps number of resulting posting tuples - n_intervals
+ * followed by array of dedupInterval structures, that hold information
+ * needed to replay page deduplication without extra comparisons of tuples keys.
+ */
+typedef struct xl_btree_dedup
+{
+	int			n_intervals;
+
+	/* TARGET DEDUP INTERVALS FOLLOW AT THE END */
+} xl_btree_dedup;
+#define SizeOfBtreeDedup (sizeof(int))
+
 
 /*
  * This is what we need to know about delete of individual leaf index tuples.
@@ -173,10 +201,19 @@ typedef struct xl_btree_vacuum
 {
 	BlockNumber lastBlockVacuumed;
 
-	/* TARGET OFFSET NUMBERS FOLLOW */
+	/*
+	 * This field helps us to find beginning of the remaining tuples from
+	 * postings which follow array of offset numbers.
+	 */
+	uint32		nremaining;
+	uint32		ndeleted;
+
+	/* REMAINING OFFSET NUMBERS FOLLOW (nremaining values) */
+	/* REMAINING TUPLES TO INSERT FOLLOW (if nremaining > 0) */
+	/* TARGET OFFSET NUMBERS FOLLOW (if any) */
 } xl_btree_vacuum;
 
-#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, lastBlockVacuumed) + sizeof(BlockNumber))
+#define SizeOfBtreeVacuum	(offsetof(xl_btree_vacuum, ndeleted) + sizeof(BlockNumber))
 
 /*
  * This is what we need to know about marking an empty branch for deletion.
diff --git a/src/tools/valgrind.supp b/src/tools/valgrind.supp
index ec47a22..71a03e3 100644
--- a/src/tools/valgrind.supp
+++ b/src/tools/valgrind.supp
@@ -212,3 +212,24 @@
    Memcheck:Cond
    fun:PyObject_Realloc
 }
+
+# Temporarily work around bug in datum_image_eq's handling of the cstring
+# (typLen == -2) case.  datumIsEqual() is not affected, but also doesn't handle
+# TOAST'ed values correctly.
+#
+# FIXME: Remove both suppressions when bug is fixed on master branch
+{
+   temporary_workaround_1
+   Memcheck:Addr1
+   fun:bcmp
+   fun:datum_image_eq
+   fun:_bt_keep_natts_fast
+}
+
+{
+   temporary_workaround_8
+   Memcheck:Addr8
+   fun:bcmp
+   fun:datum_image_eq
+   fun:_bt_keep_natts_fast
+}

Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

Reply via email to