On Mon, Sep 7, 2015 at 9:20 PM, Peter Geoghegan <p...@heroku.com> wrote:
>>> This matters because a major cost during CREATE INDEX CONCURRENTLY is
>>> a TID-based datum sort (this is probably most of the cost over and
>>> above a conventional CREATE INDEX).
>> Might be better to hack a special case right there (ie, embed TIDs into
>> int8s and sort the int8s) rather than try to change the type's SQL
>> declaration.
> That sounds at least as effective as what I originally sketched.

> I hope someone picks this up soon.

I suggested to someone else that he take a look at this as a project,
but I guess he was busy too, so I decided to just do it myself. Patch
is attached. It has the bulkdelete callback that gathers TIDs from the
index during CREATE INDEX CONCURRENTLY encode them as int8 values
ahead of the sort; the sort then operates on plain int8 values, which
are decoded back into TIDs later, as needed.
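The ordering-preservation property the encoding relies on is easy to
demonstrate in isolation. Here is a stand-alone sketch (hypothetical
helper names; plain fixed-width types stand in for PostgreSQL's
BlockNumber/OffsetNumber and the ItemPointer macros) that packs a
(block, offset) pair into an int64 with the block number in the high
bits, so comparing encoded values is equivalent to comparing the
original TIDs:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Pack a TID-like (block, offset) pair into a single int64.  The block
 * number occupies bits 16..47 and the offset the low 16 bits, so the
 * natural int64 ordering matches (block, offset) lexicographic ordering,
 * and the encoded value is always non-negative.
 */
static inline int64_t
tid_encode(uint32_t block, uint16_t offset)
{
	return ((int64_t) block << 16) | offset;
}

/* Recover the original pair from the encoded value. */
static inline void
tid_decode(int64_t encoded, uint32_t *block, uint16_t *offset)
{
	*block = (uint32_t) (encoded >> 16);
	*offset = (uint16_t) (encoded & 0xFFFF);
}
```

Sorting an array of these int64 values with any ordinary integer
comparison then yields the same order as sorting the TIDs themselves,
with no per-datum allocation.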

This turns out to be a relatively simple patch. Testing shows it
removes *most* of the overhead of CREATE INDEX CONCURRENTLY over
CREATE INDEX. In absolute terms, a test case involving a CREATE INDEX
CONCURRENTLY on a single integer column takes about 30% - 40% less
time with the patch applied. The TID sort itself is about 3 times
faster, and that's without the benefit of the patch making the TID
sort an internal sort where it would otherwise have been an external
sort. Sorting item pointers as TIDs, as happens today, wastes a lot of
memory (not just memory bandwidth): a pass-by-value datum sort only
ever allocates memory for SortTuples, and never for a "tuple proper".
Just making the sort pass-by-value is clearly
*much* faster. As comments above process_ordered_aggregate_single()
put it:

 * The one-input case is handled separately from the multi-input case
 * for performance reasons: for single by-value inputs, such as the
 * common case of count(distinct id), the tuplesort_getdatum code path
 * is around 300% faster.  (The speedup for by-reference types is less
 * but still noticeable.)

Another way of stating how things are improved here is: my CREATE
INDEX CONCURRENTLY test case had about a 2.1x overhead over an
equivalent CREATE INDEX on master, but with the patch applied the
CREATE INDEX CONCURRENTLY has an overhead of only about 1.3x. The
extra logical I/O for CONCURRENTLY's second scan, and for the
physical-order index scan (to "bulk delete", to gather TIDs to sort)
is a surprisingly small cost.

I'll add the patch to the open commitfest. Think of all these patches
of mine as giving reviewers a choice of what to review. This patch
does seem like a slam dunk, even if I do say so myself, so at least
it's relatively easy to review. The external sort work remains my most
interesting pending work, though.

Peter Geoghegan
From 3c7551435394090a4c2d3e7588d018ea9dca2482 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <peter.geoghega...@gmail.com>
Date: Sat, 14 Nov 2015 16:57:20 -0800

Add a simple TID encoding scheme.  This is used for encoding and
decoding TIDs to and from int8/int64 values, for the benefit of an
expensive tuplesort operation performed by CREATE INDEX CONCURRENTLY.
The sort now uses the default int8 opclass to sort values in the ad-hoc
encoding, and so benefits from the availability of int8 SortSupport.

Despite the fact that item pointers take up only 6 bytes on disk, the
TID type is always pass-by-reference due to considerations about how
Datums are represented and accessed.  On the other hand, int8 is
usually, in practice, a pass-by-value type.

Testing shows this approach makes CREATE INDEX CONCURRENTLY
significantly faster on platforms where int8 is pass-by-value,
especially when the big saving in memory within tuplesort.c prevents the
sort from being performed externally.  The approach will help to some
degree on all platforms, though.
 src/backend/catalog/index.c | 56 +++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 52 insertions(+), 4 deletions(-)

diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index e59b163..7429f3c 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -109,6 +109,8 @@ static void index_update_stats(Relation rel,
 static void IndexCheckExclusion(Relation heapRelation,
 					Relation indexRelation,
 					IndexInfo *indexInfo);
+static inline int64 itemptr_encode(ItemPointer itemptr);
+static inline void itemptr_decode(ItemPointer itemptr, int64 encoded);
 static bool validate_index_callback(ItemPointer itemptr, void *opaque);
 static void validate_index_heapscan(Relation heapRelation,
 						Relation indexRelation,
@@ -2832,7 +2834,13 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	ivinfo.num_heap_tuples = heapRelation->rd_rel->reltuples;
 	ivinfo.strategy = NULL;
-	state.tuplesort = tuplesort_begin_datum(TIDOID, TIDLessOperator,
+	/*
+	 * Encode TIDs as int8 values for the sort, rather than directly sorting
+	 * item pointers.  This can be significantly faster, primarily because TID
+	 * is a pass-by-reference type on all platforms, whereas int8 is
+	 * pass-by-value on most platforms.
+	 */
+	state.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
 											InvalidOid, false,
@@ -2872,14 +2880,45 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
+/*
+ * itemptr_encode - Encode ItemPointer as int64/int8
+ *
+ * This representation must produce values encoded as int64 that sort in the
+ * same order as their corresponding original TID values would.
+ */
+static inline int64
+itemptr_encode(ItemPointer itemptr)
+{
+	BlockNumber		block = ItemPointerGetBlockNumber(itemptr);
+	OffsetNumber	offset = ItemPointerGetOffsetNumber(itemptr);
+	int64			encoded;
+
+	encoded = ((int64) block << 16) | offset;
+
+	return encoded;
+}
+/*
+ * itemptr_decode - Decode int64/int8 representation back to ItemPointer
+ */
+static inline void
+itemptr_decode(ItemPointer itemptr, int64 encoded)
+{
+	BlockNumber		block = (BlockNumber) (encoded >> 16);
+	OffsetNumber	offset = (OffsetNumber) (encoded & 0xFFFF);
+
+	ItemPointerSet(itemptr, block, offset);
+}
 /*
  * validate_index_callback - bulkdelete callback to collect the index TIDs
  */
 static bool
 validate_index_callback(ItemPointer itemptr, void *opaque)
 {
 	v_i_state  *state = (v_i_state *) opaque;
+	int64		encoded = itemptr_encode(itemptr);
 
-	tuplesort_putdatum(state->tuplesort, PointerGetDatum(itemptr), false);
+	tuplesort_putdatum(state->tuplesort, Int64GetDatum(encoded), false);
 	state->itups += 1;
 	return false;				/* never actually delete anything */
 }
@@ -2911,6 +2950,7 @@ validate_index_heapscan(Relation heapRelation,
 	/* state variables for the merge */
 	ItemPointer indexcursor = NULL;
+	ItemPointerData		decoded;
 	bool		tuplesort_empty = false;
@@ -3020,13 +3060,21 @@ validate_index_heapscan(Relation heapRelation,
 				if (ItemPointerGetBlockNumber(indexcursor) == root_blkno)
 					in_index[ItemPointerGetOffsetNumber(indexcursor) - 1] = true;
-				pfree(indexcursor);
 			tuplesort_empty = !tuplesort_getdatum(state->tuplesort, true,
 												  &ts_val, &ts_isnull);
 			Assert(tuplesort_empty || !ts_isnull);
-			indexcursor = (ItemPointer) DatumGetPointer(ts_val);
+			if (!tuplesort_empty)
+			{
+				itemptr_decode(&decoded, DatumGetInt64(ts_val));
+				indexcursor = &decoded;
+
+				/* If int8 is pass-by-ref, free (encoded) TID Datum memory */
+#ifndef USE_FLOAT8_BYVAL
+				pfree(DatumGetPointer(ts_val));
+#endif
+			}
