Re: [HACKERS] Inlining comparators as a performance optimisation

Peter Geoghegan Sat, 26 Nov 2011 10:58:39 -0800

3 items are attached:

1. A spreadsheet, results.ods, which has results for automated
multiple runs of the "doubled up" dellstore orderlines queries (where
orderlines is 385 MB). Results are sorted, with the median (or the
lower middle value, didn't get the mean of the two middle runs) result
for each run highlighted as representative. There are 3 builds of
Postgres (HEAD, not inlined, optimized), each on its own sheet in the
spreadsheet. The cache was warmed for each query, and subsequently the
query was run 20 times.


2. A python script I hacked together that should run on anything
around Python 2.5+, with a single dependency, psycopg2. Look at --help
to see how it works. This is the script that actually generated the
figures that went into the spreadsheet, by being run once for each
type of build. You can fairly easily play along at home with this, for
these and other queries. It will spit out CSV files. It is designed to
warm the cache (by default 3 times before each query) to eliminate
caching effects. You can add your own query to the python list to have
it run by the script, to generate the same set of figures for that
query. I'm rather curious as to how much of an impact this
optimisation will have on queries with unique nodes, joins and
grouping when they rely on a sort node for their input, in the real
world. Certainly, a query need not have an ORDER BY clause to see
benefits, perhaps substantial benefits. The logic to parse sort node
details is rather unsophisticated right now though, due to my focus on
the obvious ORDER BY case.

3. The revision of the patch that was actually tested, now with
inlining specialisations for the single scanKey case, and a
non-inlined specialisation for multiple scanKeys where we can still
benefit from eliding the SQL function call machinery for the first
scanKey, which is often almost as useful as being able to do so for
all scanKeys. It also selects a sorting specialisation when it can in
tuplesort_begin_heap, per Robert's suggestion.

Reviewers will want to comment out line 731 of tuplesort.c, "#define
optimize", to quickly get unoptimized behaviour for comparative
purposes. Furthermore, if you'd like to see what this would look like
without inlining, you can simply comment out assignments of inline
variants of specialisations (i.e. int4inlqsort_arg and
int8inlqsort_arg) in tuplesort_begin_heap.

It has been suggested that I'm chasing diminishing returns by
inlining, as I go further for a smaller benefit. Of course, that's
true. However, I believe that I'm not chasing them past the point
where that ceases to make sense, and these figures support that
contention - chasing diminishing returns in the nature of this kind of
work. Here, the additional benefit of inlining accounts for over an
entire second shaved off a query that was originally 7056ms, so that's
not to be sniffed at.

I'll reiterate that qsort_arg has only been modified 4 times after its
initial commit in 2006, and these were all trivial changes. While the
way that I ape generic programming with the preprocessor is on the
ugly side, what I've done can be much more easily understood with
reference to qsort_arg itself. Robert said that duplicating the sort
function was "iffy". However, that already happened long ago, as we're
already maintaining both qsort_arg.c and qsort.c, and comments already
urge maintainers to keep the two consistent. That's not been a problem
though, because, as I've said, there has never been any call to make
substantive changes to either in all those years.

We could probably further reduce the code footprint of all of this by
having template_qsort_arg.h generate comparators directly. That might
be inflexible in a way that turns out to matter though, like if we
wanted to use this for non-HAVE_INT64_TIMESTAMP timestamps and had to
add NaN crud.

My next step is to see how this goes with hundreds of sorts in the
hundreds of megabytes on a high-end server. I don't have immediate
access to one, but I'm working that out.

-- 
Peter Geoghegan       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training and Services

From 92185762297ba9819eb7f1b208db03663d9fdfa8 Mon Sep 17 00:00:00 2001
From: peter2ndQuadrant <pe...@2ndquadrant.com>
Date: Sun, 20 Nov 2011 20:36:25 +0000
Subject: [PATCH] Initial commit of optimization

Stop directly using oid

Added int8 quicksort fast path specialisation, which can also be used in
place of F_TIMESTAMP_CMP for HAVE_INT64_TIMESTAMP builds.

Rebased, revised patch for -hackers, with timestamp and int8 fast path
sorting using the same int8 specialization.

Remove unneeded line

Rebase for -hackers

Descriminate against cases where nKeys > 1 when selecting sort
specialisation.

We now support fast path sorting with multiple scanKeys.

We now inline for one scanKey, but do not inline our comparetup_heap
replacement when multiple scanKeys are used (although this is at the
compiler's discretion). However, when multiple scanKeys are used, we
still use a specialisation that will elide the SQL-function-call
machinery for the first scanKey, which is almost as good for most cases.

Further formatting clean-ups

Move specialization selection logic to tuplesort_begin_heap(), per
suggestion of Robert Haas

Add simple switch for testing optimization.

Improvements to template qsort_arg comments. Support fast path sorting
of dates.

Added float fast path sorting

Patch mimimization.

Additional clean-up.
---
 src/backend/utils/sort/tuplesort.c     |  156 ++++++++++++++++-
 src/include/utils/template_qsort_arg.h |  290 ++++++++++++++++++++++++++++++++
 src/port/qsort.c                       |    3 +-
 3 files changed, 440 insertions(+), 9 deletions(-)
 create mode 100644 src/include/utils/template_qsort_arg.h

diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 3505236..eb36e17 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -107,11 +107,13 @@
 #include "miscadmin.h"
 #include "pg_trace.h"
 #include "utils/datum.h"
+#include "utils/fmgroids.h"
 #include "utils/logtape.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/pg_rusage.h"
 #include "utils/rel.h"
+#include "utils/template_qsort_arg.h"
 #include "utils/tuplesort.h"
 
 
@@ -226,6 +228,13 @@ struct Tuplesortstate
 										   Tuplesortstate *state);
 
 	/*
+	 * This specialization function pointer is sometimes used as an alternative
+	 * to the standard qsort_arg, when it has been determined that we can 
+	 * benefit from various per-type performance optimizations.
+	 */
+	void (*qsort_arg_spec)(void *a, size_t n, size_t es, void *arg);
+
+	/*
 	 * Function to copy a supplied input tuple into palloc'd space and set up
 	 * its SortTuple representation (ie, set tuple/datum1/isnull1).  Also,
 	 * state->availMem must be decreased by the amount of space used for the
@@ -457,6 +466,11 @@ static void tuplesort_heap_insert(Tuplesortstate *state, SortTuple *tuple,
 static void tuplesort_heap_siftup(Tuplesortstate *state, bool checkIndex);
 static unsigned int getlen(Tuplesortstate *state, int tapenum, bool eofOK);
 static void markrunend(Tuplesortstate *state, int tapenum);
+static inline int32 inlineApplySortFunction(FmgrInfo *sortFunction, 
+						int sk_flags, Oid collation,
+						Datum datum1, bool isNull1,
+						Datum datum2, bool isNull2);
+
 static int comparetup_heap(const SortTuple *a, const SortTuple *b,
 				Tuplesortstate *state);
 static void copytup_heap(Tuplesortstate *state, SortTuple *stup, void *tup);
@@ -576,6 +590,71 @@ tuplesort_begin_common(int workMem, bool randomAccess)
 	return state;
 }
 
+/*
+ * Manufacture type specific sorting specialisations
+ * with inline comparators
+ */
+static inline int
+compare_int4(Datum first, Datum second)
+{
+	int32           first_int = DatumGetInt32(first);
+	int32           second_int = DatumGetInt32(second);
+	if (first_int > second_int)
+		return 1;
+	else if (first_int == second_int)
+		return 0;
+	else
+		return -1;
+}
+
+static inline int
+compare_int8(Datum first, Datum second)
+{
+	int64           first_int = DatumGetInt64(first);
+	int64           second_int = DatumGetInt64(second);
+
+	if (first_int > second_int)
+		return 1;
+	else if (first_int == second_int)
+		return 0;
+	else
+		return -1;
+}
+
+static inline int
+compare_float4(Datum first, Datum second)
+{
+	float4 f = DatumGetInt64(first);
+	float4 s = DatumGetInt64(second);
+
+
+	if (f > s)
+		return 1;
+	else if (f == s)
+		return 0;
+	else
+		return -1;
+}
+
+static inline int
+compare_float8(Datum first, Datum second)
+{
+	float8 f = DatumGetInt64(first);
+	float8 s = DatumGetInt64(second);
+
+	if (f > s)
+		return 1;
+	else if (f == s)
+		return 0;
+	else
+		return -1;
+}
+
+TEMPLATE_QSORT_ARG(int4, compare_int4);
+TEMPLATE_QSORT_ARG(int8, compare_int8);
+TEMPLATE_QSORT_ARG(float4, compare_float4);
+TEMPLATE_QSORT_ARG(float8, compare_float8);
+
 Tuplesortstate *
 tuplesort_begin_heap(TupleDesc tupDesc,
 					 int nkeys, AttrNumber *attNums,
@@ -649,7 +728,43 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 							   sortFunction,
 							   (Datum) 0);
 	}
-
+#define optimize
+
+#ifdef optimize
+	/* Select a qsort specialization, if possible */
+	if (state->scanKeys &&
+		  /* Some comparison that has an underlying int4 representation */
+		(
+		  state->scanKeys->sk_func.fn_oid == F_BTINT4CMP
+			|| state->scanKeys->sk_func.fn_oid == F_DATE_CMP
+		)
+		)
+	{
+		if (state->nKeys == 1)
+			state->qsort_arg_spec = int4inlqsort_arg;
+		else
+			state->qsort_arg_spec = int4regqsort_arg;
+	}
+	else if (state->scanKeys &&
+			  /* Some comparison that has an underlying int8 representation */
+			(
+			  state->scanKeys->sk_func.fn_oid == F_BTINT8CMP
+#ifdef HAVE_INT64_TIMESTAMP
+			  || state->scanKeys->sk_func.fn_oid == F_TIMESTAMP_CMP
+#endif
+		
+			))
+	{
+		if (state->nKeys == 1)
+			state->qsort_arg_spec = int8inlqsort_arg;
+		else
+			state->qsort_arg_spec = int8regqsort_arg;
+	}
+	else
+	{
+		state->qsort_arg_spec = NULL;
+	}
+#endif
 	MemoryContextSwitchTo(oldcontext);
 
 	return state;
@@ -1225,7 +1340,7 @@ puttuple_common(Tuplesortstate *state, SortTuple *tuple)
 }
 
 /*
- * All tuples have been provided; finish the sort.
+ * All tuples have been provided; perform the sort.
  */
 void
 tuplesort_performsort(Tuplesortstate *state)
@@ -1247,11 +1362,28 @@ tuplesort_performsort(Tuplesortstate *state)
 			 * amount of memory.  Just qsort 'em and we're done.
 			 */
 			if (state->memtupcount > 1)
-				qsort_arg((void *) state->memtuples,
-						  state->memtupcount,
-						  sizeof(SortTuple),
-						  (qsort_arg_comparator) state->comparetup,
-						  (void *) state);
+			{
+				/* Consider a sorting specialization */
+				if (state->qsort_arg_spec)
+				{
+					state->qsort_arg_spec((void *) state->memtuples,
+								  state->memtupcount,
+								  sizeof(SortTuple),
+								  (void *) state);
+				}
+				else
+				{
+					/*
+					 * Fall back on type-neutral qsort with
+					 * function-pointer comparator
+					 */
+					qsort_arg((void *) state->memtuples,
+								  state->memtupcount,
+								  sizeof(SortTuple),
+								  (qsort_arg_comparator) state->comparetup,
+								  (void *) state);
+				}
+			}
 			state->current = 0;
 			state->eof_reached = false;
 			state->markpos_offset = 0;
@@ -2657,6 +2789,9 @@ myFunctionCall2Coll(FmgrInfo *flinfo, Oid collation, Datum arg1, Datum arg2)
  * and return a 3-way comparison result.  This takes care of handling
  * reverse-sort and NULLs-ordering properly.  We assume that DESC and
  * NULLS_FIRST options are encoded in sk_flags the same way btree does it.
+ *
+ * Note that this logic is more-or-less duplicated in template_qsort_arg.h,
+ * and "instantiated" per-type there.
  */
 static inline int32
 inlineApplySortFunction(FmgrInfo *sortFunction, int sk_flags, Oid collation,
@@ -2708,10 +2843,15 @@ ApplySortFunction(FmgrInfo *sortFunction, int sortFlags, Oid collation,
 }
 
 
+
+
 /*
  * Routines specialized for HeapTuple (actually MinimalTuple) case
+ *
+ * N.B. : Some of this code is partially duplicated for the purposes 
+ * of heap comparison specializations. Take care to keep the code
+ * within template_qsort_arg.h consistent with this code.
  */
-
 static int
 comparetup_heap(const SortTuple *a, const SortTuple *b, Tuplesortstate *state)
 {
diff --git a/src/include/utils/template_qsort_arg.h b/src/include/utils/template_qsort_arg.h
new file mode 100644
index 0000000..3554707
--- /dev/null
+++ b/src/include/utils/template_qsort_arg.h
@@ -0,0 +1,290 @@
+/*-------------------------------------------------------------------------
+ *	template_qsort_arg.h: "template" version of qsort_arg.c
+ *
+ *	This version of qsort_arg is exclusively used within tuplesort.c to
+ *	more efficiently sort common types such as integers and floats. In
+ *	providing this version, we seek to take advantage of compile-time
+ *	optimisations for specific types, in particular, the inlining of
+ *	comparators, as well as reducing the overhead of comparisons that
+ *	the general SQL-callable-function API imposes.
+ *
+ *  The TEMPLATE_QSORT_ARG() macro generates an inlining variant (for
+ *  clients with a single scanKey) and non-inlining variant for sorts
+ *  with multiple scanKeys.
+ *
+ *  We rely on the availability of various functions within tuplesort.c,
+ *  and indeed we partially duplicate some code from there, so this file
+ *  should be considered private to that module, rather than a generic
+ *  piece of infrastructure.
+ *
+ *	CAUTION: if you change this file, see also qsort_arg.c as well as
+ *	qsort.c. qsort_arg.c should be considered authoratative.
+ *
+ *	Portions Copyright (c) 1996-2011, PostgreSQL Global Development Group
+ *	Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *	src/include/utils/template_qsort_arg.h
+ *-------------------------------------------------------------------------
+ */
+
+#include "c.h"
+
+#define swapcode(TYPE, parmi, parmj, n) \
+do {									\
+	size_t i = (n) / sizeof (TYPE);		\
+	TYPE *pi = (TYPE *)(void *)(parmi);	\
+	TYPE *pj = (TYPE *)(void *)(parmj);	\
+	do {								\
+		TYPE	t = *pi;				\
+		*pi++ = *pj;					\
+		*pj++ = t;						\
+		} while (--i > 0);				\
+} while (0)
+
+#define SWAPINIT(a, es) swaptype = ((char *)(a) - (char *)0) % sizeof(long) || \
+	(es) % sizeof(long) ? 2 : (es) == sizeof(long)? 0 : 1;
+
+#define thistype_vecswap(TYPE, a, b, n, SPEC_VAR)							\
+	if ((n) > 0) TYPE##SPEC_VAR##swapfunc((a), (b), (size_t)(n), swaptype) 	
+
+#define thistype_swap(TYPE, a, b, SPEC_VAR)			\
+if (swaptype == 0) {								\
+		long t = *(long *)(void *)(a);				\
+		*(long *)(void *)(a) = *(long *)(void *)(b);\
+		*(long *)(void *)(b) = t;					\
+	} else											\
+		TYPE##SPEC_VAR##swapfunc(a, b, es, swaptype)	
+
+/* 
+ * This macro manufactures a type-specific implementation of qsort_arg with 
+ * the comparator, COMPAR, known at compile time. COMPAR is typically an
+ * inline function.
+ *
+ * COMPAR should take as its arguments two Datums, and return an int, in
+ * line with standard qsort convention.
+ * 
+ * We have void* parameters for TYPE##AppSort just to shut up the compiler. 
+ * They could be SortTuple pointers instead, but that would make it more 
+ * difficult to keep template_qsort_arg.h consistent with tuplesort.c. 		
+ *
+ */
+
+#define DO_TEMPLATE_QSORT_ARG(TYPE, COMPAR, SPEC_VAR, REG_ADDITIONAL_CODE)			\
+void TYPE##SPEC_VAR##qsort_arg(void *a, size_t n, size_t es, void *arg);			\
+																					\
+static inline int32																	\
+TYPE##SPEC_VAR##AppSort(const void *a, const void *b, Tuplesortstate *state)		\
+{																					\
+	int32		compare;															\
+	const SortTuple* aT = a;														\
+	const SortTuple* bT = b;														\
+																					\
+	/* Allow interrupting long sorts */												\
+	CHECK_FOR_INTERRUPTS();															\
+																					\
+	Assert(state->scanKeys);														\
+																					\
+	if ( aT->isnull1)																\
+	{																				\
+		if (bT->isnull1)															\
+			compare = 0;		/* NULL "=" NULL */									\
+		else if (state->scanKeys->sk_flags & SK_BT_NULLS_FIRST)						\
+			compare = -1;		/* NULL "<" NOT_NULL */								\
+		else																		\
+			compare = 1;		/* NULL ">" NOT_NULL */								\
+	}																				\
+	else if (bT->isnull1)															\
+	{																				\
+		if (state->scanKeys->sk_flags & SK_BT_NULLS_FIRST)							\
+			compare = 1;		/* NOT_NULL ">" NULL */								\
+		else																		\
+			compare = -1;		/* NOT_NULL "<" NULL */								\
+	}																				\
+	else																			\
+	{																				\
+		compare = COMPAR(aT->datum1, bT->datum1);									\
+																					\
+		if (state->scanKeys->sk_flags & SK_BT_DESC)									\
+			compare = -compare;														\
+	}																				\
+	if (compare != 0) 																\
+		return compare;																\
+	/* Additional code for variants where we have more than one scanKey */ 			\
+	REG_ADDITIONAL_CODE 															\
+	return 0;																		\
+}																					\
+																					\
+static inline void													\
+TYPE##SPEC_VAR##swapfunc(char *a, char *b, size_t n, int swaptype)	\
+{																	\
+	if (swaptype <= 1)												\
+		swapcode(long, a, b, n); 									\
+	else															\
+		swapcode(char, a, b, n);									\
+}																	\
+																	\
+static inline char *														\
+TYPE##SPEC_VAR##med3(char *a, char *b, char *c, void *arg)					\
+{																			\
+	return TYPE##SPEC_VAR##AppSort(a, b, arg) < 0 ?							\
+		(TYPE##SPEC_VAR##AppSort(b, c, arg) < 0 ? 							\
+					b : (TYPE##SPEC_VAR##AppSort(a, c, arg) < 0 ? c : a))	\
+		: (TYPE##SPEC_VAR##AppSort(b, c, arg) > 0 ? 						\
+					b : (TYPE##SPEC_VAR##AppSort(a, c, arg) < 0 ? a : c));	\
+}																			\
+																			\
+void 																	\
+TYPE##SPEC_VAR##qsort_arg(void *a, size_t n, size_t es, void *arg)		\
+{																		\
+	char	   *pa,														\
+			   *pb,														\
+			   *pc,														\
+			   *pd,														\
+			   *pl,														\
+			   *pm,														\
+			   *pn;														\
+	int			d,														\
+				r,														\
+				swaptype,												\
+				presorted;												\
+																		\
+loop:SWAPINIT(a, es); 													\
+	if (n < 7)															\
+	{																	\
+		for (pm = (char *) a + es; pm < (char *) a + n * es; pm += es)	\
+			for (pl = pm; pl > (char *) a && 							\
+					TYPE##SPEC_VAR##AppSort(pl - es, pl, arg) > 0;		\
+				 pl -= es)												\
+				thistype_swap(TYPE, pl, pl - es, SPEC_VAR);				\
+		return;															\
+	}																	\
+	presorted = 1;														\
+	for (pm = (char *) a + es; pm < (char *) a + n * es; pm += es)		\
+	{																	\
+		if (TYPE##SPEC_VAR##AppSort(pm - es, pm, arg) > 0)				\
+		{																\
+			presorted = 0;												\
+			break;														\
+		}																\
+	}																	\
+	if (presorted)														\
+		return;															\
+	pm = (char *) a + (n / 2) * es;										\
+	if (n > 7)															\
+	{																	\
+		pl = (char *) a;												\
+		pn = (char *) a + (n - 1) * es;									\
+		if (n > 40)														\
+		{																\
+			d = (n / 8) * es;											\
+			pl = TYPE##SPEC_VAR##med3(pl, pl + d, pl + 2 * d, arg);		\
+			pm = TYPE##SPEC_VAR##med3(pm - d, pm, pm + d, arg);			\
+			pn = TYPE##SPEC_VAR##med3(pn - 2 * d, pn - d, pn, arg);		\
+		}																\
+		pm = TYPE##SPEC_VAR##med3(pl, pm, pn, arg);						\
+	}																	\
+	thistype_swap(TYPE, a, pm, SPEC_VAR);								\
+	pa = pb = (char *) a + es;											\
+	pc = pd = (char *) a + (n - 1) * es;								\
+	for (;;)															\
+	{																	\
+		while (pb <= pc && 												\
+					(r = TYPE##SPEC_VAR##AppSort(pb, a, arg)) <= 0)		\
+		{																\
+			if (r == 0)													\
+			{															\
+				thistype_swap(TYPE, pa, pb, SPEC_VAR);					\
+				pa += es;												\
+			}															\
+			pb += es;													\
+		}																\
+		while (pb <= pc && 												\
+				(r = TYPE##SPEC_VAR##AppSort(pc, a, arg)) >= 0)			\
+		{																\
+			if (r == 0)													\
+			{															\
+				thistype_swap(TYPE, pc, pd, SPEC_VAR);					\
+				pd -= es;												\
+			}															\
+			pc -= es;													\
+		}																\
+		if (pb > pc)													\
+			break;														\
+		thistype_swap(TYPE, pb, pc, SPEC_VAR);							\
+		pb += es;														\
+		pc -= es;														\
+	}																	\
+	pn = (char *) a + n * es;											\
+	r = Min(pa - (char *) a, pb - pa);									\
+	thistype_vecswap(TYPE, a, pb - r, r, SPEC_VAR);						\
+	r = Min(pd - pc, pn - pd - es);										\
+	thistype_vecswap(TYPE, pb, pn - r, r, SPEC_VAR);					\
+	if ((r = pb - pa) > es)												\
+		TYPE##SPEC_VAR##qsort_arg(a, r / es, es, arg);					\
+	if ((r = pd - pc) > es)												\
+	{																	\
+		/* Iterate rather than recurse to save stack space */			\
+		a = pn - r;														\
+		n = r / es;														\
+		goto loop;														\
+	}																	\
+}
+
+/*
+ * This code becomes part of the comparator meta-function for the "reg" 
+ * specialization variant of each datatype-specific specialization.
+ *
+ * For "reg", we can handle multiple scankeys, but the function generally
+ * will not be inlined. Inlining has been found to offer additional performance 
+ * benefits beyond eliding the regular SQL function-call machinery.
+ *
+ * We do not elide the regular SQL function-call machinery in the "reg" case
+ * for the second and subsequent scanKeys. The higher the cardinality of the
+ * first scanKey column, the less frequently we'll need to fall back.
+ */ 
+
+#define REG_ADDITIONAL_CODE 														\
+{												 									\
+	ScanKey		scanKey = state->scanKeys;											\
+	HeapTupleData ltup;																\
+	HeapTupleData rtup;																\
+	TupleDesc	tupDesc;															\
+	int			nkey;																\
+																					\
+	Assert(state->nKeys > 1)														\
+	/* Compare additional sort keys */												\
+	ltup.t_len = ((MinimalTuple) aT->tuple)->t_len + MINIMAL_TUPLE_OFFSET;			\
+	ltup.t_data = (HeapTupleHeader) ((char *) aT->tuple - MINIMAL_TUPLE_OFFSET);	\
+	rtup.t_len = ((MinimalTuple) bT->tuple)->t_len + MINIMAL_TUPLE_OFFSET;			\
+	rtup.t_data = (HeapTupleHeader) ((char *) bT->tuple - MINIMAL_TUPLE_OFFSET);	\
+	tupDesc = state->tupDesc;														\
+	scanKey++;																		\
+	for (nkey = 1; nkey < state->nKeys; nkey++, scanKey++)							\
+	{																				\
+		AttrNumber	attno = scanKey->sk_attno;										\
+		Datum		datum1,															\
+					datum2;															\
+		bool		isnull1,														\
+					isnull2;														\
+																					\
+		datum1 = heap_getattr(&ltup, attno, tupDesc, &isnull1);						\
+		datum2 = heap_getattr(&rtup, attno, tupDesc, &isnull2);						\
+																					\
+		compare = inlineApplySortFunction(&scanKey->sk_func,						\
+										  scanKey->sk_flags,						\
+										  scanKey->sk_collation,					\
+										  datum1, isnull1,							\
+										  datum2, isnull2);							\
+		if (compare != 0)															\
+			return compare;															\
+	}																				\
+}
+
+/* 
+ * Manufacture inlining variant for nKeys=1 case, and non-inlining variant
+ * for nKeys > 1 case
+ */	
+#define TEMPLATE_QSORT_ARG(TYPE, COMPAR)	\
+DO_TEMPLATE_QSORT_ARG(TYPE, COMPAR, inl, ;) 	\
+DO_TEMPLATE_QSORT_ARG(TYPE, COMPAR, reg, REG_ADDITIONAL_CODE)	\
+	
diff --git a/src/port/qsort.c b/src/port/qsort.c
index 8e2c6d9..c949c8e 100644
--- a/src/port/qsort.c
+++ b/src/port/qsort.c
@@ -7,7 +7,8 @@
  *	  Remove ill-considered "swap_cnt" switch to insertion sort,
  *	  in favor of a simple check for presorted input.
  *
- *	CAUTION: if you change this file, see also qsort_arg.c
+ *	CAUTION: if you change this file, see also qsort_arg.c and 
+ *	template_qsort_arg.h
  *
  *	src/port/qsort.c
  */
-- 
1.7.7.3

results.ods
Description: application/vnd.oasis.opendocument.spreadsheet

#!/bin/env python
# tuplesort_test.py : Test the effectiveness of fast-path tuplesorting

# It's intended for this program to have 3 runs, which will be compared.

# run 1 - unaltered HEAD
# run 2 - Patch, but with inline specializations commented out
# run 3 - Patch

# We parse explain analyze output, because it's important to isolate
# query speed from any client overhead

import psycopg2
import subprocess
import json # json explains used
import time
import csv
from optparse import OptionParser

queries = [
			# Queries from dellstore database
			"select * from orderlines order by prod_id;",
			"select * from orderlines order by prod_id, quantity;" # For "not inlined" codepath, that uses direct comparator for first scanKey.
			]

# Walk explain tree. Should be sufficiently flexible for our purposes 
# (i.e not very) - assumes an in-memory sort node is parent node
def walk_tree(tree):
	for od in tree: # outer dictionary
		for k,v in od.items():
			if k == "Plan":
				for at, val in v.items():
					if at == "Actual Total Time":
						print at,": ",  val
						act_tot_time = float(val)
					elif at == "Sort Space Used":
						print at, ": ", val
						sort_space = val
					elif at == "Sort Space Type":
						assert(val == "Memory")
					elif at == "Node Type":
						assert(val == "Sort")


	return [ act_tot_time, sort_space, False ]

# store results of a given run in a dedicated csv file
def serialize_to_file(vals_dic, filename):
	wrt = csv.writer(open(filename, 'wb'), delimiter=',')

	# mark median run for runs of this 
	# query (or if there is an even number of elements, near enough)
	median_i = (len(vals_dic) + 1) / 2 - 1 

	for i, k in enumerate(vals_dic):
		wrt.writerow([k[0][0], time.ctime(k[0][1]),str(k[1][0]) + "ms", str(k[1][1]) + "kb sort",'*' if i == median_i else 'n'])

def test_main():
	parser = OptionParser(description="")
	parser.add_option('-c', '--conninfo', type=str, help="libpq-style connection info string of database to connect to. "
						"Can be omitted, in which case we get details from our environment. "
						"You'll probably want to put this in double-quotes, like this: --conninfo \"hostaddr=127.0.0.1 port=5432 dbname=postgres user=postgres\". ", default="")
	parser.add_option('-w', '--warmcache', type=int, help="Number of times to execute each statement without recording, to warm the cache.", default=3)
	parser.add_option('-r', '--runs', type=int, help="Number of times to run each query", default=5)
	parser.add_option('-d', '--description', type=str, help="Mandatory description for this run of the tool, such as 'inlining', 'non-inlining' or 'HEAD'", default=None)

	args = parser.parse_args()[0]
	conn_str = args.conninfo

	conn = psycopg2.connect(conn_str)
	cur = conn.cursor()

	warm_cache_n = args.warmcache
	run_n = args.runs

	description = args.description
	
	if description is None:
		raise SystemExit("You must specify a description for this run of tuplesort_test.py!") 


	for qry in queries:
		vals = {}
		print qry
		for i, j in enumerate(range(warm_cache_n + run_n)):
				cur.execute("explain (analyze true, costs true, format json) " + qry)
				if i < warm_cache_n:
					print "Skip recording query that just warms cache"
					continue
				for j in cur:
					expl_ana_tree = json.loads(j[0])
				vals[( qry, time.time() )] = walk_tree(expl_ana_tree) 
			
				# Sort values for reference, and to locate the median value	
				sort_vals = sorted(vals.iteritems(), key=lambda tot_time: tot_time[1][0])
		
		serialize_to_file(sort_vals, description + " " + qry + ".csv")

if __name__== "__main__":
	test_main()

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Inlining comparators as a performance optimisation

Reply via email to