On Thu, Mar 17, 2016 at 1:13 PM, Robert Haas <robertmh...@gmail.com> wrote:
> OK, I have now committed 0001

I attach a revision of the external quicksort patch and supplementary
small patches, rebased on top of the master branch.

Changes:

1. As I mentioned on the "Improve memory management for external
sorts" pgsql-committers thread, we should protect against currentRun
integer overflow. This new revision does so.

I'm not sure if that change needs to be back-patched; I just don't
want to take any risks, and see this as low cost insurance. Really low
workMem sorts are now actually fast enough that this seems like
something that could happen on a misconfigured system.

2. Add explicit constants for special run numbers that replacement
selection needs to care about in particular.

I did this because change 1 reminded me of the "currentRun vs.
SortTuple.tupindex" run numbering subtleties. The explicit use of
certain run number constants seems to better convey some tricky
details, in part by letting me add a few documenting, if obvious,
assertions. It's educational to be able to grep for these
constants (e.g., the new HEAP_RUN_NEXT constant) to jump to the parts
of the code that need to think about replacement selection. As things
were, that code relied on too much from too great a distance (arguably
this is true even in the master branch). This change in turn led to
minor wordsmithing to adjacent areas here and there, most of it
subtractive.

As an example of where this helps, ISTM that the assertion added to
the routine tuplesort_heap_insert() is now self-documenting, which
wasn't the case before.
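
To show what I mean, here are the new run number constants and the
tuplesort_heap_insert() assertion, as they appear in the attached
patch:

    /*
     * Run numbers, used during external sort operations.
     *
     * HEAP_RUN_NEXT is only used for SortTuple.tupindex, never
     * state.currentRun.
     */
    #define RUN_FIRST       0
    #define HEAP_RUN_NEXT   INT_MAX
    #define RUN_SECOND      1

    /* In tuplesort_heap_insert(): only first-run tuples are heapified */
    Assert(!checkIndex || tupleindex == RUN_FIRST);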

3. There was one very tricky consideration around an edge case that
required careful thought. This was an issue within my new function
dumpbatch(). It could previously perform what turns out to be a
superfluous selectnewtape() call when we take the dumpbatch()
"state->memtupcount == 0" early return path (see the last revision for
full details of that now-axed code path). Now, we accept that there
may on occasion be 0 tuple runs. In other words, we now never return
early from within dumpbatch().

There was previously no explanation for why it was okay to have a
superfluous selectnewtape() call. However, needing to be certain that
any newly selected destTape tape will go on to receive a run is
implied for the general case by this existing selectnewtape() code
comment:

 * This is called after finishing a run when we know another run
 * must be started.  This implements steps D3, D4 of Algorithm D.

While the previous revision was correct anyway, I tried to explain why
it was correct in comments, and soon realized that that was a really
bad idea; the rationale was excessively complicated.

Allowing 0 tuple runs in rare cases seems like the simplest solution.
After all, mergeprereadone() is expressly prepared for 0 tuple runs.
It says "ensure that we have at least one tuple, if any are to be
had". There is no reason to assume that it says this only because it
imagines that no tuples might be found *only after* the first preread
for the merge (by which I mean I don't think that only applies when a
final on-the-fly merge reloads tuples from one particular tape
following running out of tuples of the tape/run in memory).
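
The new comments at the top of dumpbatch() spell this out; abbreviated,
the relevant part is:

    /*
     * In general, short final runs are quite possible.  Rather than
     * allowing a special case where there was a superfluous
     * selectnewtape() call (i.e. a call with no subsequent run actually
     * written to destTape), we prefer to write out a 0 tuple run.
     * mergepreread()/mergeprereadone() are prepared for 0 tuple runs,
     * and will reliably mark the tape inactive for the merge.
     */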

4. I updated the function beginmerge() to acknowledge an inconsistency
for pass-by-value datum sorts, which I mentioned in passing on this
thread a few days back. The specific change:

 @@ -2610,7 +2735,12 @@ beginmerge(Tuplesortstate *state, bool finalMergeBatch)

     if (finalMergeBatch)
     {
 -       /* Free outright buffers for tape never actually allocated */
 +       /*
 +        * Free outright buffers for tape never actually allocated.  The
 +        * !state->tuples case is actually entitled to have at least this much
 +        * of a refund ahead of its final merge, but we don't go out of our way
 +        * to marginally improve that case.
 +        */
         FREEMEM(state, (state->maxTapes - activeTapes) * TAPE_BUFFER_OVERHEAD);

It's not worth worrying about this case, since the savings are small
(especially now that maxTapes is capped). But it's worth acknowledging
that the "!state->tuples" case is being "short-changed", in the new
spirit of heavily scrutinizing where memory goes in tuplesort.c.

5. I updated the "Add MemoryContextStats() calls for debugging" patch.
I now formally propose that this debugging instrumentation be
committed.

This revised debugging instrumentation patch does not have the system
report anything about the memory context just because "trace_sort =
on". Rather, it does nothing on ordinary builds, where the macro
SHOW_MEMORY_STATS will not be defined (it also respects trace_sort).
This is about the same approach seen in postgres.c's
finish_xact_command(). ISTM that we ought to provide a way of
debugging memory use within tuplesort.c, since we now know that that
could be very important. Let's not forget where the useful places to
look for problems are.
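
To spell out how this would actually be used (just a sketch; defining
the macro in pg_config_manual.h is only one way to do it, and passing
-DSHOW_MEMORY_STATS through CPPFLAGS works just as well):

    /*
     * Debugging builds only: define SHOW_MEMORY_STATS and set
     * trace_sort = on.  The new MemoryContextStats() calls then report
     * on state->sortcontext ahead of each merge step.
     */
    #define SHOW_MEMORY_STATS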

6. Based on your feedback on the batch memory patch (your commit
c27033ff), I made a stylistic change: I added similar comments to
the newly added quicksort/dumpbatch() MemoryContextReset() call, since
it has its own special considerations (a big change in the pattern of
allocations occurs after batch memory is used -- we need to be careful
about how that could impact the "bucketing by size class").
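
For reference, the comment that now accompanies that reset (from the
quicksort patch):

    /*
     * Reset tuple memory.  We've freed all of the tuples that we
     * previously allocated.  It's important to avoid fragmentation when
     * there is a stark change in allocation patterns due to the use of
     * batch memory.  Fragmentation due to AllocSetFree's bucketing by
     * size class might be particularly bad if this step wasn't taken.
     */
    MemoryContextReset(state->tuplecontext);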

Thanks
-- 
Peter Geoghegan
From b921e285ed3f22c9cab9c78c7c610fbdfee5839b Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <p...@bowt.ie>
Date: Thu, 10 Mar 2016 14:52:18 -0800
Subject: [PATCH 3/3] Add MemoryContextStats() calls for debugging

These new calls, which must be explicitly enabled, illustrate how
effective frequent MemoryContextReset() calls are in preventing palloc()
fragmentation within tuplesort.c.
---
 src/backend/utils/sort/tuplesort.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 2832c86..102f78a 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -2622,6 +2622,14 @@ mergeonerun(Tuplesortstate *state)
 	 */
 	beginmerge(state, false);
 
+#ifdef SHOW_MEMORY_STATS
+#ifdef TRACE_SORT
+	/* Print mem stats before each non-final merge to track fragmentation */
+	if (trace_sort)
+		MemoryContextStats(state->sortcontext);
+#endif
+#endif
+
 	/*
 	 * Execute merge by repeatedly extracting lowest tuple in heap, writing it
 	 * out, and replacing it with next tuple from same tape (if there is
@@ -2954,6 +2962,14 @@ mergebatch(Tuplesortstate *state, int64 spacePerTape)
 
 	state->batchUsed = true;
 	state->spacePerTape = spacePerTape;
+
+#ifdef SHOW_MEMORY_STATS
+#ifdef TRACE_SORT
+	/* Print mem stats before final merge to track fragmentation */
+	if (trace_sort)
+		MemoryContextStats(state->sortcontext);
+#endif
+#endif
 }
 
 /*
-- 
1.9.1

From da71fc6214a4906b0b1bcd6e6e96bf1d86effe50 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <p...@bowt.ie>
Date: Sun, 12 Jul 2015 13:14:01 -0700
Subject: [PATCH 2/3] Perform memory prefetching when writing memtuples

This patch is based on, but quite distinct from, a separately submitted
more general version that performs prefetching in several places [1].
This version now only performs prefetching of each "tuple proper" during
the writing of batches of tuples (an entire run, written following a
quicksort).  The case for prefetching each "tuple proper" at several
sites now seems weak due to differences in CPU microarchitecture.
However, it might still be that there is a consistent improvement
observable when writing out tuples, because that involves a particularly
tight inner loop, with relatively predictable processing to hide memory
latency behind.  A helpful generic prefetch hint may be possible for
this case, even if it proves impossible elsewhere.

This has been shown to appreciably help on both a POWER7 server
processor [2], and an Intel Mobile processor.

[1] https://commitfest.postgresql.org/6/305/
[2] CAM3SWZR5rv3+F3FOKf35=dti7otmmcdfoe2vogur0pddg3j...@mail.gmail.com
---
 config/c-compiler.m4               | 17 +++++++++++++++++
 configure                          | 31 +++++++++++++++++++++++++++++++
 configure.in                       |  1 +
 src/backend/utils/sort/tuplesort.c | 14 ++++++++++++++
 src/include/c.h                    | 14 ++++++++++++++
 src/include/pg_config.h.in         |  3 +++
 src/include/pg_config.h.win32      |  3 +++
 src/include/pg_config_manual.h     | 10 ++++++++++
 8 files changed, 93 insertions(+)

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 550d034..8be2122 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -271,6 +271,23 @@ fi])# PGAC_C_BUILTIN_UNREACHABLE
 
 
 
+# PGAC_C_BUILTIN_PREFETCH
+# -------------------------
+# Check if the C compiler understands __builtin_prefetch(),
+# and define HAVE__BUILTIN_PREFETCH if so.
+AC_DEFUN([PGAC_C_BUILTIN_PREFETCH],
+[AC_CACHE_CHECK(for __builtin_prefetch, pgac_cv__builtin_prefetch,
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([],
+[int i = 0;__builtin_prefetch(&i, 0, 3);])],
+[pgac_cv__builtin_prefetch=yes],
+[pgac_cv__builtin_prefetch=no])])
+if test x"$pgac_cv__builtin_prefetch" = xyes ; then
+AC_DEFINE(HAVE__BUILTIN_PREFETCH, 1,
+          [Define to 1 if your compiler understands __builtin_prefetch.])
+fi])# PGAC_C_BUILTIN_PREFETCH
+
+
+
 # PGAC_C_VA_ARGS
 # --------------
 # Check if the C compiler understands C99-style variadic macros,
diff --git a/configure b/configure
index a45be67..75453d2 100755
--- a/configure
+++ b/configure
@@ -11398,6 +11398,37 @@ if test x"$pgac_cv__builtin_unreachable" = xyes ; then
 $as_echo "#define HAVE__BUILTIN_UNREACHABLE 1" >>confdefs.h
 
 fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_prefetch" >&5
+$as_echo_n "checking for __builtin_prefetch... " >&6; }
+if ${pgac_cv__builtin_prefetch+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+int
+main ()
+{
+int i = 0;__builtin_prefetch(&i, 0, 3);
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv__builtin_prefetch=yes
+else
+  pgac_cv__builtin_prefetch=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__builtin_prefetch" >&5
+$as_echo "$pgac_cv__builtin_prefetch" >&6; }
+if test x"$pgac_cv__builtin_prefetch" = xyes ; then
+
+$as_echo "#define HAVE__BUILTIN_PREFETCH 1" >>confdefs.h
+
+fi
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for __VA_ARGS__" >&5
 $as_echo_n "checking for __VA_ARGS__... " >&6; }
 if ${pgac_cv__va_args+:} false; then :
diff --git a/configure.in b/configure.in
index c298926..0ee6a36 100644
--- a/configure.in
+++ b/configure.in
@@ -1344,6 +1344,7 @@ PGAC_C_BUILTIN_BSWAP32
 PGAC_C_BUILTIN_BSWAP64
 PGAC_C_BUILTIN_CONSTANT_P
 PGAC_C_BUILTIN_UNREACHABLE
+PGAC_C_BUILTIN_PREFETCH
 PGAC_C_VA_ARGS
 PGAC_STRUCT_TIMEZONE
 PGAC_UNION_SEMUN
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 27ca7b7..2832c86 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -3406,6 +3406,20 @@ dumpbatch(Tuplesortstate *state, bool alltuples)
 		WRITETUP(state, state->tp_tapenum[state->destTape],
 				 &state->memtuples[i]);
 		state->memtupcount--;
+
+		/*
+		 * Perform memory prefetch of the tuple pointed to by the SortTuple
+		 * that's two places ahead of tuple just written.  Testing shows that
+		 * this significantly boosts performance.
+		 *
+		 * Don't do this for pass-by-value datum sorts, where it will never
+		 * help.  Note that hinting a NULL address does not affect correctness,
+		 * so NULL values are not a concern here.
+		 */
+#ifdef USE_MEM_PREFETCH
+		if (state->tuples && i + 2 < memtupwrite)
+			pg_read_prefetch(state->memtuples[i + 2].tuple);
+#endif
 	}
 
 	/*
diff --git a/src/include/c.h b/src/include/c.h
index 7c57430..93b53bb 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -927,6 +927,20 @@ typedef NameData *Name;
 #define pg_unreachable() abort()
 #endif
 
+/*
+ * Prefetch support -- Support memory prefetching hints on some platforms.
+ *
+ * pg_read_prefetch() is specialized for the case where an array is accessed
+ * sequentially, and we can prefetch a pointer within the next element (or an
+ * even later element) in order to hide memory latency.  This case involves
+ * prefetching addresses with low temporal locality.  Note that it's rather
+ * difficult to get any kind of speedup with this;  any use of the intrinsic
+ * should be carefully tested.  It's okay to pass it an invalid or NULL
+ * address, although it's best avoided.
+ */
+#if defined(USE_MEM_PREFETCH)
+#define pg_read_prefetch(addr)		__builtin_prefetch((addr), 0, 0)
+#endif
 
 /* ----------------------------------------------------------------
  *				Section 8:	random stuff
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 3813226..41df40c 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -669,6 +669,9 @@
 /* Define to 1 if your compiler understands __builtin_constant_p. */
 #undef HAVE__BUILTIN_CONSTANT_P
 
+/* Define to 1 if your compiler understands __builtin_prefetch. */
+#undef HAVE__BUILTIN_PREFETCH
+
 /* Define to 1 if your compiler understands __builtin_types_compatible_p. */
 #undef HAVE__BUILTIN_TYPES_COMPATIBLE_P
 
diff --git a/src/include/pg_config.h.win32 b/src/include/pg_config.h.win32
index eba36df..3beb09a 100644
--- a/src/include/pg_config.h.win32
+++ b/src/include/pg_config.h.win32
@@ -514,6 +514,9 @@
 /* Define to 1 if your compiler understands __builtin_constant_p. */
 /* #undef HAVE__BUILTIN_CONSTANT_P */
 
+/* Define to 1 if your compiler understands __builtin_prefetch. */
+#undef HAVE__BUILTIN_PREFETCH
+
 /* Define to 1 if your compiler understands __builtin_types_compatible_p. */
 /* #undef HAVE__BUILTIN_TYPES_COMPATIBLE_P */
 
diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h
index ef89521..1dd3125 100644
--- a/src/include/pg_config_manual.h
+++ b/src/include/pg_config_manual.h
@@ -148,6 +148,16 @@
 #endif
 
 /*
+ * USE_MEM_PREFETCH controls whether Postgres will attempt to use memory
+ * prefetching.  Usually the automatic configure tests are sufficient, but
+ * it's conceivable that using prefetching is counter-productive on some
+ * platforms.  If necessary you can remove the #define here.
+ */
+#ifdef HAVE__BUILTIN_PREFETCH
+#define USE_MEM_PREFETCH
+#endif
+
+/*
  * USE_SSL code should be compiled only when compiling with an SSL
  * implementation.  (Currently, only OpenSSL is supported, but we might add
  * more implementations in the future.)
-- 
1.9.1

From 3591a7bb61b5bba8df26cac5b74ca2b954c50708 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <p...@bowt.ie>
Date: Wed, 29 Jul 2015 15:38:12 -0700
Subject: [PATCH 1/3] Quicksort when performing external sorts

The benefits of replacement selection (the current run + next run
priority queue, which tuplesort.c implements as a heap) are seen only
where incremental spilling allows the writing of one large run, avoiding
any final merge step.  It is not generally worth the cost in CPU cache
misses to use a heap, even when it can produce runs that are much larger
than with the new approach.  A simple hybrid sort-merge strategy is
therefore now usually used when tuplesort performs an external sort,
where quicksort is used to sort runs strictly bound in size by workMem.

Replacement selection can still occasionally be used for the first run,
in the hope of getting the "one run, no merge" best case for the
algorithm.  A new GUC, replacement_sort_mem, governs whether replacement
selection is still attempted for just the first run.

Commit df700e6b set merge order based on available buffer space (the
number of tapes was as high as possible while still allowing at least 32
* BLCKSZ buffer space per tape), rejecting Knuth's theoretically
justified "sweet spot" of 7 tapes (a merge order of 6 -- Knuth's
P), improving performance when the sort thereby completed in one pass.
However, it's still true that there are unlikely to be benefits from
increasing the number of tapes past 7 once the amount of data to be
sorted significantly exceeds available memory; that commit probably
mostly just improved matters where it enabled all merging to be done in
a final on-the-fly merge.

A problem with the merge order logic established by that commit is that
with large work_mem settings and data volumes, the tapes previously
wasted as much as 8% of the available memory budget; tens of thousands
of tapes could be logically allocated for a sort that will only benefit
from a few dozen.  So, as a further optimization, put a quasi-arbitrary
cap of 501 on the number of tapes that tuplesort will ever use (i.e.
cap merge order at 500 inclusive).
---
 doc/src/sgml/config.sgml                      |  51 +++
 src/backend/optimizer/path/costsize.c         |  14 +-
 src/backend/utils/init/globals.c              |   1 +
 src/backend/utils/misc/guc.c                  |  13 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/backend/utils/sort/tuplesort.c            | 495 +++++++++++++++++++++-----
 src/include/miscadmin.h                       |   1 +
 7 files changed, 475 insertions(+), 101 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 7695ec1..93650cd 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1472,6 +1472,57 @@ include_dir 'conf.d'
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-replacement-sort-mem" xreflabel="replacement_sort_mem">
+      <term><varname>replacement_sort_mem</varname> (<type>integer</type>)
+      <indexterm>
+       <primary><varname>replacement_sort_mem</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Specifies an inclusive cap on the amount of memory (as
+        specified by <varname>work_mem</varname> or
+        <varname>maintenance_work_mem</varname>) at which external
+        sort operations (including those performed by <command>CREATE
+        INDEX</> and <command>CLUSTER</>) generate their first run
+        using a <firstterm>replacement selection</firstterm> priority
+        queue.  This can be useful in memory constrained environments
+        where rows that are input into a sort operation are known to
+        have a strong correlation in logical order (their final order)
+        to physical input order.  Note that this does not include
+        cases with an <emphasis>inverse</emphasis> correlation. It is
+        possible for the replacement selection algorithm to generate
+        one long run that requires no merging, where the default
+        strategy would result in many runs that must be merged to
+        produce a final sorted output.  This may allow sort operations
+        to complete sooner.
+       </para>
+       <para>
+        The value defaults to sixteen megabytes (<literal>16MB</>).
+        Note that higher values are typically not more effective, and
+        may be counter-productive.  The default
+        <varname>maintenance_work_mem</varname> value effectively
+        prevents utility command external sorts (e.g., sorts used by
+        <command>CREATE INDEX</> when a B-Tree index is built) from
+        ever being affected, provided <varname>replacement_sort_mem</>
+        is itself set to its default value.  One strategy that may be
+        effective is to temporarily set
+        <varname>maintenance_work_mem</varname> to a lower setting,
+        equal to <varname>replacement_sort_mem</>, when a correlation
+        is expected.
+       </para>
+       <para>
+        The priority queue can be thought of as a buffer that can
+        reorder input rows, shuffling the rows into their final,
+        logical order.  Rows are incrementally output, unless and
+        until a row is input <emphasis>after</> a row with a higher
+        sort order has already been output, at which point replacement
+        selection is abandoned.  When replacement selection is
+        abandoned, the sort is completed using the default strategy.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-autovacuum-work-mem" xreflabel="autovacuum_work_mem">
       <term><varname>autovacuum_work_mem</varname> (<type>integer</type>)
       <indexterm>
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 943fcde..dffb8b2 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -1421,16 +1421,16 @@ cost_recursive_union(Path *runion, Path *nrterm, Path *rterm)
  * total, but we will also need to write and read each tuple once per
  * merge pass.  We expect about ceil(logM(r)) merge passes where r is the
  * number of initial runs formed and M is the merge order used by tuplesort.c.
- * Since the average initial run should be about twice sort_mem, we have
- *		disk traffic = 2 * relsize * ceil(logM(p / (2*sort_mem)))
+ * Since the average initial run should be about sort_mem, we have
+ *		disk traffic = 2 * relsize * ceil(logM(p / sort_mem))
  *		cpu = comparison_cost * t * log2(t)
  *
  * If the sort is bounded (i.e., only the first k result tuples are needed)
  * and k tuples can fit into sort_mem, we use a heap method that keeps only
  * k tuples in the heap; this will require about t*log2(k) tuple comparisons.
  *
- * The disk traffic is assumed to be 3/4ths sequential and 1/4th random
- * accesses (XXX can't we refine that guess?)
+ * The disk traffic is assumed to be 90% sequential and 10% random
+ * accesses.
  *
  * By default, we charge two operator evals per tuple comparison, which should
  * be in the right ballpark in most cases.  The caller can tweak this by
@@ -1498,7 +1498,7 @@ cost_sort(Path *path, PlannerInfo *root,
 		 * We'll have to use a disk-based sort of all the tuples
 		 */
 		double		npages = ceil(input_bytes / BLCKSZ);
-		double		nruns = (input_bytes / sort_mem_bytes) * 0.5;
+		double		nruns = input_bytes / sort_mem_bytes;
 		double		mergeorder = tuplesort_merge_order(sort_mem_bytes);
 		double		log_runs;
 		double		npageaccesses;
@@ -1518,9 +1518,9 @@ cost_sort(Path *path, PlannerInfo *root,
 		else
 			log_runs = 1.0;
 		npageaccesses = 2.0 * npages * log_runs;
-		/* Assume 3/4ths of accesses are sequential, 1/4th are not */
+		/* Assume 90% of accesses are sequential, 10% are not */
 		startup_cost += npageaccesses *
-			(seq_page_cost * 0.75 + random_page_cost * 0.25);
+			(seq_page_cost * 0.90 + random_page_cost * 0.10);
 	}
 	else if (tuples > 2 * output_tuples || input_bytes > sort_mem_bytes)
 	{
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 597dab4..1aa2b23 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -109,6 +109,7 @@ bool		enableFsync = true;
 bool		allowSystemTableMods = false;
 int			work_mem = 1024;
 int			maintenance_work_mem = 16384;
+int			replacement_sort_mem = 16384;
 
 /*
  * Primary determinants of sizes of shared-memory structures.
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 5a3a40e..9884f32 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -1918,6 +1918,19 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"replacement_sort_mem", PGC_USERSET, RESOURCES_MEM,
+			gettext_noop("Sets the maximum memory to be used for sorting priority queue."),
+			gettext_noop("This much memory sets the maximum size of the replacement selection "
+						 "priority queue used by external sort operations, including utility "
+						 "commands and query workspace sort operations."),
+			GUC_UNIT_KB
+		},
+		&replacement_sort_mem,
+		16384, 64, MAX_KILOBYTES,
+		NULL, NULL, NULL
+	},
+
 	/*
 	 * We use the hopefully-safely-small value of 100kB as the compiled-in
 	 * default for max_stack_depth.  InitializeGUCOptions will increase it if
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 773b4e8..60194d3 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -125,6 +125,7 @@
 # actively intend to use prepared transactions.
 #work_mem = 4MB				# min 64kB
 #maintenance_work_mem = 64MB		# min 1MB
+#replacement_sort_mem = 16MB		# min 64kB
 #autovacuum_work_mem = -1		# min 1MB, or -1 to use maintenance_work_mem
 #max_stack_depth = 2MB			# min 100kB
 #dynamic_shared_memory_type = posix	# the default is the first option
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index d033c95..27ca7b7 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -11,14 +11,16 @@
  * algorithm.
  *
  * See Knuth, volume 3, for more than you want to know about the external
- * sorting algorithm.  We divide the input into sorted runs using replacement
- * selection, in the form of a priority tree implemented as a heap
- * (essentially his Algorithm 5.2.3H), then merge the runs using polyphase
- * merge, Knuth's Algorithm 5.4.2D.  The logical "tapes" used by Algorithm D
- * are implemented by logtape.c, which avoids space wastage by recycling
- * disk space as soon as each block is read from its "tape".
+ * sorting algorithm.  Historically, we divided the input into sorted runs
+ * using replacement selection, in the form of a priority tree implemented
+ * as a heap (essentially his Algorithm 5.2.3H -- although that strategy is
+ * often avoided altogether), but that can now only happen for the first
+ * run.  We merge the runs using polyphase merge, Knuth's Algorithm
+ * 5.4.2D.  The logical "tapes" used by Algorithm D are implemented by
+ * logtape.c, which avoids space wastage by recycling disk space as soon
+ * as each block is read from its "tape".
  *
- * We do not form the initial runs using Knuth's recommended replacement
+ * We never form the initial runs using Knuth's recommended replacement
  * selection data structure (Algorithm 5.4.1R), because it uses a fixed
  * number of records in memory at all times.  Since we are dealing with
  * tuples that may vary considerably in size, we want to be able to vary
@@ -28,7 +30,30 @@
  * Algorithm 5.4.1R, each record is stored with the run number that it
  * must go into, and we use (run number, key) as the ordering key for the
  * heap.  When the run number at the top of the heap changes, we know that
- * no more records of the prior run are left in the heap.
+ * no more records of the prior run are left in the heap.  Note that there
+ * are in practice only ever two distinct run numbers, due to the greatly
+ * reduced use of replacement selection in PostgreSQL 9.6.
+ *
+ * In PostgreSQL 9.6, a heap (based on Knuth's Algorithm H, with some small
+ * customizations) is only used with the aim of producing just one run,
+ * thereby avoiding all merging.  Only the first run can use replacement
+ * selection, which is why there are now only two possible valid run
+ * numbers, and why heapification is customized to not distinguish between
+ * tuples in the second run (those will be quicksorted).  We generally
+ * prefer a simple hybrid sort-merge strategy, where runs are sorted in much
+ * the same way as the entire input of an internal sort is sorted (using
+ * qsort()).  The replacement_sort_mem GUC controls the limited remaining
+ * use of replacement selection for the first run.
+ *
+ * There are several reasons to favor a hybrid sort-merge strategy.
+ * Maintaining a priority tree/heap has poor CPU cache characteristics.
+ * Furthermore, the growth in main memory sizes has greatly diminished the
+ * value of having runs that are larger than available memory, even in the
+ * case where there is partially sorted input and runs can be made far
+ * larger by using a heap.  In most cases, a single-pass merge step is all
+ * that is required even when runs are no larger than available memory.
+ * Avoiding multiple merge passes was traditionally considered to be the
+ * major advantage of using replacement selection.
  *
  * The approximate amount of memory allowed for any one sort operation
  * is specified in kilobytes by the caller (most pass work_mem).  Initially,
@@ -36,13 +61,12 @@
  * we haven't exceeded workMem.  If we reach the end of the input without
  * exceeding workMem, we sort the array using qsort() and subsequently return
  * tuples just by scanning the tuple array sequentially.  If we do exceed
- * workMem, we construct a heap using Algorithm H and begin to emit tuples
- * into sorted runs in temporary tapes, emitting just enough tuples at each
- * step to get back within the workMem limit.  Whenever the run number at
- * the top of the heap changes, we begin a new run with a new output tape
- * (selected per Algorithm D).  After the end of the input is reached,
- * we dump out remaining tuples in memory into a final run (or two),
- * then merge the runs using Algorithm D.
+ * workMem, we begin to emit tuples into sorted runs in temporary tapes.
+ * When tuples are dumped in batch after quicksorting, we begin a new run
+ * with a new output tape (selected per Algorithm D).  After the end of the
+ * input is reached, we dump out remaining tuples in memory into a final run
+ * (or two, when replacement selection is still used), then merge the runs
+ * using Algorithm D.
  *
  * When merging runs, we use a heap containing just the frontmost tuple from
  * each source run; we repeatedly output the smallest tuple and insert the
@@ -84,7 +108,7 @@
  * code we determine the number of tapes M on the basis of workMem: we want
  * workMem/M to be large enough that we read a fair amount of data each time
  * we preread from a tape, so as to maintain the locality of access described
- * above.  Nonetheless, with large workMem we can have many tapes.
+ * above.  Nonetheless, with large workMem we can have a few hundred tapes.
  *
  *
  * Portions Copyright (c) 1996-2016, PostgreSQL Global Development Group
@@ -136,7 +160,7 @@ bool		optimize_bounded_sort = true;
 
 /*
  * The objects we actually sort are SortTuple structs.  These contain
- * a pointer to the tuple proper (might be a MinimalTuple or IndexTuple),
+ * a pointer to the tuple itself (might be a MinimalTuple or IndexTuple),
  * which is a separate palloc chunk --- we assume it is just one chunk and
  * can be freed by a simple pfree() (except during final on-the-fly merge,
  * when memory is used in batch).  SortTuples also contain the tuple's
@@ -162,15 +186,18 @@ bool		optimize_bounded_sort = true;
  * described above.  Accordingly, "tuple" is always used in preference to
  * datum1 as the authoritative value for pass-by-reference cases.
  *
- * While building initial runs, tupindex holds the tuple's run number.  During
- * merge passes, we re-use it to hold the input tape number that each tuple in
- * the heap was read from, or to hold the index of the next tuple pre-read
- * from the same tape in the case of pre-read entries.  tupindex goes unused
- * if the sort occurs entirely in memory.
+ * While building initial runs, tupindex holds the tuple's run number.
+ * Historically, the run number could meaningfully distinguish many runs, but
+ * it now only distinguishes RUN_FIRST and HEAP_RUN_NEXT, since replacement
+ * selection is always abandoned after the first run; no other run number
+ * should be represented here.  During merge passes, we re-use it to hold the
+ * input tape number that each tuple in the heap was read from, or to hold the
+ * index of the next tuple pre-read from the same tape in the case of pre-read
+ * entries.  tupindex goes unused if the sort occurs entirely in memory.
  */
 typedef struct
 {
-	void	   *tuple;			/* the tuple proper */
+	void	   *tuple;			/* the tuple itself */
 	Datum		datum1;			/* value of first key column */
 	bool		isnull1;		/* is first key column NULL? */
 	int			tupindex;		/* see notes above */
@@ -203,9 +230,19 @@ typedef enum
  * tape during a preread cycle (see discussion at top of file).
  */
 #define MINORDER		6		/* minimum merge order */
+#define MAXORDER		500		/* maximum merge order */
 #define TAPE_BUFFER_OVERHEAD		(BLCKSZ * 3)
 #define MERGE_BUFFER_SIZE			(BLCKSZ * 32)
 
+ /*
+  * Run numbers, used during external sort operations.
+  *
+  * HEAP_RUN_NEXT is only used for SortTuple.tupindex, never state.currentRun.
+  */
+#define RUN_FIRST		0
+#define HEAP_RUN_NEXT	INT_MAX
+#define RUN_SECOND		1
+
 typedef int (*SortTupleComparator) (const SortTuple *a, const SortTuple *b,
 												Tuplesortstate *state);
 
@@ -293,8 +330,16 @@ struct Tuplesortstate
 	bool		batchUsed;
 
 	/*
+	 * While building initial runs, this indicates if the replacement
+	 * selection strategy is in use.  When it isn't, then a simple hybrid
+	 * sort-merge strategy is in use instead (runs are quicksorted).
+	 */
+	bool		replaceActive;
+
+	/*
 	 * While building initial runs, this is the current output run number
-	 * (starting at 0).  Afterwards, it is the number of initial runs we made.
+	 * (starting at RUN_FIRST).  Afterwards, it is the number of initial
+	 * runs we made.
 	 */
 	int			currentRun;
 
@@ -493,6 +538,7 @@ struct Tuplesortstate
 static Tuplesortstate *tuplesort_begin_common(int workMem, bool randomAccess);
 static void puttuple_common(Tuplesortstate *state, SortTuple *tuple);
 static bool consider_abort_common(Tuplesortstate *state);
+static bool useselection(Tuplesortstate *state);
 static void inittapes(Tuplesortstate *state);
 static void selectnewtape(Tuplesortstate *state);
 static void mergeruns(Tuplesortstate *state);
@@ -508,8 +554,10 @@ static void *mergebatchalloc(Tuplesortstate *state, int tapenum, Size tuplen);
 static void mergepreread(Tuplesortstate *state);
 static void mergeprereadone(Tuplesortstate *state, int srcTape);
 static void dumptuples(Tuplesortstate *state, bool alltuples);
+static void dumpbatch(Tuplesortstate *state, bool alltuples);
 static void make_bounded_heap(Tuplesortstate *state);
 static void sort_bounded_heap(Tuplesortstate *state);
+static void tuplesort_sort_memtuples(Tuplesortstate *state);
 static void tuplesort_heap_insert(Tuplesortstate *state, SortTuple *tuple,
 					  int tupleindex, bool checkIndex);
 static void tuplesort_heap_siftup(Tuplesortstate *state, bool checkIndex);
@@ -654,7 +702,7 @@ tuplesort_begin_common(int workMem, bool randomAccess)
 	if (LACKMEM(state))
 		elog(ERROR, "insufficient memory allowed for sort");
 
-	state->currentRun = 0;
+	state->currentRun = RUN_FIRST;
 
 	/*
 	 * maxTapes, tapeRange, and Algorithm D variables will be initialized by
@@ -1566,22 +1614,61 @@ puttuple_common(Tuplesortstate *state, SortTuple *tuple)
 
 			/*
 			 * Insert the tuple into the heap, with run number currentRun if
-			 * it can go into the current run, else run number currentRun+1.
-			 * The tuple can go into the current run if it is >= the first
+			 * it can go into the current run, else HEAP_RUN_NEXT.  The tuple
+			 * can go into the current run if it is >= the first
 			 * not-yet-output tuple.  (Actually, it could go into the current
 			 * run if it is >= the most recently output tuple ... but that
 			 * would require keeping around the tuple we last output, and it's
 			 * simplest to let writetup free each tuple as soon as it's
 			 * written.)
 			 *
-			 * Note there will always be at least one tuple in the heap at
-			 * this point; see dumptuples.
+			 * Note that this only applies when:
+			 *
+			 * - currentRun is RUN_FIRST
+			 *
+			 * - Replacement selection is in use (typically it is never used).
+			 *
+			 * When these two conditions are not both true, all tuples are
+			 * appended indifferently, much like the TSS_INITIAL case.
+			 *
+			 * There should always be room to store the incoming tuple.
 			 */
-			Assert(state->memtupcount > 0);
-			if (COMPARETUP(state, tuple, &state->memtuples[0]) >= 0)
+			Assert(!state->replaceActive || state->memtupcount > 0);
+			if (state->replaceActive &&
+				COMPARETUP(state, tuple, &state->memtuples[0]) >= 0)
+			{
+				Assert(state->currentRun == RUN_FIRST);
+
+				/*
+				 * Insert tuple into first, fully heapified run.
+				 *
+				 * Unlike classic replacement selection, which this module was
+				 * previously based on, only RUN_FIRST tuples are fully
+				 * heapified.  Any second/next run tuples are appended
+				 * indifferently.  While HEAP_RUN_NEXT tuples may be sifted
+				 * out of the way of first run tuples, COMPARETUP() will never
+				 * be called for the run's tuples during sifting (only our
+				 * initial COMPARETUP() call is required for the tuple, to
+				 * determine that the tuple does not belong in RUN_FIRST).
+				 */
 				tuplesort_heap_insert(state, tuple, state->currentRun, true);
+			}
 			else
-				tuplesort_heap_insert(state, tuple, state->currentRun + 1, true);
+			{
+				/*
+				 * Tuple was determined to not belong to heapified RUN_FIRST,
+				 * or replacement selection not in play.  Append the tuple to
+				 * memtuples indifferently.
+				 *
+				 * dumptuples() does not trust that the next run's tuples are
+				 * heapified.  Anything past the first run will always be
+				 * quicksorted even when replacement selection is initially
+				 * used.  (When it's never used, every tuple still takes this
+				 * path.)
+				 */
+				tuple->tupindex = HEAP_RUN_NEXT;
+				state->memtuples[state->memtupcount++] = *tuple;
+			}
 
 			/*
 			 * If we are over the memory limit, dump tuples till we're under.
@@ -1656,20 +1743,9 @@ tuplesort_performsort(Tuplesortstate *state)
 
 			/*
 			 * We were able to accumulate all the tuples within the allowed
-			 * amount of memory.  Just qsort 'em and we're done.
+			 * amount of memory.  Just quicksort 'em and we're done.
 			 */
-			if (state->memtupcount > 1)
-			{
-				/* Can we use the single-key sort function? */
-				if (state->onlyKey != NULL)
-					qsort_ssup(state->memtuples, state->memtupcount,
-							   state->onlyKey);
-				else
-					qsort_tuple(state->memtuples,
-								state->memtupcount,
-								state->comparetup,
-								state);
-			}
+			tuplesort_sort_memtuples(state);
 			state->current = 0;
 			state->eof_reached = false;
 			state->markpos_offset = 0;
@@ -2175,13 +2251,48 @@ tuplesort_merge_order(int64 allowedMem)
 	mOrder = (allowedMem - TAPE_BUFFER_OVERHEAD) /
 		(MERGE_BUFFER_SIZE + TAPE_BUFFER_OVERHEAD);
 
-	/* Even in minimum memory, use at least a MINORDER merge */
+	/*
+	 * Even in minimum memory, use at least a MINORDER merge.  Also cap the
+	 * maximum merge order to MAXORDER.
+	 *
+	 * When allowedMem is significantly lower than what is required for an
+	 * internal sort, it is unlikely that there are benefits to increasing the
+	 * number of tapes beyond Knuth's "sweet spot" of 7.  Furthermore, in the
+	 * common case where there turns out to be less than MAXORDER initial runs,
+	 * the overhead per-tape if we were not to cap could be significant with
+	 * high allowedMem.  Significantly more tapes can be useful if they can
+	 * enable doing all merging on-the-fly, but the merge heap is rather cache
+	 * inefficient if there are too many tapes (with one run); multiple passes
+	 * seem preferable.
+	 */
 	mOrder = Max(mOrder, MINORDER);
+	mOrder = Min(mOrder, MAXORDER);
 
 	return mOrder;
 }
 
 /*
+ * useselection - determine if one replacement selection run should be
+ * attempted.
+ */
+static bool
+useselection(Tuplesortstate *state)
+{
+	int64		replacementSortMemBytes =
+		replacement_sort_mem * (int64) 1024;
+
+	/*
+	 * It can sometimes be useful to use a replacement selection heap if it
+	 * results in one large run, and if there is very little workMem.  See
+	 * remarks on RUN_SECOND optimization within mergeruns().
+	 */
+	if (state->allowedMem <= replacementSortMemBytes)
+		return true;
+
+	return false;
+}
+
+/*
  * inittapes - initialize for tape sorting.
  *
  * This is called only if we have found we don't have room to sort in memory.
@@ -2190,7 +2301,6 @@ static void
 inittapes(Tuplesortstate *state)
 {
 	int			maxTapes,
-				ntuples,
 				j;
 	int64		tapeSpace;
 
@@ -2253,25 +2363,37 @@ inittapes(Tuplesortstate *state)
 	state->tp_tapenum = (int *) palloc0(maxTapes * sizeof(int));
 
 	/*
-	 * Convert the unsorted contents of memtuples[] into a heap. Each tuple is
-	 * marked as belonging to run number zero.
-	 *
-	 * NOTE: we pass false for checkIndex since there's no point in comparing
-	 * indexes in this step, even though we do intend the indexes to be part
-	 * of the sort key...
+	 * Give replacement selection a try based on user setting.  There will
+	 * be a switch to a simple hybrid sort-merge strategy after the first
+	 * run (iff we could not output one long run).
 	 */
-	ntuples = state->memtupcount;
-	state->memtupcount = 0;		/* make the heap empty */
-	for (j = 0; j < ntuples; j++)
+	state->replaceActive = useselection(state);
+
+	if (state->replaceActive)
 	{
-		/* Must copy source tuple to avoid possible overwrite */
-		SortTuple	stup = state->memtuples[j];
+		/*
+		 * Convert the unsorted contents of memtuples[] into a heap. Each
+		 * tuple is marked as belonging to run number zero.
+		 *
+		 * NOTE: we pass false for checkIndex since there's no point in
+		 * comparing indexes in this step, even though we do intend the
+		 * indexes to be part of the sort key...
+		 */
+		int			ntuples = state->memtupcount;
 
-		tuplesort_heap_insert(state, &stup, 0, false);
+		state->memtupcount = 0;		/* make the heap empty */
+
+		for (j = 0; j < ntuples; j++)
+		{
+			/* Must copy source tuple to avoid possible overwrite */
+			SortTuple	stup = state->memtuples[j];
+
+			tuplesort_heap_insert(state, &stup, 0, false);
+		}
+		Assert(state->memtupcount == ntuples);
 	}
-	Assert(state->memtupcount == ntuples);
 
-	state->currentRun = 0;
+	state->currentRun = RUN_FIRST;
 
 	/*
 	 * Initialize variables of Algorithm D (step D1).
@@ -2362,11 +2484,12 @@ mergeruns(Tuplesortstate *state)
 
 	/*
 	 * If we produced only one initial run (quite likely if the total data
-	 * volume is between 1X and 2X workMem), we can just use that tape as the
-	 * finished output, rather than doing a useless merge.  (This obvious
-	 * optimization is not in Knuth's algorithm.)
+	 * volume is between 1X and 2X workMem when replacement selection is used,
+	 * but something we particularly count on when input is presorted), we can
+	 * just use that tape as the finished output, rather than doing a useless
+	 * merge.  (This obvious optimization is not in Knuth's algorithm.)
 	 */
-	if (state->currentRun == 1)
+	if (state->currentRun == RUN_SECOND)
 	{
 		state->result_tape = state->tp_tapenum[state->destTape];
 		/* must freeze and rewind the finished output tape */
@@ -2610,7 +2733,12 @@ beginmerge(Tuplesortstate *state, bool finalMergeBatch)
 
 	if (finalMergeBatch)
 	{
-		/* Free outright buffers for tape never actually allocated */
+		/*
+		 * Free outright buffers for tape never actually allocated.  The
+		 * !state->tuples case is actually entitled to have at least this much
+		 * of a refund ahead of its final merge, but we don't go out of our way
+		 * to marginally improve that case.
+		 */
 		FREEMEM(state, (state->maxTapes - activeTapes) * TAPE_BUFFER_OVERHEAD);
 
 		/*
@@ -3094,21 +3222,25 @@ mergeprereadone(Tuplesortstate *state, int srcTape)
 }
 
 /*
- * dumptuples - remove tuples from heap and write to tape
+ * dumptuples - remove tuples from memtuples and write to tape
  *
  * This is used during initial-run building, but not during merging.
  *
- * When alltuples = false, dump only enough tuples to get under the
- * availMem limit (and leave at least one tuple in the heap in any case,
- * since puttuple assumes it always has a tuple to compare to).  We also
- * insist there be at least one free slot in the memtuples[] array.
+ * When alltuples = false and replacement selection is still active, dump
+ * only enough tuples to get under the availMem limit (and leave at least
+ * one tuple in memtuples, since puttuple will then assume it is a heap that
+ * has a tuple to compare to).  We always insist there be at least one free
+ * slot in the memtuples[] array.
  *
- * When alltuples = true, dump everything currently in memory.
- * (This case is only used at end of input data.)
+ * When alltuples = true, dump everything currently in memory.  (This
+ * case is only used at end of input data, although in practice only the
+ * first run could fail to dump all tuples when we LACKMEM(), and only
+ * when replacement selection is active.)
  *
- * If we empty the heap, close out the current run and return (this should
- * only happen at end of input data).  If we see that the tuple run number
- * at the top of the heap has changed, start a new run.
+ * If, when replacement selection is active, we see that the tuple run
+ * number at the top of the heap has changed, start a new run.  This must be
+ * the first run, because replacement selection is always abandoned for all
+ * further runs.
  */
 static void
 dumptuples(Tuplesortstate *state, bool alltuples)
@@ -3117,47 +3249,190 @@ dumptuples(Tuplesortstate *state, bool alltuples)
 		   (LACKMEM(state) && state->memtupcount > 1) ||
 		   state->memtupcount >= state->memtupsize)
 	{
-		/*
-		 * Dump the heap's frontmost entry, and sift up to remove it from the
-		 * heap.
-		 */
-		Assert(state->memtupcount > 0);
-		WRITETUP(state, state->tp_tapenum[state->destTape],
-				 &state->memtuples[0]);
-		tuplesort_heap_siftup(state, true);
+		if (state->replaceActive)
+		{
+			/*
+			 * Still holding out for a case favorable to replacement selection.
+			 * Still incrementally spilling using heap.
+			 *
+			 * Dump the heap's frontmost entry, and sift up to remove it from
+			 * the heap.
+			 */
+			Assert(state->memtupcount > 0);
+			WRITETUP(state, state->tp_tapenum[state->destTape],
+					 &state->memtuples[0]);
+			tuplesort_heap_siftup(state, true);
+		}
+		else
+		{
+			/*
+			 * Once committed to quicksorting runs, never incrementally
+			 * spill
+			 */
+			dumpbatch(state, alltuples);
+			break;
+		}
 
 		/*
-		 * If the heap is empty *or* top run number has changed, we've
-		 * finished the current run.
+		 * If top run number has changed, we've finished the current run
+		 * (this can only be the first run), and will no longer spill
+		 * incrementally.
 		 */
 		if (state->memtupcount == 0 ||
-			state->currentRun != state->memtuples[0].tupindex)
+			state->memtuples[0].tupindex == HEAP_RUN_NEXT)
 		{
 			markrunend(state, state->tp_tapenum[state->destTape]);
+			Assert(state->currentRun == RUN_FIRST);
 			state->currentRun++;
 			state->tp_runs[state->destTape]++;
 			state->tp_dummy[state->destTape]--; /* per Alg D step D2 */
 
 #ifdef TRACE_SORT
 			if (trace_sort)
-				elog(LOG, "finished writing%s run %d to tape %d: %s",
-					 (state->memtupcount == 0) ? " final" : "",
+				elog(LOG, "finished incrementally writing %s run %d to tape %d: %s",
+					 (state->memtupcount == 0) ? "only" : "first",
 					 state->currentRun, state->destTape,
 					 pg_rusage_show(&state->ru_start));
 #endif
-
 			/*
-			 * Done if heap is empty, else prepare for new run.
+			 * Done if heap is empty, which is possible when there is only one
+			 * long run.
 			 */
+			Assert(state->currentRun == RUN_SECOND);
 			if (state->memtupcount == 0)
+			{
+				/*
+				 * Replacement selection best case; no final merge required,
+				 * because there was only one initial run (second run has no
+				 * tuples).  See RUN_SECOND case in mergeruns().
+				 */
 				break;
-			Assert(state->currentRun == state->memtuples[0].tupindex);
+			}
+
+			/*
+			 * Abandon replacement selection for second run (as well as any
+			 * subsequent runs).
+			 */
+			state->replaceActive = false;
+
+			/*
+			 * First tuple of next run should not be heapified, and so will
+			 * bear placeholder run number.  In practice this must actually be
+			 * the second run, which just became the currentRun, so we're
+			 * clear to quicksort and dump the tuples in batch next time
+			 * memtuples becomes full.
+			 */
+			Assert(state->memtuples[0].tupindex == HEAP_RUN_NEXT);
 			selectnewtape(state);
 		}
 	}
 }
 
 /*
+ * dumpbatch - sort and dump all memtuples, forming one run on tape
+ *
+ * Second or subsequent runs are never heapified by this module (although
+ * heapification still respects run number differences between the first and
+ * second runs), and a heap (replacement selection priority queue) is often
+ * avoided in the first place.
+ */
+static void
+dumpbatch(Tuplesortstate *state, bool alltuples)
+{
+	int			memtupwrite;
+	int			i;
+
+	Assert(state->status == TSS_BUILDRUNS);
+
+	/*
+	 * Final call might require no sorting, in rare cases where we just so
+	 * happen to have previously LACKMEM()'d at the point where exactly all
+	 * remaining tuples are loaded into memory, just before input was
+	 * exhausted.
+	 *
+	 * In general, short final runs are quite possible.  Rather than
+	 * allowing a special case where there was a superfluous
+	 * selectnewtape() call (i.e. a call with no subsequent run actually
+	 * written to destTape), we prefer to write out a 0 tuple run.
+	 *
+	 * mergepreread()/mergeprereadone() are prepared for 0 tuple runs, and
+	 * will reliably mark the tape inactive for the merge when called from
+	 * beginmerge().  This case is therefore similar to the case where
+	 * mergeonerun() finds a dummy run for the tape, and so doesn't need to
+	 * merge a run from the tape (or conceptually "merges" the dummy run,
+	 * if you prefer).  According to Knuth, Algorithm D "isn't strictly
+	 * optimal" in its method of distribution and dummy run assignment;
+	 * this edge case seems very unlikely to make that appreciably worse.
+	 *
+	 * When this edge case hasn't occurred, the first memtuple should not
+	 * be found to be heapified (nor should any other memtuple).
+	 */
+	Assert(state->memtupcount == 0 ||
+		   state->memtuples[0].tupindex == HEAP_RUN_NEXT);
+
+	/*
+	 * It seems unlikely that this limit will ever be exceeded, but take no
+	 * chances
+	 */
+	if (state->currentRun == INT_MAX)
+		ereport(ERROR,
+				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+				 errmsg("cannot have more than %d runs for an external sort",
+						INT_MAX)));
+
+	state->currentRun++;
+
+#ifdef TRACE_SORT
+	if (trace_sort)
+		elog(LOG, "starting quicksort of run %d: %s",
+			 state->currentRun, pg_rusage_show(&state->ru_start));
+#endif
+
+	/*
+	 * Sort all tuples accumulated within the allowed amount of memory for this
+	 * run using quicksort
+	 */
+	tuplesort_sort_memtuples(state);
+
+#ifdef TRACE_SORT
+	if (trace_sort)
+		elog(LOG, "finished quicksort of run %d: %s",
+			 state->currentRun, pg_rusage_show(&state->ru_start));
+#endif
+
+	memtupwrite = state->memtupcount;
+	for (i = 0; i < memtupwrite; i++)
+	{
+		WRITETUP(state, state->tp_tapenum[state->destTape],
+				 &state->memtuples[i]);
+		state->memtupcount--;
+	}
+
+	/*
+	 * Reset tuple memory.  We've freed all of the tuples that we previously
+	 * allocated.  It's important to avoid fragmentation when there is a stark
+	 * change in allocation patterns due to the use of batch memory.
+	 * Fragmentation due to AllocSetFree's bucketing by size class might be
+	 * particularly bad if this step wasn't taken.
+	 */
+	MemoryContextReset(state->tuplecontext);
+
+	markrunend(state, state->tp_tapenum[state->destTape]);
+	state->tp_runs[state->destTape]++;
+	state->tp_dummy[state->destTape]--; /* per Alg D step D2 */
+
+#ifdef TRACE_SORT
+	if (trace_sort)
+		elog(LOG, "finished writing run %d to tape %d: %s",
+			 state->currentRun, state->destTape,
+			 pg_rusage_show(&state->ru_start));
+#endif
+
+	if (!alltuples)
+		selectnewtape(state);
+}
+
+/*
  * tuplesort_rescan		- rewind and replay the scan
  */
 void
@@ -3315,10 +3590,15 @@ tuplesort_get_stats(Tuplesortstate *state,
  *
  * Compare two SortTuples.  If checkIndex is true, use the tuple index
  * as the front of the sort key; otherwise, no.
+ *
+ * Note that for checkIndex callers, the heap invariant is never
+ * maintained beyond the first run, and so there are no COMPARETUP()
+ * calls needed to distinguish tuples in HEAP_RUN_NEXT.
  */
 
 #define HEAPCOMPARE(tup1,tup2) \
-	(checkIndex && ((tup1)->tupindex != (tup2)->tupindex) ? \
+	(checkIndex && ((tup1)->tupindex != (tup2)->tupindex || \
+					(tup1)->tupindex == HEAP_RUN_NEXT) ? \
 	 ((tup1)->tupindex) - ((tup2)->tupindex) : \
 	 COMPARETUP(state, tup1, tup2))
 
@@ -3417,6 +3697,31 @@ sort_bounded_heap(Tuplesortstate *state)
 }
 
 /*
+ * Sort all memtuples using specialized qsort() routines.
+ *
+ * Quicksort is used for small in-memory sorts.  Quicksort is also generally
+ * preferred to replacement selection for generating runs during external sort
+ * operations, although replacement selection is sometimes used for the first
+ * run.
+ */
+static void
+tuplesort_sort_memtuples(Tuplesortstate *state)
+{
+	if (state->memtupcount > 1)
+	{
+		/* Can we use the single-key sort function? */
+		if (state->onlyKey != NULL)
+			qsort_ssup(state->memtuples, state->memtupcount,
+					   state->onlyKey);
+		else
+			qsort_tuple(state->memtuples,
+						state->memtupcount,
+						state->comparetup,
+						state);
+	}
+}
+
+/*
  * Insert a new tuple into an empty or existing heap, maintaining the
  * heap invariant.  Caller is responsible for ensuring there's room.
  *
@@ -3443,6 +3748,7 @@ tuplesort_heap_insert(Tuplesortstate *state, SortTuple *tuple,
 
 	memtuples = state->memtuples;
 	Assert(state->memtupcount < state->memtupsize);
+	Assert(!checkIndex || tupleindex == RUN_FIRST);
 
 	CHECK_FOR_INTERRUPTS();
 
@@ -3475,6 +3781,7 @@ tuplesort_heap_siftup(Tuplesortstate *state, bool checkIndex)
 	int			i,
 				n;
 
+	Assert(!checkIndex || state->currentRun == RUN_FIRST);
 	if (--state->memtupcount <= 0)
 		return;
 
@@ -3760,7 +4067,7 @@ readtup_heap(Tuplesortstate *state, SortTuple *stup,
 	char	   *tupbody = (char *) tuple + MINIMAL_TUPLE_DATA_OFFSET;
 	HeapTupleData htup;
 
-	/* read in the tuple proper */
+	/* read in the tuple itself */
 	tuple->t_len = tuplen;
 	LogicalTapeReadExact(state->tapeset, tapenum,
 						 tupbody, tupbodylen);
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 9200f04..0d90435 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -239,6 +239,7 @@ extern bool enableFsync;
 extern bool allowSystemTableMods;
 extern PGDLLIMPORT int work_mem;
 extern PGDLLIMPORT int maintenance_work_mem;
+extern PGDLLIMPORT int replacement_sort_mem;
 
 extern int	VacuumCostPageHit;
 extern int	VacuumCostPageMiss;
-- 
1.9.1

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
