On Fri, Dec 18, 2015 at 11:57 AM, Peter Geoghegan <p...@heroku.com> wrote:
> BTW, I'm not necessarily determined to make the new special-purpose
> allocator work exactly as proposed. It seemed useful to prioritize
> simplicity, so currently there is one big "huge palloc()" with
> which we blow our memory budget, and that's it. However, I could
> probably be more clever about "freeing ranges" initially preserved for
> a now-exhausted tape. That kind of thing.

Attached is a revision that significantly overhauls the memory patch,
with several smaller changes.

We can now grow memtuples to rebalance the size of the array
(memtupsize) against the memory needed for the tuples themselves.
Doing this makes a big difference with a 500MB work_mem setting in
this datum sort case, as my newly expanded trace_sort instrumentation shows:

LOG:  grew memtuples 1.40x from 9362286 (219429 KB) to 13107200
(307200 KB) for final merge
LOG:  tape 0 initially used 34110 KB of 34110 KB batch (1.000) and
13107200 slots remaining
LOG:  tape 1 initially used 34110 KB of 34110 KB batch (1.000) and has
1534 slots remaining
LOG:  tape 2 initially used 34110 KB of 34110 KB batch (1.000) and has
1535 slots remaining
LOG:  tape 3 initially used 34110 KB of 34110 KB batch (1.000) and has
1533 slots remaining
LOG:  tape 4 initially used 34110 KB of 34110 KB batch (1.000) and has
1534 slots remaining
LOG:  tape 5 initially used 34110 KB of 34110 KB batch (1.000) and has
1535 slots remaining

This is a big improvement. With the new batchmemtuples() call
commented out (i.e. no new grow_memtuples() call), the LOG output
around the same point is:

LOG:  tape 0 initially used 24381 KB of 48738 KB batch (0.500) and has
1 slots remaining
LOG:  tape 1 initially used 24381 KB of 48738 KB batch (0.500) and has
1 slots remaining
LOG:  tape 2 initially used 24381 KB of 48738 KB batch (0.500) and has
1 slots remaining
LOG:  tape 3 initially used 24381 KB of 48738 KB batch (0.500) and has
1 slots remaining
LOG:  tape 4 initially used 24381 KB of 48738 KB batch (0.500) and has
1 slots remaining
LOG:  tape 5 initially used 24381 KB of 48738 KB batch (0.500) and has
1 slots remaining

(I actually added a bit more detail to what you see here during final clean-up)

Obviously we're using memory a lot more efficiently here than in my
last revision (or than the master branch, which always has palloc()
overhead, of course). Without the new grow_memtuples() call we don't
leave ~1530 slots per tape unused (a tiny fraction of 1% of the
total), but we do waste 50% of all batch memory, or almost 30% of all
work_mem.
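
(Arithmetic on the second trace: each tape leaves 48738 - 24381 =
24357 KB of its batch unused, or about 146142 KB across the 6 tapes --
which is where "almost 30%" of a 512000 KB work_mem budget comes
from.)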

Note that this improvement is possible despite the fact that memory is
still MAXALIGN()'d -- I'm mostly just clawing back what I can, having
avoided much STANDARDCHUNKHEADERSIZE overhead for the final on-the-fly
merge. I tend to think that the bigger problem here, though, is that
we use so many memtuples when merging in the first place (e.g. 60% of
work_mem in the above case), because memtuples are much less useful
than something like a simple array of pointers when merging; I can
certainly see why you'd need 6 memtuples here, for the merge heap, but
the other ~13 million seem mostly unnecessary. Anyway, what I have now
is as far as I want to go to accelerate merging for 9.6, since
parallel CREATE INDEX is where the next big win will come from. As
wasteful as this can be, I think it's of secondary importance.
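
(To quantify that: the 307200 KB shown above for 13107200 slots works
out to 24 bytes per SortTuple, versus the 8 bytes per tuple that a
bare pointer array would need on a 64-bit platform.)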

With this revision, I've given up on the idea of trying to map
USEMEM()/FREEMEM() to "logical" allocations and deallocations that
consume from each tape's batch. The existing merge code in the master
branch is concerned exclusively with making each tape's use of memory
fair; each tape only gets so many "slots" (memtuples), and so much
memory, and that's it (there is never any shuffling of those resource
budgets between tapes). I get the same outcome by simply allowing
tapes to get memory only from their own batch allocation, which adds
little complexity, because only READTUP() routines regularly need
memory. We detect when memory has been exhausted within
mergeprereadone() in a special way, not using LACKMEM() at all -- this
seems simpler. (Specifically, we use something called overflow
allocations for this purpose. This means that there are still a very
limited number of retail palloc() calls.)
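
To make that concrete, here is a condensed sketch of the allocation
path (the real code is mergebatchalloc() and mergeprereadone() in
patch 2/3; this just abbreviates the state->merge* variable names):

    /* READTUP() asks for tuplen bytes on behalf of one input tape */
    if (mergecurrent[tape] + MAXALIGN(tuplen) <=
        mergetuples[tape] + spacePerTape)
    {
        /* usual case: carve next chunk from tape's own batch buffer */
        ret = mergecurrent[tape];
        mergecurrent[tape] += MAXALIGN(tuplen);
    }
    else
    {
        /*
         * Tape's batch is used up: fall back on a retail palloc(),
         * remembered as the tape's "overflow" allocation.  The preread
         * loop in mergeprereadone() stops for the tape as soon as it
         * sees an overflow allocation, which plays the role that
         * LACKMEM() plays in the non-batch case.
         */
        ret = mergeoverflow[tape] = palloc(tuplen);
    }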

This new version somewhat formalizes the idea that batch allocation
may one day have uses beyond the final on-the-fly merge phase, which
makes a lot of sense. We should really be saving a significant amount
of memory when initially sorting runs, too. This revision also
pfree()s tape memory early if the tape is exhausted early, which will
help a lot when there is a logical/physical correlation.

Overall, I'm far happier with how memory is managed in this revision,
mostly because it's easier to reason about. trace_sort now closely
monitors where memory goes, and I think that's a good idea in general.
That makes production performance problems a lot easier to reason
about -- the accounting should be available to expert users (those who
enable trace_sort). I'll have little sympathy for the suggestion that
this will overwhelm users, because trace_sort is already only suitable
for experts. Besides, it isn't that complicated to figure this stuff
out, or at least gain an intuition for what might be going on based on
differences seen in a problematic case. Getting a better picture of
what "bad" looks like can guide an investigation without the DBA
necessarily understanding the underlying algorithms. At worst, it
gives them something specific to complain about here.

Other changes:

* No longer use "tuple proper" terminology. Also, memory pools are now
referred to as batch memory allocations. This is at the request of
Jeff and Robert.

* Fixed a silly bug in the useselection() cost model that caused
"quicksort with spillover" to never be used. The cost model is otherwise
unchanged, because I didn't come up with any bright ideas about how to
do better there. Ideas from other people are very much welcome.

* Cap the maximum number of tapes at 500. I think it's silly that the
number of tapes is currently a function of work_mem, without any
further consideration of the details of the sort, but capping is a
simpler solution than making tuplesort_merge_order() smarter. I
previously saw quite a lot of waste with high work_mem settings: tens
of thousands of tapes that would never be used, precisely because we
have lots of memory (the justification for having, say, 40k tapes
seems almost an oxymoron). Tapes (or the accounting for
never-allocated tapes) could take almost 10% of all memory. Also, less
importantly, we now refund/FREEMEM() unallocated tape memory ahead of
final on-the-fly merge preallocation of batch memory.
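
(For illustration only -- the names and placement here are mine, not
necessarily the patch's -- the cap boils down to something like this
in tuplesort_merge_order():

    #define MAXORDER        500     /* quasi-arbitrary cap */

    mOrder = (allowedMem - TAPE_BUFFER_OVERHEAD) /
        (MERGE_BUFFER_SIZE + TAPE_BUFFER_OVERHEAD);
    mOrder = Max(mOrder, MINORDER);
    mOrder = Min(mOrder, MAXORDER);

i.e. the existing buffer-space heuristic still applies; it just can no
longer produce an absurdly large number of tapes.)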

Note that we contemplated bounding the number of tapes in the past
several times. See the commit message of c65ab0bfa9, a commit from
almost a decade ago, for an example of this. That message also
describes how "slots" (memtuples) and memory for tuples must be kept
in balance while merging, which is very much relevant to my new
grow_memtuples() call.

-- 
Peter Geoghegan
From 109991cea7d33436fa46c8e9c0fac26bb0a88ebb Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <peter.geoghega...@gmail.com>
Date: Sun, 12 Jul 2015 13:14:01 -0700
Subject: [PATCH 3/3] Perform memory prefetching when writing memtuples

This patch is based on, but is quite distinct from, a separately
submitted, more general version that performs prefetching in several
places [1].  This version only performs prefetching of each "tuple
proper" during the writing of batches of tuples (an entire run, written
following a quicksort).  The case for prefetching each "tuple proper" at
several sites now seems weak due to differences in CPU microarchitecture.
However, it might still be that there is a consistent improvement
observable when writing out tuples, because that involves a particularly
tight inner loop, with relatively predictable processing to hide memory
latency behind.  A helpful generic prefetch hint may be possible for
this case, even if it proves impossible elsewhere.

This has been shown to appreciably help on both a POWER7 server
processor [2], and an Intel Mobile processor.

[1] https://commitfest.postgresql.org/6/305/
[2] CAM3SWZR5rv3+F3FOKf35=dti7otmmcdfoe2vogur0pddg3j...@mail.gmail.com
---
 config/c-compiler.m4               | 17 +++++++++++++++++
 configure                          | 31 +++++++++++++++++++++++++++++++
 configure.in                       |  1 +
 src/backend/utils/sort/tuplesort.c | 14 ++++++++++++++
 src/include/c.h                    | 14 ++++++++++++++
 src/include/pg_config.h.in         |  3 +++
 src/include/pg_config.h.win32      |  3 +++
 src/include/pg_config_manual.h     | 10 ++++++++++
 8 files changed, 93 insertions(+)

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 550d034..8be2122 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -271,6 +271,23 @@ fi])# PGAC_C_BUILTIN_UNREACHABLE
 
 
 
+# PGAC_C_BUILTIN_PREFETCH
+# -------------------------
+# Check if the C compiler understands __builtin_prefetch(),
+# and define HAVE__BUILTIN_PREFETCH if so.
+AC_DEFUN([PGAC_C_BUILTIN_PREFETCH],
+[AC_CACHE_CHECK(for __builtin_prefetch, pgac_cv__builtin_prefetch,
+[AC_LINK_IFELSE([AC_LANG_PROGRAM([],
+[int i = 0;__builtin_prefetch(&i, 0, 3);])],
+[pgac_cv__builtin_prefetch=yes],
+[pgac_cv__builtin_prefetch=no])])
+if test x"$pgac_cv__builtin_prefetch" = xyes ; then
+AC_DEFINE(HAVE__BUILTIN_PREFETCH, 1,
+          [Define to 1 if your compiler understands __builtin_prefetch.])
+fi])# PGAC_C_BUILTIN_PREFETCH
+
+
+
 # PGAC_C_VA_ARGS
 # --------------
 # Check if the C compiler understands C99-style variadic macros,
diff --git a/configure b/configure
index 5772d0e..0a4c305 100755
--- a/configure
+++ b/configure
@@ -11338,6 +11338,37 @@ if test x"$pgac_cv__builtin_unreachable" = xyes ; then
 $as_echo "#define HAVE__BUILTIN_UNREACHABLE 1" >>confdefs.h
 
 fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_prefetch" >&5
+$as_echo_n "checking for __builtin_prefetch... " >&6; }
+if ${pgac_cv__builtin_prefetch+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+int
+main ()
+{
+int i = 0;__builtin_prefetch(&i, 0, 3);
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv__builtin_prefetch=yes
+else
+  pgac_cv__builtin_prefetch=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__builtin_prefetch" >&5
+$as_echo "$pgac_cv__builtin_prefetch" >&6; }
+if test x"$pgac_cv__builtin_prefetch" = xyes ; then
+
+$as_echo "#define HAVE__BUILTIN_PREFETCH 1" >>confdefs.h
+
+fi
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for __VA_ARGS__" >&5
 $as_echo_n "checking for __VA_ARGS__... " >&6; }
 if ${pgac_cv__va_args+:} false; then :
diff --git a/configure.in b/configure.in
index 44f832f1..339c3e6 100644
--- a/configure.in
+++ b/configure.in
@@ -1320,6 +1320,7 @@ PGAC_C_BUILTIN_BSWAP32
 PGAC_C_BUILTIN_BSWAP64
 PGAC_C_BUILTIN_CONSTANT_P
 PGAC_C_BUILTIN_UNREACHABLE
+PGAC_C_BUILTIN_PREFETCH
 PGAC_C_VA_ARGS
 PGAC_STRUCT_TIMEZONE
 PGAC_UNION_SEMUN
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 15253ab..85268c2 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -3531,6 +3531,20 @@ dumpbatch(Tuplesortstate *state, bool alltuples)
 		WRITETUP(state, state->tp_tapenum[state->destTape],
 				 &state->memtuples[i]);
 		state->memtupcount--;
+
+		/*
+		 * Perform memory prefetch of the tuple pointed to by the SortTuple
+		 * that's two places ahead of tuple just written.  Testing shows that
+		 * this significantly boosts performance.
+		 *
+		 * Don't do this for pass-by-value datum sorts, where it will never
+		 * help.  Note that hinting a NULL address does not affect correctness,
+		 * so NULL values are not a concern here.
+		 */
+#ifdef USE_MEM_PREFETCH
+		if (state->tuples && i + 2 < memtupwrite)
+			pg_read_prefetch(state->memtuples[i + 2].tuple);
+#endif
 	}
 	markrunend(state, state->tp_tapenum[state->destTape]);
 	state->tp_runs[state->destTape]++;
diff --git a/src/include/c.h b/src/include/c.h
index 8163b00..3d05ff8 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -927,6 +927,20 @@ typedef NameData *Name;
 #define pg_unreachable() abort()
 #endif
 
+/*
+ * Prefetch support -- Support memory prefetching hints on some platforms.
+ *
+ * pg_read_prefetch() is specialized for the case where an array is accessed
+ * sequentially, and we can prefetch a pointer within the next element (or an
+ * even later element) in order to hide memory latency.  This case involves
+ * prefetching addresses with low temporal locality.  Note that it's rather
+ * difficult to get any kind of speedup with this;  any use of the intrinsic
+ * should be carefully tested.  It's okay to pass it an invalid or NULL
+ * address, although it's best avoided.
+ */
+#if defined(USE_MEM_PREFETCH)
+#define pg_read_prefetch(addr)		__builtin_prefetch((addr), 0, 0)
+#endif
 
 /* ----------------------------------------------------------------
  *				Section 8:	random stuff
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 16a272e..0e393ba 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -669,6 +669,9 @@
 /* Define to 1 if your compiler understands __builtin_constant_p. */
 #undef HAVE__BUILTIN_CONSTANT_P
 
+/* Define to 1 if your compiler understands __builtin_prefetch. */
+#undef HAVE__BUILTIN_PREFETCH
+
 /* Define to 1 if your compiler understands __builtin_types_compatible_p. */
 #undef HAVE__BUILTIN_TYPES_COMPATIBLE_P
 
diff --git a/src/include/pg_config.h.win32 b/src/include/pg_config.h.win32
index 8566065..990c3c3 100644
--- a/src/include/pg_config.h.win32
+++ b/src/include/pg_config.h.win32
@@ -514,6 +514,9 @@
 /* Define to 1 if your compiler understands __builtin_constant_p. */
 /* #undef HAVE__BUILTIN_CONSTANT_P */
 
+/* Define to 1 if your compiler understands __builtin_prefetch. */
+#undef HAVE__BUILTIN_PREFETCH
+
 /* Define to 1 if your compiler understands __builtin_types_compatible_p. */
 /* #undef HAVE__BUILTIN_TYPES_COMPATIBLE_P */
 
diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h
index e278fa0..4c7b1d5 100644
--- a/src/include/pg_config_manual.h
+++ b/src/include/pg_config_manual.h
@@ -153,6 +153,16 @@
 #endif
 
 /*
+ * USE_MEM_PREFETCH controls whether Postgres will attempt to use memory
+ * prefetching.  Usually the automatic configure tests are sufficient, but
+ * it's conceivable that using prefetching is counter-productive on some
+ * platforms.  If necessary you can remove the #define here.
+ */
+#ifdef HAVE__BUILTIN_PREFETCH
+#define USE_MEM_PREFETCH
+#endif
+
+/*
  * USE_SSL code should be compiled only when compiling with an SSL
  * implementation.  (Currently, only OpenSSL is supported, but we might add
  * more implementations in the future.)
-- 
1.9.1

From 24b6290e4201aabdd99b5ed969ea4f75d0e680f4 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <peter.geoghega...@gmail.com>
Date: Mon, 24 Aug 2015 13:22:11 -0700
Subject: [PATCH 2/3] Allocate memory in batch for tuplesort merge

Allocate a large buffer of memory for use by READTUP() routines.  This
accelerates the final on-the-fly merge phase significantly.  As tapes
are pre-read, memory is reused from the tape's segmented portion,
virtually eliminating retail palloc() calls.

Sequentially consuming memory from the large batch allocation
accelerates merging because the natural order of all subsequent access
(access during merging proper) is sequential per tape; cache
characteristics are improved.  In addition, modern microarchitectures
and compilers may automatically perform prefetching with a sequential
access pattern.
---
 src/backend/utils/sort/tuplesort.c | 499 ++++++++++++++++++++++++++++++++++---
 1 file changed, 463 insertions(+), 36 deletions(-)

diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 4d39403..15253ab 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -158,7 +158,8 @@ bool		optimize_bounded_sort = true;
  * The objects we actually sort are SortTuple structs.  These contain
  * a pointer to the tuple itself (might be a MinimalTuple or IndexTuple),
  * which is a separate palloc chunk --- we assume it is just one chunk and
- * can be freed by a simple pfree().  SortTuples also contain the tuple's
+ * can be freed by a simple pfree() (except during final on-the-fly merge,
+ * when memory is used in batch).  SortTuples also contain the tuple's
  * first key column in Datum/nullflag format, and an index integer.
  *
  * Storing the first key column lets us save heap_getattr or index_getattr
@@ -247,6 +248,7 @@ struct Tuplesortstate
 								 * tuples to return? */
 	bool		boundUsed;		/* true if we made use of a bounded heap */
 	int			bound;			/* if bounded, the maximum number of tuples */
+	bool		tuples;			/* Can SortTuple.tuple ever be set? */
 	int64		availMem;		/* remaining memory available, in bytes */
 	int64		allowedMem;		/* total memory allowed, in bytes */
 	int			maxTapes;		/* number of tapes (Knuth's T) */
@@ -308,6 +310,15 @@ struct Tuplesortstate
 	bool		growmemtuples;	/* memtuples' growth still underway? */
 
 	/*
+	 * Memory for tuples is sometimes allocated in batch, rather than
+	 * incrementally.  This implies that incremental memory accounting has been
+	 * abandoned.  Currently, this only happens for the final on-the-fly merge
+	 * step.  Large batch allocations can store tuples (e.g. IndexTuples)
+	 * without palloc() fragmentation and other overhead.
+	 */
+	bool		batchUsed;
+
+	/*
 	 * While building initial runs, this indicates if the replacement
 	 * selection strategy is in use.  When it isn't, then a simple hybrid
 	 * sort-merge strategy is in use instead.
@@ -349,6 +360,21 @@ struct Tuplesortstate
 	int			mergefirstfree; /* first slot never used in this merge */
 
 	/*
+	 * Per-tape batch state, when final on-the-fly merge consumes memory from
+	 * just a few large allocations.
+	 *
+	 * Aside from the general benefits of performing fewer individual retail
+	 * palloc() calls, this also helps make merging more cache efficient, since
+	 * each tape's tuples must naturally be accessed sequentially (in sorted
+	 * order).
+	 */
+	int64		spacePerTape;	/* Space (memory) for tuples (not slots) */
+	char	  **mergetuples;	/* Each tape's memory allocation */
+	char	  **mergecurrent;	/* Current offset into each tape's memory */
+	char	  **mergetail;		/* Last item's start point for each tape */
+	char	  **mergeoverflow;	/* Retail palloc() "overflow" for each tape */
+
+	/*
 	 * Variables for Algorithm D.  Note that destTape is a "logical" tape
 	 * number, ie, an index into the tp_xxx[] arrays.  Be careful to keep
 	 * "logical" and "actual" tape numbers straight!
@@ -433,9 +459,8 @@ struct Tuplesortstate
 	 * tuplesort_begin_datum and used only by the DatumTuple routines.
 	 */
 	Oid			datumType;
-	/* we need typelen and byval in order to know how to copy the Datums. */
+	/* we need typelen in order to know how to copy the Datums. */
 	int			datumTypeLen;
-	bool		datumTypeByVal;
 
 	/*
 	 * Resource snapshot for time of sort start.
@@ -449,7 +474,7 @@ struct Tuplesortstate
 #define COPYTUP(state,stup,tup) ((*(state)->copytup) (state, stup, tup))
 #define WRITETUP(state,tape,stup)	((*(state)->writetup) (state, tape, stup))
 #define READTUP(state,stup,tape,len) ((*(state)->readtup) (state, stup, tape, len))
-#define LACKMEM(state)		((state)->availMem < 0)
+#define LACKMEM(state)		((state)->availMem < 0 && !(state)->batchUsed)
 #define USEMEM(state,amt)	((state)->availMem -= (amt))
 #define FREEMEM(state,amt)	((state)->availMem += (amt))
 
@@ -512,7 +537,14 @@ static void selectnewtape(Tuplesortstate *state);
 static void mergeruns(Tuplesortstate *state);
 static void mergeonerun(Tuplesortstate *state);
 static void mergememruns(Tuplesortstate *state);
-static void beginmerge(Tuplesortstate *state);
+static void beginmerge(Tuplesortstate *state, bool finalMerge);
+static void batchmemtuples(Tuplesortstate *state);
+static void mergebatch(Tuplesortstate *state, int64 spacePerTape);
+static void mergebatchone(Tuplesortstate *state, int srcTape,
+					  SortTuple *stup, bool *should_free);
+static void mergebatchfreetape(Tuplesortstate *state, int srcTape,
+							   SortTuple *rtup, bool *should_free);
+static void *mergebatchalloc(Tuplesortstate *state, int tapenum, Size tuplen);
 static void mergepreread(Tuplesortstate *state);
 static void mergeprereadone(Tuplesortstate *state, int srcTape);
 static void dumptuples(Tuplesortstate *state, bool alltuples);
@@ -526,6 +558,7 @@ static void tuplesort_heap_siftup(Tuplesortstate *state, bool checkIndex);
 static void reversedirection(Tuplesortstate *state);
 static unsigned int getlen(Tuplesortstate *state, int tapenum, bool eofOK);
 static void markrunend(Tuplesortstate *state, int tapenum);
+static void *readtup_alloc(Tuplesortstate *state, int tapenum, Size tuplen);
 static int comparetup_heap(const SortTuple *a, const SortTuple *b,
 				Tuplesortstate *state);
 static void copytup_heap(Tuplesortstate *state, SortTuple *stup, void *tup);
@@ -622,6 +655,7 @@ tuplesort_begin_common(int workMem, double rowCount, bool randomAccess)
 	state->rowCount = rowCount;
 	state->randomAccess = randomAccess;
 	state->bounded = false;
+	state->tuples = true;
 	state->boundUsed = false;
 	state->allowedMem = workMem * (int64) 1024;
 	state->availMem = state->allowedMem;
@@ -638,6 +672,7 @@ tuplesort_begin_common(int workMem, double rowCount, bool randomAccess)
 						ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1);
 
 	state->growmemtuples = true;
+	state->batchUsed = false;
 	state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
 
 	USEMEM(state, GetMemoryChunkSpace(state->memtuples));
@@ -978,7 +1013,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 	/* lookup necessary attributes of the datum type */
 	get_typlenbyval(datumType, &typlen, &typbyval);
 	state->datumTypeLen = typlen;
-	state->datumTypeByVal = typbyval;
+	state->tuples = !typbyval;
 
 	/* Prepare SortSupport data */
 	state->sortKeys = (SortSupport) palloc0(sizeof(SortSupportData));
@@ -1403,7 +1438,7 @@ tuplesort_putdatum(Tuplesortstate *state, Datum val, bool isNull)
 	 * identical to stup.tuple.
 	 */
 
-	if (isNull || state->datumTypeByVal)
+	if (isNull || !state->tuples)
 	{
 		stup.datum1 = val;
 		stup.isnull1 = isNull;
@@ -1767,6 +1802,7 @@ tuplesort_performsort(Tuplesortstate *state)
  * Internal routine to fetch the next tuple in either forward or back
  * direction into *stup.  Returns FALSE if no more tuples.
  * If *should_free is set, the caller must pfree stup.tuple when done with it.
+ * Otherwise, caller should not use tuple following next call here.
  */
 static bool
 tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
@@ -1778,6 +1814,7 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
 	{
 		case TSS_SORTEDINMEM:
 			Assert(forward || state->randomAccess);
+			Assert(!state->batchUsed);
 			*should_free = false;
 			if (forward)
 			{
@@ -1822,6 +1859,7 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
 
 		case TSS_SORTEDONTAPE:
 			Assert(forward || state->randomAccess);
+			Assert(!state->batchUsed);
 			*should_free = true;
 			if (forward)
 			{
@@ -1907,6 +1945,7 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
 
 		case TSS_MEMTAPEMERGE:
 			Assert(forward);
+			Assert(!state->batchUsed);
 			/* For now, assume tuple returned from memory */
 			*should_free = false;
 
@@ -2005,7 +2044,9 @@ just_memtuples:
 
 		case TSS_FINALMERGE:
 			Assert(forward);
-			*should_free = true;
+			Assert(state->batchUsed || !state->tuples);
+			/* For now, assume tuple is stored in tape's batch memory */
+			*should_free = false;
 
 			/*
 			 * This code should match the inner loop of mergeonerun().
@@ -2013,18 +2054,17 @@ just_memtuples:
 			if (state->memtupcount > 0)
 			{
 				int			srcTape = state->memtuples[0].tupindex;
-				Size		tuplen;
 				int			tupIndex;
 				SortTuple  *newtup;
 
+				/*
+				 * Returned tuple is still counted in our memory space most
+				 * of the time.  See mergebatchone() for discussion of why
+				 * caller may occasionally be required to free returned
+				 * tuple, and how preread memory is managed with regard to
+				 * edge cases more generally.
+				 */
 				*stup = state->memtuples[0];
-				/* returned tuple is no longer counted in our memory space */
-				if (stup->tuple)
-				{
-					tuplen = GetMemoryChunkSpace(stup->tuple);
-					state->availMem += tuplen;
-					state->mergeavailmem[srcTape] += tuplen;
-				}
 				tuplesort_heap_siftup(state, false);
 				if ((tupIndex = state->mergenext[srcTape]) == 0)
 				{
@@ -2032,15 +2072,25 @@ just_memtuples:
 					 * out of preloaded data on this tape, try to read more
 					 *
 					 * Unlike mergeonerun(), we only preload from the single
-					 * tape that's run dry.  See mergepreread() comments.
+					 * tape that's run dry, though not before preparing its
+					 * batch memory for a new round of sequential consumption.
+					 * See mergepreread() comments.
 					 */
+					if (state->batchUsed)
+						mergebatchone(state, srcTape, stup, should_free);
+
 					mergeprereadone(state, srcTape);
 
 					/*
 					 * if still no data, we've reached end of run on this tape
 					 */
 					if ((tupIndex = state->mergenext[srcTape]) == 0)
+					{
+						/* Free tape's buffer, avoiding dangling pointer */
+						if (state->batchUsed)
+							mergebatchfreetape(state, srcTape, stup, should_free);
 						return true;
+					}
 				}
 				/* pull next preread tuple from list, insert in heap */
 				newtup = &state->memtuples[tupIndex];
@@ -2096,6 +2146,8 @@ tuplesort_gettupleslot(Tuplesortstate *state, bool forward,
  * Fetch the next tuple in either forward or back direction.
  * Returns NULL if no more tuples.  If *should_free is set, the
  * caller must pfree the returned tuple when done with it.
+ * If it is not set, caller should not use tuple following next
+ * call here.
  */
 HeapTuple
 tuplesort_getheaptuple(Tuplesortstate *state, bool forward, bool *should_free)
@@ -2115,6 +2167,8 @@ tuplesort_getheaptuple(Tuplesortstate *state, bool forward, bool *should_free)
  * Fetch the next index tuple in either forward or back direction.
  * Returns NULL if no more tuples.  If *should_free is set, the
  * caller must pfree the returned tuple when done with it.
+ * If it is not set, caller should not use tuple following next
+ * call here.
  */
 IndexTuple
 tuplesort_getindextuple(Tuplesortstate *state, bool forward,
@@ -2152,7 +2206,7 @@ tuplesort_getdatum(Tuplesortstate *state, bool forward,
 		return false;
 	}
 
-	if (stup.isnull1 || state->datumTypeByVal)
+	if (stup.isnull1 || !state->tuples)
 	{
 		*val = stup.datum1;
 		*isNull = stup.isnull1;
@@ -2430,6 +2484,10 @@ inittapes(Tuplesortstate *state)
 	state->mergelast = (int *) palloc0(maxTapes * sizeof(int));
 	state->mergeavailslots = (int *) palloc0(maxTapes * sizeof(int));
 	state->mergeavailmem = (int64 *) palloc0(maxTapes * sizeof(int64));
+	state->mergetuples = (char **) palloc0(maxTapes * sizeof(char *));
+	state->mergecurrent = (char **) palloc0(maxTapes * sizeof(char *));
+	state->mergetail = (char **) palloc0(maxTapes * sizeof(char *));
+	state->mergeoverflow = (char **) palloc0(maxTapes * sizeof(char *));
 	state->tp_fib = (int *) palloc0(maxTapes * sizeof(int));
 	state->tp_runs = (int *) palloc0(maxTapes * sizeof(int));
 	state->tp_dummy = (int *) palloc0(maxTapes * sizeof(int));
@@ -2588,7 +2646,7 @@ mergeruns(Tuplesortstate *state)
 				/* Tell logtape.c we won't be writing anymore */
 				LogicalTapeSetForgetFreeSpace(state->tapeset);
 				/* Initialize for the final merge pass */
-				beginmerge(state);
+				beginmerge(state, state->tuples);
 				state->status = TSS_FINALMERGE;
 				return;
 			}
@@ -2680,7 +2738,7 @@ mergeonerun(Tuplesortstate *state)
 	 * Start the merge by loading one tuple from each active source tape into
 	 * the heap.  We can also decrease the input run/dummy run counts.
 	 */
-	beginmerge(state);
+	beginmerge(state, false);
 
 	/*
 	 * Execute merge by repeatedly extracting lowest tuple in heap, writing it
@@ -2822,9 +2880,12 @@ mergememruns(Tuplesortstate *state)
  * which tapes contain active input runs in mergeactive[].  Then, load
  * as many tuples as we can from each active input tape, and finally
  * fill the merge heap with the first tuple from each active tape.
+ *
+ * finalMergeBatch indicates if this is the beginning of a final on-the-fly
+ * merge where a batched allocation of tuple memory is required.
  */
 static void
-beginmerge(Tuplesortstate *state)
+beginmerge(Tuplesortstate *state, bool finalMergeBatch)
 {
 	int			activeTapes;
 	int			tapenum;
@@ -2862,6 +2923,18 @@ beginmerge(Tuplesortstate *state)
 	state->mergefreelist = 0;	/* nothing in the freelist */
 	state->mergefirstfree = activeTapes;		/* 1st slot avail for preread */
 
+	if (finalMergeBatch)
+	{
+		/* Free outright buffers for tape never actually allocated */
+		FREEMEM(state, (state->maxTapes - activeTapes) * TAPE_BUFFER_OVERHEAD);
+
+		/*
+		 * Grow memtuples one last time, since the palloc() overhead no longer
+		 * incurred can make a big difference
+		 */
+		batchmemtuples(state);
+	}
+
 	/*
 	 * Initialize space allocation to let each active input tape have an equal
 	 * share of preread space.
@@ -2869,7 +2942,7 @@ beginmerge(Tuplesortstate *state)
 	Assert(activeTapes > 0);
 	slotsPerTape = (state->memtupsize - state->mergefirstfree) / activeTapes;
 	Assert(slotsPerTape > 0);
-	spacePerTape = state->availMem / activeTapes;
+	spacePerTape = MAXALIGN_DOWN(state->availMem / activeTapes);
 	for (srcTape = 0; srcTape < state->maxTapes; srcTape++)
 	{
 		if (state->mergeactive[srcTape])
@@ -2880,8 +2953,18 @@ beginmerge(Tuplesortstate *state)
 	}
 
 	/*
-	 * Preread as many tuples as possible (and at least one) from each active
-	 * tape
+	 * Preallocate tuple batch memory for each tape.  This is the memory used
+	 * for tuples themselves (not SortTuples), so it's never used by
+	 * pass-by-value datum sorts.  Memory allocation is performed here at most
+	 * once per sort, just in advance of the final on-the-fly merge step.
+	 */
+	if (finalMergeBatch)
+		mergebatch(state, spacePerTape);
+
+	/*
+	 * Preread as many tuples as possible (and at least one) from each
+	 * active tape.  This will almost certainly use batch memory (it is only
+	 * avoided when batch isn't even big enough for one tuple).
 	 */
 	mergepreread(state);
 
@@ -2890,6 +2973,7 @@ beginmerge(Tuplesortstate *state)
 	{
 		int			tupIndex = state->mergenext[srcTape];
 		SortTuple  *tup;
+		int64		perTapeKB = (spacePerTape + 1023) / 1024;
 
 		if (tupIndex)
 		{
@@ -2902,11 +2986,315 @@ beginmerge(Tuplesortstate *state)
 			tup->tupindex = state->mergefreelist;
 			state->mergefreelist = tupIndex;
 			state->mergeavailslots[srcTape]++;
+
+#ifdef TRACE_SORT
+			/*
+			 * Report if batchmemtuples() had the desired effect of balancing
+			 * memtupsize against the tape batch memory allocation size.  The
+			 * first big preload should be reasonably representative of how
+			 * well each tape is in balance, so this only needs to happen here.
+			 */
+			if (trace_sort && finalMergeBatch)
+			{
+				int64		usedSpaceKB;
+				int			usedSlots;
+
+				usedSpaceKB = (state->mergecurrent[srcTape] -
+							   state->mergetuples[srcTape] + 1023) / 1024;
+				usedSlots = slotsPerTape - state->mergeavailslots[srcTape];
+
+				elog(LOG, "tape %d initially used %ld KB of %ld KB batch "
+					 "(%2.3f) and %d out of %d slots (%2.3f)", srcTape,
+					 usedSpaceKB, perTapeKB,
+					 (double) usedSpaceKB / (double) perTapeKB,
+					 usedSlots, slotsPerTape,
+					 (double) usedSlots / (double) slotsPerTape);
+			}
+#endif
 		}
 	}
 }
 
 /*
+ * batchmemtuples - grow memtuples without palloc overhead
+ *
+ * When called, availMem should be approximately the amount of memory we'd
+ * require to allocate memtupsize - memtupcount tuples (not SortTuples/slots)
+ * that were allocated with palloc() overhead, and in doing so use up all
+ * allocated slots.  However, though slots and tuple memory is in balance
+ * following the last grow_memtuples() call, that's predicated on the observed
+ * average tuple size for the "final" grow_memtuples() call, which includes
+ * palloc overhead.
+ *
+ * This will perform an actual final grow_memtuples() call without any palloc()
+ * overhead, rebalancing the use of memory between slots and tuples.
+ */
+static void
+batchmemtuples(Tuplesortstate *state)
+{
+	int64			refund;
+	int64			availMemLessRefund;
+	int				memtupsize = state->memtupsize;
+
+	/*
+	 * For simplicity, assume no memtuples are actually currently counted.
+	 */
+	Assert(state->memtupcount == 0);
+
+	/*
+	 * Refund STANDARDCHUNKHEADERSIZE per tuple, but then charge
+	 * STANDARDCHUNKHEADERSIZE per activeTape (because each tape has a
+	 * separate tuple buffer allocation).
+	 */
+	refund = memtupsize * STANDARDCHUNKHEADERSIZE;
+	availMemLessRefund = state->availMem - refund;
+
+	/*
+	 * To rebalance, temporarily have our accounting indicate that we've
+	 * allocated all memory we're allowed to, less a refund, and call
+	 * grow_memtuples() to increase the number of slots.
+	 *
+	 * In the future, this should happen after the first run is written,
+	 * when we first have to write out tuples and have some idea of how to
+	 * establish a balance between slots and tuples.  That would allow the
+	 * sorting of most runs to avoid palloc() fragmentation and overhead.
+	 * This requires appropriate support from routines like
+	 * ExecCopySlotMinimalTuple(), that currently insist on doing a
+	 * palloc() of their own.
+	 */
+	state->growmemtuples = true;
+	USEMEM(state, availMemLessRefund);
+	(void) grow_memtuples(state);
+	FREEMEM(state, availMemLessRefund);
+	/* Should not matter, but be tidy */
+	state->growmemtuples = false;
+
+#ifdef TRACE_SORT
+	if (trace_sort)
+	{
+		Size	OldKb = (memtupsize * sizeof(SortTuple) + 1023) / 1024;
+		Size	NewKb = (state->memtupsize * sizeof(SortTuple) + 1023) / 1024;
+
+		elog(LOG, "grew memtuples %1.2fx from %d (%zu KB) to %d (%zu KB) for final merge",
+			 (double) NewKb / (double) OldKb,
+			 memtupsize, OldKb,
+			 state->memtupsize, NewKb);
+	}
+#endif
+}
+
+/*
+ * mergebatch - initialize tuple memory in batch
+ *
+ * This allows sequential access to sorted tuples buffered in memory from
+ * tapes/runs on disk during a final on-the-fly merge step.  Note that the
+ * memory is not used for SortTuples, but for the underlying tuples (e.g.
+ * MinimalTuples).
+ *
+ * Note that when batch memory is used, there is a simple division of space
+ * into large buffers (one per active tape).  The conventional incremental
+ * memory accounting (calling USEMEM() and FREEMEM()) is abandoned.  Instead,
+ * when each tape's memory budget is exceeded, a retail palloc() "overflow" is
+ * performed, which is then immediately detected in a way that is analogous to
+ * LACKMEM().  This keeps each tape's use of memory fair, which is always a
+ * goal.
+ */
+static void
+mergebatch(Tuplesortstate *state, int64 spacePerTape)
+{
+	int		srcTape;
+
+	Assert(state->activeTapes > 0);
+	Assert(state->tuples);
+
+	/*
+	 * For the purposes of tuplesort's memory accounting, the batch allocation
+	 * is special, and regular memory accounting through USEMEM() calls is
+	 * abandoned (see mergeprereadone()).
+	 */
+	for (srcTape = 0; srcTape < state->maxTapes; srcTape++)
+	{
+		char	   *mergetuples;
+
+		if (!state->mergeactive[srcTape])
+			continue;
+
+		/* Allocate buffer for each active tape */
+		mergetuples = MemoryContextAllocHuge(state->sortcontext, spacePerTape);
+
+		/* Initialize state for tape */
+		state->mergetuples[srcTape] = mergetuples;
+		state->mergecurrent[srcTape] = mergetuples;
+		state->mergetail[srcTape] = mergetuples;
+		state->mergeoverflow[srcTape] = NULL;
+	}
+
+	state->batchUsed = true;
+	state->spacePerTape = spacePerTape;
+}
+
+/*
+ * mergebatchone - prepare batch memory for one merge input tape
+ *
+ * This is called following the exhaustion of preread tuples for one input
+ * tape.  All that actually occurs is that the state for the source tape is
+ * reset to indicate that all memory may be reused.
+ *
+ * This routine must deal with fixing up the tuple that is about to be returned
+ * to the client, due to "overflow" allocations.
+ */
+static void
+mergebatchone(Tuplesortstate *state, int srcTape, SortTuple *rtup,
+			  bool *should_free)
+{
+	Assert(state->batchUsed);
+
+	/*
+	 * Tuple about to be returned to caller ("stup") is final preread tuple
+	 * from tape, just removed from the top of the heap.  Special steps around
+	 * memory management must be performed for that tuple, to make sure it
+	 * isn't overwritten early.
+	 */
+	if (!state->mergeoverflow[srcTape])
+	{
+		Size	tupLen;
+
+		/*
+		 * Mark tuple proper buffer range for reuse, but be careful to move
+		 * final, tail tuple to start of space for next run so that it's
+		 * available to caller when stup is returned, and remains available
+		 * at least until the next tuple is requested.
+		 */
+		tupLen = state->mergecurrent[srcTape] - state->mergetail[srcTape];
+		state->mergecurrent[srcTape] = state->mergetuples[srcTape];
+		memmove(state->mergecurrent[srcTape], state->mergetail[srcTape],
+				tupLen);
+
+		/* Make SortTuple at top of the merge heap point to new tuple */
+		rtup->tuple = (void *) state->mergecurrent[srcTape];
+
+		state->mergetail[srcTape] = state->mergecurrent[srcTape];
+		state->mergecurrent[srcTape] += tupLen;
+	}
+	else
+	{
+		/*
+		 * Handle an "overflow" retail palloc.
+		 *
+		 * This is needed when we run out of tuple memory for the tape.
+		 */
+		state->mergecurrent[srcTape] = state->mergetuples[srcTape];
+		state->mergetail[srcTape] = state->mergetuples[srcTape];
+
+		if (rtup->tuple)
+		{
+			Assert(rtup->tuple == (void *) state->mergeoverflow[srcTape]);
+			/* Caller should free palloc'd tuple */
+			*should_free = true;
+		}
+		state->mergeoverflow[srcTape] = NULL;
+	}
+}
+
+/*
+ * mergebatchfreetape - handle final clean-up for batch memory once tape is
+ * about to become exhausted
+ *
+ * All tuples are returned from tape, but a single final tuple, *rtup, is to be
+ * passed back to caller.  Free tape's batch allocation buffer while ensuring
+ * that the final tuple is managed appropriately.
+ */
+static void
+mergebatchfreetape(Tuplesortstate *state, int srcTape, SortTuple *rtup,
+				   bool *should_free)
+{
+	Assert(state->batchUsed);
+	Assert(state->status == TSS_FINALMERGE);
+
+	/*
+	 * Tuple may or may not already be an overflow allocation from
+	 * mergebatchone()
+	 */
+	if (!*should_free && rtup->tuple)
+	{
+		/*
+		 * Final tuple still in tape's batch allocation.
+		 *
+		 * Return palloc()'d copy to caller, and have it freed in a similar
+		 * manner to overflow allocation.  Otherwise, we'd free batch memory
+		 * and pass back a pointer to garbage.
+		 */
+		Size		tupLen;
+		void	   *oldTuple = rtup->tuple;
+
+		tupLen = state->mergecurrent[srcTape] - state->mergetail[srcTape];
+		rtup->tuple = palloc(tupLen);
+		memcpy(rtup->tuple, oldTuple, tupLen);
+		*should_free = true;
+	}
+
+	/* Free spacePerTape-sized buffer */
+	pfree(state->mergetuples[srcTape]);
+}
+
+/*
+ * mergebatchalloc - allocate memory for one tuple using a batch memory
+ * "logical allocation".
+ *
+ * This is used for the final on-the-fly merge phase only.  READTUP() routines
+ * receive memory from here in place of palloc() and USEMEM() calls.
+ *
+ * Tuple tapenum is passed, ensuring each tape's tuples are stored in sorted,
+ * contiguous order (while allowing safe reuse of memory made available to
+ * each tape).  This maximizes locality of access as tuples are returned by
+ * final merge.
+ *
+ * Caller must not subsequently attempt to free memory returned here.  In
+ * general, only mergebatch* functions know about how memory returned from
+ * here should be freed, and this function's caller must ensure that batch
+ * memory management code will definitely have the opportunity to do the right
+ * thing during the final on-the-fly merge.
+ */
+static void *
+mergebatchalloc(Tuplesortstate *state, int tapenum, Size tuplen)
+{
+	Size		reserve_tuplen = MAXALIGN(tuplen);
+	char	   *ret;
+
+	/* Should overflow at most once before mergebatchone() call: */
+	Assert(state->mergeoverflow[tapenum] == NULL);
+	Assert(state->batchUsed);
+
+	/* It should be possible to use precisely spacePerTape memory at once */
+	if (state->mergecurrent[tapenum] + reserve_tuplen <=
+		state->mergetuples[tapenum] + state->spacePerTape)
+	{
+		/*
+		 * Usual case -- caller is returned pointer into its tape's buffer, and
+		 * an offset from that point is recorded as where tape has consumed up
+		 * to for current round of preloading.
+		 */
+		ret = state->mergetail[tapenum] = state->mergecurrent[tapenum];
+		state->mergecurrent[tapenum] += reserve_tuplen;
+	}
+	else
+	{
+		/*
+		 * palloc(), and record as tape's overflow allocation.  This will be
+		 * detected quickly, in a similar fashion to a LACKMEM() condition,
+		 * and should not happen again before a new round of preloading for
+		 * caller's tape.
+		 *
+		 * Sometimes, this does not happen because merging runs out of slots
+		 * before running out of memory.
+		 */
+		ret = state->mergeoverflow[tapenum] = palloc(tuplen);
+	}
+
+	return ret;
+}
+
+/*
  * mergepreread - load tuples from merge input tapes
  *
  * This routine exists to improve sequentiality of reads during a merge pass,
@@ -2927,7 +3315,9 @@ beginmerge(Tuplesortstate *state)
  * that state and so no point in scanning through all the tapes to fix one.
  * (Moreover, there may be quite a lot of inactive tapes in that state, since
  * we might have had many fewer runs than tapes.  In a regular tape-to-tape
- * merge we can expect most of the tapes to be active.)
+ * merge we can expect most of the tapes to be active.  Plus, only
+ * FINALMERGE state has to consider memory management for a batch
+ * allocation.)
  */
 static void
 mergepreread(Tuplesortstate *state)
@@ -2956,9 +3346,20 @@ mergeprereadone(Tuplesortstate *state, int srcTape)
 
 	if (!state->mergeactive[srcTape])
 		return;					/* tape's run is already exhausted */
+
+	/*
+	 * Manage per-tape availMem.  Only actually matters when batch memory not
+	 * in use.
+	 */
 	priorAvail = state->availMem;
 	state->availMem = state->mergeavailmem[srcTape];
-	while ((state->mergeavailslots[srcTape] > 0 && !LACKMEM(state)) ||
+
+	/*
+	 * When batch memory is used in the final on-the-fly merge, only the
+	 * mergeoverflow test is relevant; otherwise, only LACKMEM() is relevant.
+	 */
+	while ((state->mergeavailslots[srcTape] > 0 &&
+			state->mergeoverflow[srcTape] == NULL && !LACKMEM(state)) ||
 		   state->mergenext[srcTape] == 0)
 	{
 		/* read next tuple, if any */
@@ -3593,6 +3994,34 @@ markrunend(Tuplesortstate *state, int tapenum)
 	LogicalTapeWrite(state->tapeset, tapenum, (void *) &len, sizeof(len));
 }
 
+/*
+ * Allocate memory for tuple as part of call to READTUP() routine.
+ *
+ * In the final on-the-fly merge case, memory allocated here is logically
+ * consumed from the caller tape's batch allocation, and its reuse is
+ * encapsulated within the preread management routines.  Otherwise, callers
+ * must eventually pfree() what is returned here, and account for that with
+ * a FREEMEM() (currently this only ever needs to happen in WRITETUP()
+ * routines).
+ */
+static void *
+readtup_alloc(Tuplesortstate *state, int tapenum, Size tuplen)
+{
+	if (state->batchUsed)
+	{
+		return mergebatchalloc(state, tapenum, tuplen);
+	}
+	else
+	{
+		char	   *ret;
+
+		/* Batch allocation yet to be performed */
+		ret = palloc(tuplen);
+		USEMEM(state, GetMemoryChunkSpace(ret));
+		return ret;
+	}
+}
+
 
 /*
  * Routines specialized for HeapTuple (actually MinimalTuple) case
@@ -3765,11 +4194,10 @@ readtup_heap(Tuplesortstate *state, SortTuple *stup,
 {
 	unsigned int tupbodylen = len - sizeof(int);
 	unsigned int tuplen = tupbodylen + MINIMAL_TUPLE_DATA_OFFSET;
-	MinimalTuple tuple = (MinimalTuple) palloc(tuplen);
+	MinimalTuple tuple = (MinimalTuple) readtup_alloc(state, tapenum, tuplen);
 	char	   *tupbody = (char *) tuple + MINIMAL_TUPLE_DATA_OFFSET;
 	HeapTupleData htup;
 
-	USEMEM(state, GetMemoryChunkSpace(tuple));
 	/* read in the tuple itself */
 	tuple->t_len = tuplen;
 	LogicalTapeReadExact(state->tapeset, tapenum,
@@ -3999,9 +4427,10 @@ readtup_cluster(Tuplesortstate *state, SortTuple *stup,
 				int tapenum, unsigned int tuplen)
 {
 	unsigned int t_len = tuplen - sizeof(ItemPointerData) - sizeof(int);
-	HeapTuple	tuple = (HeapTuple) palloc(t_len + HEAPTUPLESIZE);
+	HeapTuple	tuple = (HeapTuple) readtup_alloc(state,
+												  tapenum,
+												  t_len + HEAPTUPLESIZE);
 
-	USEMEM(state, GetMemoryChunkSpace(tuple));
 	/* Reconstruct the HeapTupleData header */
 	tuple->t_data = (HeapTupleHeader) ((char *) tuple + HEAPTUPLESIZE);
 	tuple->t_len = t_len;
@@ -4301,9 +4730,8 @@ readtup_index(Tuplesortstate *state, SortTuple *stup,
 			  int tapenum, unsigned int len)
 {
 	unsigned int tuplen = len - sizeof(unsigned int);
-	IndexTuple	tuple = (IndexTuple) palloc(tuplen);
+	IndexTuple	tuple = (IndexTuple) readtup_alloc(state, tapenum, tuplen);
 
-	USEMEM(state, GetMemoryChunkSpace(tuple));
 	LogicalTapeReadExact(state->tapeset, tapenum,
 						 tuple, tuplen);
 	if (state->randomAccess)	/* need trailing length word? */
@@ -4361,7 +4789,7 @@ writetup_datum(Tuplesortstate *state, int tapenum, SortTuple *stup)
 		waddr = NULL;
 		tuplen = 0;
 	}
-	else if (state->datumTypeByVal)
+	else if (!state->tuples)
 	{
 		waddr = &stup->datum1;
 		tuplen = sizeof(Datum);
@@ -4403,7 +4831,7 @@ readtup_datum(Tuplesortstate *state, SortTuple *stup,
 		stup->isnull1 = true;
 		stup->tuple = NULL;
 	}
-	else if (state->datumTypeByVal)
+	else if (!state->tuples)
 	{
 		Assert(tuplen == sizeof(Datum));
 		LogicalTapeReadExact(state->tapeset, tapenum,
@@ -4413,14 +4841,13 @@ readtup_datum(Tuplesortstate *state, SortTuple *stup,
 	}
 	else
 	{
-		void	   *raddr = palloc(tuplen);
+		void	   *raddr = readtup_alloc(state, tapenum, tuplen);
 
 		LogicalTapeReadExact(state->tapeset, tapenum,
 							 raddr, tuplen);
 		stup->datum1 = PointerGetDatum(raddr);
 		stup->isnull1 = false;
 		stup->tuple = raddr;
-		USEMEM(state, GetMemoryChunkSpace(raddr));
 	}
 
 	if (state->randomAccess)	/* need trailing length word? */
-- 
1.9.1

From 6f22075b1998a3270d6068ebc521a5c583417e26 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <peter.geoghega...@gmail.com>
Date: Wed, 29 Jul 2015 15:38:12 -0700
Subject: [PATCH 1/3] Quicksort when performing external sorts

Add "quicksort with spillover".  This allows an external sort that has a
work_mem setting just short of one that would allow an internal sort to
perform almost as well as an internal sort, quicksorting perhaps almost
all tuples rather than performing a degenerate heapsort.  For these
marginal external sorts, most I/O can be avoided.  Sort performance is
made much more predictable as tables gradually increase in size.

The benefits of replacement selection (the currentRun + nextRun heap)
are seen only where incremental spilling rather than spilling in batches
allows a "quicksort with spillover" to ultimately write almost no tuples
out.  It is not worth the cost in CPU cache misses to use a heap, even
when it can produce runs that are much larger than with the new
approach.  "Quicksort with spillover" is the sole remaining
justification for using a heap for even the first run.

For that reason, tuplesort callers now provide a total row estimate
hint, used to determine if the use of a heap is worthwhile for the first
run (now the only remaining run where it's even possible).  The row
estimate is input into a simple ad-hoc cost model.  If "quicksort with
spillover" is unlikely to save much I/O, or if I/O is unlikely to be a
significant cost overall, it is worth avoiding any use of a heap in the
first place.

As a further optimization, put a quasi-arbitrary upper bound (500 tapes)
on the number of tapes that tuplesort uses, to avoid excessive random
I/O.  Knuth identified 7 tapes as the "sweet spot" for polyphase
merging.  Commit df700e6b sized the number of tapes only according to
available buffer space (the number of tapes was as high as possible
while still allowing at least 32 * BLCKSZ buffer space per tape),
rejecting Knuth's "sweet spot", which helped a lot when the sort thereby
completed in one pass.  However, it's still true today that there are
unlikely to be benefits from increasing the number of tapes from 7 once
the amount of data to be sorted far exceeds available memory.  More
importantly, with very high work_mem settings and even larger data
volumes, the tapes previously wasted as much as 8% of the available
memory budget;  tens of thousands of tapes could be logically allocated
for a sort that will only benefit from a few dozen.  It didn't matter
that the memory wasn't actually allocated, because tuplesort accounting
still charged for the memory for the number of tapes indicated by the
heuristic ahead of merging.
---
 src/backend/access/hash/hash.c         |   2 +-
 src/backend/access/hash/hashsort.c     |   4 +-
 src/backend/access/nbtree/nbtree.c     |  11 +-
 src/backend/access/nbtree/nbtsort.c    |  10 +-
 src/backend/catalog/index.c            |   1 +
 src/backend/commands/cluster.c         |   4 +-
 src/backend/commands/explain.c         |  13 +-
 src/backend/executor/nodeAgg.c         |  26 +-
 src/backend/executor/nodeSort.c        |   1 +
 src/backend/utils/adt/orderedsetaggs.c |  13 +-
 src/backend/utils/sort/tuplesort.c     | 784 +++++++++++++++++++++++++++------
 src/include/access/hash.h              |   3 +-
 src/include/access/nbtree.h            |   2 +-
 src/include/executor/nodeAgg.h         |   2 +
 src/include/utils/tuplesort.h          |  13 +-
 15 files changed, 739 insertions(+), 150 deletions(-)

diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 24b06a5..8f71980 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -86,7 +86,7 @@ hashbuild(PG_FUNCTION_ARGS)
 	 * one page.
 	 */
 	if (num_buckets >= (uint32) NBuffers)
-		buildstate.spool = _h_spoolinit(heap, index, num_buckets);
+		buildstate.spool = _h_spoolinit(heap, index, num_buckets, reltuples);
 	else
 		buildstate.spool = NULL;
 
diff --git a/src/backend/access/hash/hashsort.c b/src/backend/access/hash/hashsort.c
index c67c057..5c7e137 100644
--- a/src/backend/access/hash/hashsort.c
+++ b/src/backend/access/hash/hashsort.c
@@ -44,7 +44,8 @@ struct HSpool
  * create and initialize a spool structure
  */
 HSpool *
-_h_spoolinit(Relation heap, Relation index, uint32 num_buckets)
+_h_spoolinit(Relation heap, Relation index, uint32 num_buckets,
+			 double reltuples)
 {
 	HSpool	   *hspool = (HSpool *) palloc0(sizeof(HSpool));
 	uint32		hash_mask;
@@ -71,6 +72,7 @@ _h_spoolinit(Relation heap, Relation index, uint32 num_buckets)
 												   index,
 												   hash_mask,
 												   maintenance_work_mem,
+												   reltuples,
 												   false);
 
 	return hspool;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index cf4a6dc..0957e0f 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -23,6 +23,7 @@
 #include "access/xlog.h"
 #include "catalog/index.h"
 #include "commands/vacuum.h"
+#include "optimizer/plancat.h"
 #include "storage/indexfsm.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
@@ -85,7 +86,9 @@ btbuild(PG_FUNCTION_ARGS)
 	Relation	index = (Relation) PG_GETARG_POINTER(1);
 	IndexInfo  *indexInfo = (IndexInfo *) PG_GETARG_POINTER(2);
 	IndexBuildResult *result;
+	BlockNumber relpages;
 	double		reltuples;
+	double		allvisfrac;
 	BTBuildState buildstate;
 
 	buildstate.isUnique = indexInfo->ii_Unique;
@@ -100,6 +103,9 @@ btbuild(PG_FUNCTION_ARGS)
 		ResetUsage();
 #endif   /* BTREE_BUILD_STATS */
 
+	/* Estimate the number of rows currently present in the table */
+	estimate_rel_size(heap, NULL, &relpages, &reltuples, &allvisfrac);
+
 	/*
 	 * We expect to be called exactly once for any index relation. If that's
 	 * not the case, big trouble's what we have.
@@ -108,14 +114,15 @@ btbuild(PG_FUNCTION_ARGS)
 		elog(ERROR, "index \"%s\" already contains data",
 			 RelationGetRelationName(index));
 
-	buildstate.spool = _bt_spoolinit(heap, index, indexInfo->ii_Unique, false);
+	buildstate.spool = _bt_spoolinit(heap, index, indexInfo->ii_Unique, false,
+									 reltuples);
 
 	/*
 	 * If building a unique index, put dead tuples in a second spool to keep
 	 * them out of the uniqueness check.
 	 */
 	if (indexInfo->ii_Unique)
-		buildstate.spool2 = _bt_spoolinit(heap, index, false, true);
+		buildstate.spool2 = _bt_spoolinit(heap, index, false, true, reltuples);
 
 	/* do the heap scan */
 	reltuples = IndexBuildHeapScan(heap, index, indexInfo, true,
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index f95f67a..0d4a5ea 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -149,7 +149,8 @@ static void _bt_load(BTWriteState *wstate,
  * create and initialize a spool structure
  */
 BTSpool *
-_bt_spoolinit(Relation heap, Relation index, bool isunique, bool isdead)
+_bt_spoolinit(Relation heap, Relation index, bool isunique, bool isdead,
+			  double reltuples)
 {
 	BTSpool    *btspool = (BTSpool *) palloc0(sizeof(BTSpool));
 	int			btKbytes;
@@ -165,10 +166,15 @@ _bt_spoolinit(Relation heap, Relation index, bool isunique, bool isdead)
 	 * unique index actually requires two BTSpool objects.  We expect that the
 	 * second one (for dead tuples) won't get very full, so we give it only
 	 * work_mem.
+	 *
+	 * reltuples hint does not account for factors like whether or not this is
+	 * a partial index, or if this is second BTSpool object, because it seems
+	 * more conservative to estimate high.
 	 */
 	btKbytes = isdead ? work_mem : maintenance_work_mem;
 	btspool->sortstate = tuplesort_begin_index_btree(heap, index, isunique,
-													 btKbytes, false);
+													 btKbytes, reltuples,
+													 false);
 
 	return btspool;
 }
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index c10be3d..093b98a 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2843,6 +2843,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	state.tuplesort = tuplesort_begin_datum(INT8OID, Int8LessOperator,
 											InvalidOid, false,
 											maintenance_work_mem,
+											ivinfo.num_heap_tuples,
 											false);
 	state.htups = state.itups = state.tups_inserted = 0;
 
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 7ab4874..23f6459 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -891,7 +891,9 @@ copy_heap_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
 	/* Set up sorting if wanted */
 	if (use_sort)
 		tuplesort = tuplesort_begin_cluster(oldTupDesc, OldIndex,
-											maintenance_work_mem, false);
+											maintenance_work_mem,
+											OldHeap->rd_rel->reltuples,
+											false);
 	else
 		tuplesort = NULL;
 
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 12dae77..980bd45 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2080,20 +2080,27 @@ show_sort_info(SortState *sortstate, ExplainState *es)
 		const char *sortMethod;
 		const char *spaceType;
 		long		spaceUsed;
+		int			rowsSortedMem;
 
-		tuplesort_get_stats(state, &sortMethod, &spaceType, &spaceUsed);
+		tuplesort_get_stats(state, &sortMethod, &spaceType, &spaceUsed,
+							&rowsSortedMem);
 
 		if (es->format == EXPLAIN_FORMAT_TEXT)
 		{
 			appendStringInfoSpaces(es->str, es->indent * 2);
-			appendStringInfo(es->str, "Sort Method: %s  %s: %ldkB\n",
-							 sortMethod, spaceType, spaceUsed);
+			appendStringInfo(es->str,
+							 "Sort Method: %s  %s: %ldkB  Rows In Memory: %d\n",
+							 sortMethod,
+							 spaceType,
+							 spaceUsed,
+							 rowsSortedMem);
 		}
 		else
 		{
 			ExplainPropertyText("Sort Method", sortMethod, es);
 			ExplainPropertyLong("Sort Space Used", spaceUsed, es);
 			ExplainPropertyText("Sort Space Type", spaceType, es);
+			ExplainPropertyInteger("Rows In Memory", rowsSortedMem, es);
 		}
 	}
 }
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 2e36855..f580cca 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -520,6 +520,7 @@ initialize_phase(AggState *aggstate, int newphase)
 												  sortnode->collations,
 												  sortnode->nullsFirst,
 												  work_mem,
+												  sortnode->plan.plan_rows,
 												  false);
 	}
 
@@ -588,7 +589,8 @@ initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans,
 									  pertrans->sortOperators[0],
 									  pertrans->sortCollations[0],
 									  pertrans->sortNullsFirst[0],
-									  work_mem, false);
+									  work_mem, agg_input_rows(aggstate),
+									  false);
 		else
 			pertrans->sortstates[aggstate->current_set] =
 				tuplesort_begin_heap(pertrans->evaldesc,
@@ -597,7 +599,8 @@ initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans,
 									 pertrans->sortOperators,
 									 pertrans->sortCollations,
 									 pertrans->sortNullsFirst,
-									 work_mem, false);
+									 work_mem, agg_input_rows(aggstate),
+									 false);
 	}
 
 	/*
@@ -1439,6 +1442,25 @@ find_hash_columns(AggState *aggstate)
 }
 
 /*
+ * Estimate the number of rows input to the sorter.
+ *
+ * Exported for use by ordered-set aggregates.
+ */
+double
+agg_input_rows(AggState *aggstate)
+{
+	Plan	   *outerNode;
+
+	/*
+	 * Get information about the size of the relation to be sorted (it's the
+	 * "outer" subtree of this node)
+	 */
+	outerNode = outerPlanState(aggstate)->plan;
+
+	return outerNode->plan_rows;
+}
+
+/*
  * Estimate per-hash-table-entry overhead for the planner.
  *
  * Note that the estimate does not include space for pass-by-reference
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index af1dccf..e4b1104 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -89,6 +89,7 @@ ExecSort(SortState *node)
 											  plannode->collations,
 											  plannode->nullsFirst,
 											  work_mem,
+											  plannode->plan.plan_rows,
 											  node->randomAccess);
 		if (node->bounded)
 			tuplesort_set_bound(tuplesortstate, node->bound);
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
index 39ed85b..b51a945 100644
--- a/src/backend/utils/adt/orderedsetaggs.c
+++ b/src/backend/utils/adt/orderedsetaggs.c
@@ -20,6 +20,7 @@
 #include "catalog/pg_operator.h"
 #include "catalog/pg_type.h"
 #include "executor/executor.h"
+#include "executor/nodeAgg.h"
 #include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/tlist.h"
@@ -103,6 +104,7 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
 {
 	OSAPerGroupState *osastate;
 	OSAPerQueryState *qstate;
+	AggState		 *aggstate;
 	MemoryContext gcontext;
 	MemoryContext oldcontext;
 
@@ -117,8 +119,11 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
 	/*
 	 * We keep a link to the per-query state in fn_extra; if it's not there,
 	 * create it, and do the per-query setup we need.
+	 *
+	 * aggstate is used to hint the total number of input tuples to tuplesort.
 	 */
 	qstate = (OSAPerQueryState *) fcinfo->flinfo->fn_extra;
+	aggstate = (AggState *) fcinfo->context;
 	if (qstate == NULL)
 	{
 		Aggref	   *aggref;
@@ -276,13 +281,17 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
 												   qstate->sortOperators,
 												   qstate->sortCollations,
 												   qstate->sortNullsFirsts,
-												   work_mem, false);
+												   work_mem,
+												   agg_input_rows(aggstate),
+												   false);
 	else
 		osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
 													qstate->sortOperator,
 													qstate->sortCollation,
 													qstate->sortNullsFirst,
-													work_mem, false);
+													work_mem,
+													agg_input_rows(aggstate),
+													false);
 
 	osastate->number_of_rows = 0;
 
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index cf1cdcb..4d39403 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -8,17 +8,20 @@
  * if necessary).  It works efficiently for both small and large amounts
  * of data.  Small amounts are sorted in-memory using qsort().  Large
  * amounts are sorted using temporary files and a standard external sort
- * algorithm.
+ * algorithm, with an enhancement that smooths the transition from internal
+ * to external sorting.
  *
  * See Knuth, volume 3, for more than you want to know about the external
- * sorting algorithm.  We divide the input into sorted runs using replacement
- * selection, in the form of a priority tree implemented as a heap
- * (essentially his Algorithm 5.2.3H), then merge the runs using polyphase
- * merge, Knuth's Algorithm 5.4.2D.  The logical "tapes" used by Algorithm D
- * are implemented by logtape.c, which avoids space wastage by recycling
- * disk space as soon as each block is read from its "tape".
+ * sorting algorithm.  Historically, we divided the input into sorted runs
+ * using replacement selection, in the form of a priority tree implemented
+ * as a heap (essentially his Algorithm 5.2.3H -- although that strategy is
+ * often avoided altogether), but there are now numerous special case
+ * optimizations.  We still merge the runs using polyphase merge, Knuth's
+ * Algorithm 5.4.2D.  The logical "tapes" used by Algorithm D are
+ * implemented by logtape.c, which avoids space wastage by recycling disk
+ * space as soon as each block is read from its "tape".
  *
- * We do not form the initial runs using Knuth's recommended replacement
+ * We never form the initial runs using Knuth's recommended replacement
  * selection data structure (Algorithm 5.4.1R), because it uses a fixed
  * number of records in memory at all times.  Since we are dealing with
  * tuples that may vary considerably in size, we want to be able to vary
@@ -28,7 +31,25 @@
  * Algorithm 5.4.1R, each record is stored with the run number that it
  * must go into, and we use (run number, key) as the ordering key for the
  * heap.  When the run number at the top of the heap changes, we know that
- * no more records of the prior run are left in the heap.
+ * no more records of the prior run are left in the heap.  Note that there
+ * are in practice only ever two distinct run numbers, due to the greatly
+ * reduced use of replacement selection starting in PostgreSQL 9.6.
+ *
+ * Starting with PostgreSQL 9.6, a heap (based on Knuth's Algorithm H, with
+ * some small customizations) is only used when incremental spilling can
+ * enable a "quicksort with spillover" to avoid most I/O (which is why
+ * there are now only two valid run numbers possible in practice, and why
+ * heapification is customized to not distinguish between tuples in the
+ * second heap-wise run).  We prefer a simple hybrid sort-merge strategy
+ * most of the time, where runs are sorted in much the same way as the
+ * entire input of an internal sort is sorted.  There are several reasons
+ * for this.  Maintaining a priority tree/heap has poor cache
+ * characteristics.  At the same time, the growth in main memory sizes has
+ * greatly diminished the value of having runs that are larger than
+ * workMem, even in the case where there is partially sorted input and runs
+ * can be made far larger by using a heap; we'll tend to still get a
+ * single-pass merge step in the end.  Even when we don't, the additional
+ * merge passes tend to be worth it.
  *
  * The approximate amount of memory allowed for any one sort operation
  * is specified in kilobytes by the caller (most pass work_mem).  Initially,
@@ -36,13 +57,11 @@
  * we haven't exceeded workMem.  If we reach the end of the input without
  * exceeding workMem, we sort the array using qsort() and subsequently return
  * tuples just by scanning the tuple array sequentially.  If we do exceed
- * workMem, we construct a heap using Algorithm H and begin to emit tuples
- * into sorted runs in temporary tapes, emitting just enough tuples at each
- * step to get back within the workMem limit.  Whenever the run number at
- * the top of the heap changes, we begin a new run with a new output tape
- * (selected per Algorithm D).  After the end of the input is reached,
- * we dump out remaining tuples in memory into a final run (or two),
- * then merge the runs using Algorithm D.
+ * workMem, we begin to emit tuples into sorted runs in temporary tapes.
+ * When tuples are dumped in batch after quicksorting, we begin a new run
+ * with a new output tape (selected per Algorithm D).  After the end of the
+ * input is reached, we dump out remaining tuples in memory into a final run
+ * (or two), then merge the runs using Algorithm D.
  *
  * When merging runs, we use a heap containing just the frontmost tuple from
  * each source run; we repeatedly output the smallest tuple and insert the
@@ -84,7 +103,9 @@
  * code we determine the number of tapes M on the basis of workMem: we want
  * workMem/M to be large enough that we read a fair amount of data each time
  * we preread from a tape, so as to maintain the locality of access described
- * above.  Nonetheless, with large workMem we can have many tapes.
+ * above.  Nonetheless, with large workMem we can have a few hundred tapes.
+ * Some attempt is made to limit the number of tapes, so as to not pay an
+ * excessive cost in random I/O.
  *
  *
  * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
@@ -135,7 +156,7 @@ bool		optimize_bounded_sort = true;
 
 /*
  * The objects we actually sort are SortTuple structs.  These contain
- * a pointer to the tuple proper (might be a MinimalTuple or IndexTuple),
+ * a pointer to the tuple itself (might be a MinimalTuple or IndexTuple),
  * which is a separate palloc chunk --- we assume it is just one chunk and
  * can be freed by a simple pfree().  SortTuples also contain the tuple's
  * first key column in Datum/nullflag format, and an index integer.
@@ -160,15 +181,19 @@ bool		optimize_bounded_sort = true;
  * described above.  Accordingly, "tuple" is always used in preference to
  * datum1 as the authoritative value for pass-by-reference cases.
  *
- * While building initial runs, tupindex holds the tuple's run number.  During
- * merge passes, we re-use it to hold the input tape number that each tuple in
- * the heap was read from, or to hold the index of the next tuple pre-read
- * from the same tape in the case of pre-read entries.  tupindex goes unused
- * if the sort occurs entirely in memory.
+ * While building initial runs, tupindex holds the tuple's run number.
+ * Historically, the run number could meaningfully distinguish many runs, but
+ * currently it only meaningfully distinguishes the first run with any other
+ * run (in practice this is limited to the second run), since replacement
+ * selection is abandoned after the first run.  During merge passes, we re-use
+ * it to hold the input tape number that each tuple in the heap was read from,
+ * or to hold the index of the next tuple pre-read from the same tape in the
+ * case of pre-read entries.  tupindex goes unused if the sort occurs entirely
+ * in memory.
  */
 typedef struct
 {
-	void	   *tuple;			/* the tuple proper */
+	void	   *tuple;			/* the tuple itself */
 	Datum		datum1;			/* value of first key column */
 	bool		isnull1;		/* is first key column NULL? */
 	int			tupindex;		/* see notes above */
@@ -186,6 +211,7 @@ typedef enum
 	TSS_BUILDRUNS,				/* Loading tuples; writing to tape */
 	TSS_SORTEDINMEM,			/* Sort completed entirely in memory */
 	TSS_SORTEDONTAPE,			/* Sort completed, final run is on tape */
+	TSS_MEMTAPEMERGE,			/* Performing memory/tape merge on-the-fly */
 	TSS_FINALMERGE				/* Performing final merge on-the-fly */
 } TupSortStatus;
 
@@ -201,6 +227,7 @@ typedef enum
  * tape during a preread cycle (see discussion at top of file).
  */
 #define MINORDER		6		/* minimum merge order */
+#define MAXORDER		500		/* maximum merge order */
 #define TAPE_BUFFER_OVERHEAD		(BLCKSZ * 3)
 #define MERGE_BUFFER_SIZE			(BLCKSZ * 32)
 
@@ -214,6 +241,7 @@ struct Tuplesortstate
 {
 	TupSortStatus status;		/* enumerated value as shown above */
 	int			nKeys;			/* number of columns in sort key */
+	double		rowCount;		/* caller's hint of total # of rows */
 	bool		randomAccess;	/* did caller request random access? */
 	bool		bounded;		/* did caller specify a maximum number of
 								 * tuples to return? */
@@ -280,6 +308,13 @@ struct Tuplesortstate
 	bool		growmemtuples;	/* memtuples' growth still underway? */
 
 	/*
+	 * While building initial runs, this indicates if the replacement
+	 * selection strategy is in use.  When it isn't, then a simple hybrid
+	 * sort-merge strategy is in use instead.
+	 */
+	bool		replaceActive;
+
+	/*
 	 * While building initial runs, this is the current output run number
 	 * (starting at 0).  Afterwards, it is the number of initial runs we made.
 	 */
@@ -327,12 +362,22 @@ struct Tuplesortstate
 	int			activeTapes;	/* # of active input tapes in merge pass */
 
 	/*
+	 * These variables are used after tuplesort_performsort() for the
+	 * TSS_MEMTAPEMERGE case.  This is a special, optimized final on-the-fly
+	 * merge pass involving merging the result tape with memtuples that were
+	 * quicksorted (but never made it out to a tape).
+	 */
+	SortTuple	tape_cache;		/* cached tape tuple from prior call */
+	bool		cached;			/* tape_cache holds pending tape tuple */
+	bool		just_memtuples;	/* merge only fetching from memtuples */
+
+	/*
 	 * These variables are used after completion of sorting to keep track of
 	 * the next tuple to return.  (In the tape case, the tape's current read
 	 * position is also critical state.)
 	 */
 	int			result_tape;	/* actual tape number of finished output */
-	int			current;		/* array index (only used if SORTEDINMEM) */
+	int			current;		/* memtuples array index */
 	bool		eof_reached;	/* reached EOF (needed for cursors) */
 
 	/* markpos_xxx holds marked position for mark and restore */
@@ -457,19 +502,24 @@ struct Tuplesortstate
 	} while(0)
 
 
-static Tuplesortstate *tuplesort_begin_common(int workMem, bool randomAccess);
+static Tuplesortstate *tuplesort_begin_common(int workMem, double rowCount,
+											  bool randomAccess);
 static void puttuple_common(Tuplesortstate *state, SortTuple *tuple);
 static bool consider_abort_common(Tuplesortstate *state);
+static bool useselection(Tuplesortstate *state);
 static void inittapes(Tuplesortstate *state);
 static void selectnewtape(Tuplesortstate *state);
 static void mergeruns(Tuplesortstate *state);
 static void mergeonerun(Tuplesortstate *state);
+static void mergememruns(Tuplesortstate *state);
 static void beginmerge(Tuplesortstate *state);
 static void mergepreread(Tuplesortstate *state);
 static void mergeprereadone(Tuplesortstate *state, int srcTape);
 static void dumptuples(Tuplesortstate *state, bool alltuples);
+static void dumpbatch(Tuplesortstate *state, bool alltuples);
 static void make_bounded_heap(Tuplesortstate *state);
 static void sort_bounded_heap(Tuplesortstate *state);
+static void tuplesort_sort_memtuples(Tuplesortstate *state);
 static void tuplesort_heap_insert(Tuplesortstate *state, SortTuple *tuple,
 					  int tupleindex, bool checkIndex);
 static void tuplesort_heap_siftup(Tuplesortstate *state, bool checkIndex);
@@ -533,12 +583,13 @@ static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
  * Each variant of tuplesort_begin has a workMem parameter specifying the
  * maximum number of kilobytes of RAM to use before spilling data to disk.
  * (The normal value of this parameter is work_mem, but some callers use
- * other values.)  Each variant also has a randomAccess parameter specifying
+ * other values.)  Each variant also has a hint parameter of the total
+ * number of rows to be sorted, and a randomAccess parameter specifying
  * whether the caller needs non-sequential access to the sort result.
  */
 
 static Tuplesortstate *
-tuplesort_begin_common(int workMem, bool randomAccess)
+tuplesort_begin_common(int workMem, double rowCount, bool randomAccess)
 {
 	Tuplesortstate *state;
 	MemoryContext sortcontext;
@@ -568,6 +619,7 @@ tuplesort_begin_common(int workMem, bool randomAccess)
 #endif
 
 	state->status = TSS_INITIAL;
+	state->rowCount = rowCount;
 	state->randomAccess = randomAccess;
 	state->bounded = false;
 	state->boundUsed = false;
@@ -613,9 +665,10 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 					 int nkeys, AttrNumber *attNums,
 					 Oid *sortOperators, Oid *sortCollations,
 					 bool *nullsFirstFlags,
-					 int workMem, bool randomAccess)
+					 int workMem, double rowCount, bool randomAccess)
 {
-	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
+	Tuplesortstate *state = tuplesort_begin_common(workMem, rowCount,
+												   randomAccess);
 	MemoryContext oldcontext;
 	int			i;
 
@@ -683,9 +736,10 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 Tuplesortstate *
 tuplesort_begin_cluster(TupleDesc tupDesc,
 						Relation indexRel,
-						int workMem, bool randomAccess)
+						int workMem, double rowCount, bool randomAccess)
 {
-	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
+	Tuplesortstate *state = tuplesort_begin_common(workMem, rowCount,
+												   randomAccess);
 	ScanKey		indexScanKey;
 	MemoryContext oldcontext;
 	int			i;
@@ -776,9 +830,10 @@ Tuplesortstate *
 tuplesort_begin_index_btree(Relation heapRel,
 							Relation indexRel,
 							bool enforceUnique,
-							int workMem, bool randomAccess)
+							int workMem, double rowCount, bool randomAccess)
 {
-	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
+	Tuplesortstate *state = tuplesort_begin_common(workMem, rowCount,
+												   randomAccess);
 	ScanKey		indexScanKey;
 	MemoryContext oldcontext;
 	int			i;
@@ -851,9 +906,10 @@ Tuplesortstate *
 tuplesort_begin_index_hash(Relation heapRel,
 						   Relation indexRel,
 						   uint32 hash_mask,
-						   int workMem, bool randomAccess)
+						   int workMem, double rowCount, bool randomAccess)
 {
-	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
+	Tuplesortstate *state = tuplesort_begin_common(workMem, rowCount,
+												   randomAccess);
 	MemoryContext oldcontext;
 
 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -886,9 +942,10 @@ tuplesort_begin_index_hash(Relation heapRel,
 Tuplesortstate *
 tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 					  bool nullsFirstFlag,
-					  int workMem, bool randomAccess)
+					  int workMem, double rowCount, bool randomAccess)
 {
-	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
+	Tuplesortstate *state = tuplesort_begin_common(workMem, rowCount,
+												   randomAccess);
 	MemoryContext oldcontext;
 	int16		typlen;
 	bool		typbyval;
@@ -1497,22 +1554,61 @@ puttuple_common(Tuplesortstate *state, SortTuple *tuple)
 
 			/*
 			 * Insert the tuple into the heap, with run number currentRun if
-			 * it can go into the current run, else run number currentRun+1.
-			 * The tuple can go into the current run if it is >= the first
-			 * not-yet-output tuple.  (Actually, it could go into the current
-			 * run if it is >= the most recently output tuple ... but that
-			 * would require keeping around the tuple we last output, and it's
-			 * simplest to let writetup free each tuple as soon as it's
-			 * written.)
+			 * it can go into the current run, else run number INT_MAX (some
+			 * later run).  The tuple can go into the current run if it is
+			 * >= the first not-yet-output tuple.  (Actually, it could go
+			 * into the current run if it is >= the most recently output
+			 * tuple ... but that would require keeping around the tuple we
+			 * last output, and it's simplest to let writetup free each
+			 * tuple as soon as it's written.)
 			 *
-			 * Note there will always be at least one tuple in the heap at
-			 * this point; see dumptuples.
+			 * Note that this only applies if the currentRun is 0 (prior to
+			 * giving up on heapification).  There is no meaningful
+			 * distinction between any two runs in memory except the first
+			 * and second run.  When the currentRun is not 0, there is no
+			 * guarantee that any tuples are already stored in memory here,
+			 * and if there are any they're in no significant order.
 			 */
-			Assert(state->memtupcount > 0);
-			if (COMPARETUP(state, tuple, &state->memtuples[0]) >= 0)
+			Assert(!state->replaceActive || state->memtupcount > 0);
+			if (state->replaceActive &&
+				COMPARETUP(state, tuple, &state->memtuples[0]) >= 0)
+			{
+				/*
+				 * Unlike classic replacement selection, which this module was
+				 * previously based on, only run 0 is treated as a priority
+				 * queue through heapification.  The second run (run 1) is
+				 * appended indifferently below, and will never be trusted to
+				 * maintain the heap invariant beyond simply not getting in
+				 * the way of spilling run 0 incrementally.  In other words,
+				 * second run tuples may be sifted out of the way of first
+				 * run tuples; COMPARETUP() will never be called for run
+				 * 1 tuples.  However, not even HEAPCOMPARE() will be
+				 * called for a subsequent run's tuples.
+				 */
 				tuplesort_heap_insert(state, tuple, state->currentRun, true);
+			}
 			else
-				tuplesort_heap_insert(state, tuple, state->currentRun + 1, true);
+			{
+				/*
+				 * Note that unlike Knuth, we do not care about the second
+				 * run's tuples when loading runs.  After the first run is
+				 * complete, tuples will not be dumped incrementally at all,
+				 * but as long as the first run (run 0) is current it will
+				 * be maintained.  dumptuples does not trust that the second
+				 * or subsequent runs are heapified (beyond merely not
+				 * getting in the way of the first, fully heapified run,
+				 * which only matters for the second run, run 1).  Anything
+				 * past the first run will be quicksorted.
+				 *
+				 * Past the first run, there is no need to differentiate runs
+				 * in memory (only the first and second runs will ever be
+				 * usefully differentiated).  Use a generic INT_MAX run
+				 * number (just to be tidy).  There should always be room to
+				 * store the incoming tuple.
+				 */
+				tuple->tupindex = INT_MAX;
+				state->memtuples[state->memtupcount++] = *tuple;
+			}
 
 			/*
 			 * If we are over the memory limit, dump tuples till we're under.
@@ -1587,20 +1683,9 @@ tuplesort_performsort(Tuplesortstate *state)
 
 			/*
 			 * We were able to accumulate all the tuples within the allowed
-			 * amount of memory.  Just qsort 'em and we're done.
+			 * amount of memory.  Just quicksort 'em and we're done.
 			 */
-			if (state->memtupcount > 1)
-			{
-				/* Can we use the single-key sort function? */
-				if (state->onlyKey != NULL)
-					qsort_ssup(state->memtuples, state->memtupcount,
-							   state->onlyKey);
-				else
-					qsort_tuple(state->memtuples,
-								state->memtupcount,
-								state->comparetup,
-								state);
-			}
+			tuplesort_sort_memtuples(state);
 			state->current = 0;
 			state->eof_reached = false;
 			state->markpos_offset = 0;
@@ -1627,12 +1712,27 @@ tuplesort_performsort(Tuplesortstate *state)
 
 			/*
 			 * Finish tape-based sort.  First, flush all tuples remaining in
-			 * memory out to tape; then merge until we have a single remaining
-			 * run (or, if !randomAccess, one run per tape). Note that
-			 * mergeruns sets the correct state->status.
+			 * memory out to tape where that's required (when more than one
+			 * run's tuples made it to tape, or when the caller required
+			 * random access).  Then, either merge until we have a single
+			 * remaining run on tape, or merge runs in memory by sorting
+			 * them into one single in-memory run.  Note that
+			 * mergeruns/mergememruns sets the correct state->status.
 			 */
-			dumptuples(state, true);
-			mergeruns(state);
+			if (state->currentRun > 0 || state->randomAccess)
+			{
+				dumptuples(state, true);
+				mergeruns(state);
+			}
+			else
+			{
+				/*
+				 * "Quicksort with spillover".  Only possible for
+				 * !randomAccess callers, just as with tape based on-the-fly
+				 * merge.
+				 */
+				mergememruns(state);
+			}
 			state->eof_reached = false;
 			state->markpos_block = 0L;
 			state->markpos_offset = 0;
@@ -1651,6 +1751,9 @@ tuplesort_performsort(Tuplesortstate *state)
 			elog(LOG, "performsort done (except %d-way final merge): %s",
 				 state->activeTapes,
 				 pg_rusage_show(&state->ru_start));
+		else if (state->status == TSS_MEMTAPEMERGE)
+			elog(LOG, "performsort done (except memory/tape final merge): %s",
+				 pg_rusage_show(&state->ru_start));
 		else
 			elog(LOG, "performsort done: %s",
 				 pg_rusage_show(&state->ru_start));
@@ -1802,6 +1905,104 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
 			READTUP(state, stup, state->result_tape, tuplen);
 			return true;
 
+		case TSS_MEMTAPEMERGE:
+			Assert(forward);
+			/* For now, assume tuple returned from memory */
+			*should_free = false;
+
+			/*
+			 * Should be at least one memtuple (work_mem should be roughly
+			 * fully consumed)
+			 */
+			Assert(state->memtupcount > 0);
+
+			if (state->eof_reached)
+				return false;
+
+			if (state->just_memtuples)
+				goto just_memtuples;
+
+			/*
+			 * Merge together quicksorted memtuples array run (which often
+			 * consists of what were originally 2 "heap-wise" runs), and
+			 * tuples sorted on tape (which was originally the head of the
+			 * first heap-wise run).
+			 *
+			 * Exhaust the supply of tape tuples first.
+			 *
+			 * "stup" is always initially set to the current tape tuple if
+			 * any remain, which may be cached from previous call, or read
+			 * from tape when nothing cached.
+			 */
+			if (state->cached)
+				*stup = state->tape_cache;
+			else if ((tuplen = getlen(state, state->result_tape, true)) != 0)
+				READTUP(state, stup, state->result_tape, tuplen);
+			else
+			{
+				/* Supply of tape tuples was just exhausted */
+				state->just_memtuples = true;
+				goto just_memtuples;
+			}
+
+			/*
+			 * Kludge:  Trigger abbreviated tie-breaker if in-memory tuples
+			 * use abbreviation (writing tuples to tape never preserves
+			 * abbreviated keys).  Do this by assigning in-memory
+			 * abbreviated tuple to tape tuple directly.
+			 *
+			 * It doesn't seem worth generating a new abbreviated key for
+			 * the tape tuple, and this approach is simpler than
+			 * "unabbreviating" the memtuple tuple from a "common" routine
+			 * like this.
+			 */
+			if (state->sortKeys != NULL && state->sortKeys->abbrev_converter != NULL)
+				stup->datum1 = state->memtuples[state->current].datum1;
+
+			/*
+			 * Compare current tape tuple to current memtuple.
+			 *
+			 * Since we always start with at least one memtuple, and since tape
+			 * tuples are always returned before equal memtuples, it follows
+			 * that there must be at least one memtuple left to return here.
+			 */
+			Assert(state->current < state->memtupcount);
+
+			if (COMPARETUP(state, stup, &state->memtuples[state->current]) <= 0)
+			{
+				/*
+				 * Tape tuple less than or equal to memtuple array current
+				 * position.  Return it.
+				 */
+				state->cached = false;
+				/* Caller can free tape tuple memory */
+				*should_free = true;
+			}
+			else
+			{
+				/*
+				 * Tape tuple greater than memtuple array's current tuple.
+				 *
+				 * Return current memtuple tuple, and cache tape tuple for
+				 * next call.  It will be returned on next or subsequent
+				 * call.
+				 */
+				state->tape_cache = *stup;
+				state->cached = true;
+				*stup = state->memtuples[state->current++];
+			}
+			return true;
+
+just_memtuples:
+			/* Just return memtuples -- merging done */
+			if (state->current < state->memtupcount)
+			{
+				*stup = state->memtuples[state->current++];
+				return true;
+			}
+			state->eof_reached = true;
+			return false;
+
 		case TSS_FINALMERGE:
 			Assert(forward);
 			*should_free = true;
@@ -2011,6 +2212,7 @@ tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples, bool forward)
 			return false;
 
 		case TSS_SORTEDONTAPE:
+		case TSS_MEMTAPEMERGE:
 		case TSS_FINALMERGE:
 
 			/*
@@ -2066,13 +2268,108 @@ tuplesort_merge_order(int64 allowedMem)
 	mOrder = (allowedMem - TAPE_BUFFER_OVERHEAD) /
 		(MERGE_BUFFER_SIZE + TAPE_BUFFER_OVERHEAD);
 
-	/* Even in minimum memory, use at least a MINORDER merge */
+	/*
+	 * Even in minimum memory, use at least a MINORDER merge.  Also cap the
+	 * maximum merge order to MAXORDER.
+	 *
+	 * When allowedMem is drastically lower than what is required for an
+	 * internal sort, it is unlikely that there are benefits to increasing the
+	 * number of tapes beyond Knuth's "sweet spot" of 7, especially with
+	 * modern hardware that is typically memory bound, which is one reason for
+	 * the upper bound.  Another reason is that, in the common case where
+	 * there are at most a few hundred runs, the per-tape overhead could be
+	 * significant with high allowedMem if we did not cap.
+	 */
 	mOrder = Max(mOrder, MINORDER);
+	mOrder = Min(mOrder, MAXORDER);
 
 	return mOrder;
 }
 
 /*
+ * useselection - determine if one replacement selection run should be
+ * attempted.
+ *
+ * This is called when we just ran out of memory, and must consider the
+ * costs and benefits of replacement selection for the first run, which can
+ * result in a "quicksort with spillover".  Note that replacement selection
+ * is always abandoned after the first run.
+ */
+static bool
+useselection(Tuplesortstate *state)
+{
+	int64		memNowUsed = state->allowedMem - state->availMem;
+	double		avgTupleSize;
+	int			increments;
+	double		crossover;
+	bool		useSelection;
+
+	/* For randomAccess callers, "quicksort with spillover" is never used */
+	if (state->randomAccess)
+		return false;
+
+	/*
+	 * The crossover point is somewhere between memtuples holding 40% of the
+	 * total tuples to sort and holding all but one of them.  This weighs the
+	 * approximate savings in I/O against the generic heap sorting cost.
+	 */
+	avgTupleSize = (double) memNowUsed / (double) state->memtupsize;
+
+	/*
+	 * Starting from a threshold of 90%, refund 7.5% per 32-byte increment
+	 * in average tuple size.
+	 */
+	increments = MAXALIGN_DOWN((int) avgTupleSize) / 32;
+	crossover = 0.90 - (increments * 0.075);
+
+	/*
+	 * Clamp, making either outcome possible regardless of average size.
+	 *
+	 * 40% is about the minimum point at which "quicksort with spillover"
+	 * can still occur without a logical/physical correlation.
+	 */
+	crossover = Max(0.40, Min(crossover, 0.85));
+
+	/*
+	 * The point where the overhead of maintaining the heap invariant is
+	 * likely to dominate over any saving in I/O is somewhat arbitrarily
+	 * assumed to be the point where memtuples' size exceeds MaxAllocSize
+	 * (note that overall memory consumption may be far greater).  Past
+	 * this point, only the most compelling cases use replacement selection
+	 * for their first run.
+	 *
+	 * This is not about cache characteristics so much as the O(n log n)
+	 * cost of sorting larger runs dominating over the O(n) cost of
+	 * writing/reading tuples.
+	 */
+	if (sizeof(SortTuple) * state->memtupcount > MaxAllocSize)
+		crossover = avgTupleSize > 32 ? 0.90 : 0.95;
+
+	useSelection = state->memtupcount > state->rowCount * crossover;
+
+	/*
+	 * Sanity check:  If what we actually have in memory is less than half
+	 * of the hinted total number of input tuples, the hint is clearly of
+	 * little use; it's probably a generic, bogus estimate.  Do not proceed.
+	 */
+	if (state->memtupcount < state->rowCount * 0.5)
+		useSelection = false;
+
+#ifdef TRACE_SORT
+	if (trace_sort)
+		elog(LOG,
+			 "%s used at row %d crossover %.3f (est %.2f rows %.2f runs)",
+			 useSelection ?
+			 "replacement selection (quicksort with spillover) strategy" :
+			 "hybrid sort-merge strategy",
+			 state->memtupcount, crossover, state->rowCount,
+			 state->rowCount / state->memtupcount);
+#endif
+
+	return useSelection;
+}
+
+/*
  * inittapes - initialize for tape sorting.
  *
  * This is called only if we have found we don't have room to sort in memory.
@@ -2081,7 +2378,6 @@ static void
 inittapes(Tuplesortstate *state)
 {
 	int			maxTapes,
-				ntuples,
 				j;
 	int64		tapeSpace;
 
@@ -2140,23 +2436,38 @@ inittapes(Tuplesortstate *state)
 	state->tp_tapenum = (int *) palloc0(maxTapes * sizeof(int));
 
 	/*
-	 * Convert the unsorted contents of memtuples[] into a heap. Each tuple is
-	 * marked as belonging to run number zero.
-	 *
-	 * NOTE: we pass false for checkIndex since there's no point in comparing
-	 * indexes in this step, even though we do intend the indexes to be part
-	 * of the sort key...
+	 * Give replacement selection a try when the number of tuples to be
+	 * sorted offers a reasonable chance of a "quicksort with spillover".
+	 * There will be a switch to a simple hybrid sort-merge strategy after
+	 * the first run (iff there is to be a second on-tape run).
 	 */
-	ntuples = state->memtupcount;
-	state->memtupcount = 0;		/* make the heap empty */
-	for (j = 0; j < ntuples; j++)
+	state->replaceActive = useselection(state);
+	state->cached = false;
+	state->just_memtuples = false;
+
+	if (state->replaceActive)
 	{
-		/* Must copy source tuple to avoid possible overwrite */
-		SortTuple	stup = state->memtuples[j];
+		/*
+		 * Convert the unsorted contents of memtuples[] into a heap. Each
+		 * tuple is marked as belonging to run number zero.
+		 *
+		 * NOTE: we pass false for checkIndex since there's no point in
+		 * comparing indexes in this step, even though we do intend the
+		 * indexes to be part of the sort key...
+		 */
+		int			ntuples = state->memtupcount;
 
-		tuplesort_heap_insert(state, &stup, 0, false);
+		state->memtupcount = 0;		/* make the heap empty */
+
+		for (j = 0; j < ntuples; j++)
+		{
+			/* Must copy source tuple to avoid possible overwrite */
+			SortTuple	stup = state->memtuples[j];
+
+			tuplesort_heap_insert(state, &stup, 0, false);
+		}
+		Assert(state->memtupcount == ntuples);
 	}
-	Assert(state->memtupcount == ntuples);
 
 	state->currentRun = 0;
 
@@ -2231,21 +2542,6 @@ mergeruns(Tuplesortstate *state)
 	Assert(state->status == TSS_BUILDRUNS);
 	Assert(state->memtupcount == 0);
 
-	/*
-	 * If we produced only one initial run (quite likely if the total data
-	 * volume is between 1X and 2X workMem), we can just use that tape as the
-	 * finished output, rather than doing a useless merge.  (This obvious
-	 * optimization is not in Knuth's algorithm.)
-	 */
-	if (state->currentRun == 1)
-	{
-		state->result_tape = state->tp_tapenum[state->destTape];
-		/* must freeze and rewind the finished output tape */
-		LogicalTapeFreeze(state->tapeset, state->result_tape);
-		state->status = TSS_SORTEDONTAPE;
-		return;
-	}
-
 	if (state->sortKeys != NULL && state->sortKeys->abbrev_converter != NULL)
 	{
 		/*
@@ -2437,6 +2733,89 @@ mergeonerun(Tuplesortstate *state)
 }
 
 /*
+ * mergememruns -- merge runs in memory into a new in-memory run.
+ *
+ * This allows tuplesort to avoid dumping many tuples in the common case
+ * where workMem is only somewhat less than the amount that would allow an
+ * internal sort to proceed.  This technique is called "quicksort with
+ * spillover", and uses a heap (this is the sole reason why this module
+ * continues to support a heap that distinguishes tuples by run number, in
+ * the style of replacement selection).
+ *
+ * The useselection() cost model aims to limit the use of the technique to
+ * cases where it allows most I/O to be avoided.  Prior to PostgreSQL
+ * 9.6, replacement selection could sometimes produce runs that are far
+ * larger than workMem in size, but currently the goal of incremental
+ * heap spilling is to avoid I/O (i.e. The heap enables the avoidance of
+ * spilling the majority of tuples entirely).  The traditional goal of
+ * replacement selection is to maximize the size of every run, which is
+ * a significant difference.
+ *
+ * Merging here actually means quicksorting all memtuples, without regard
+ * to the "heap-wise" run number of each memtuple.  Note that this
+ * in-memory merge is distinct from the final on-the-fly merge step that
+ * will follow.  This routine merges what was originally the tail of the
+ * first run with what was originally the entire second run in advance of
+ * the on-the-fly merge step.  We only quicksort;  there doesn't seem to
+ * be much potential to further exploit the fact that memtuples is
+ * heapified.
+ *
+ * The final on-the-fly merge occurs between the new all-in-memory run
+ * produced by this routine, and what was originally the first part of the
+ * first heap-wise run (and will shortly be considered simply the first
+ * run), which is all on tape already.  Incremental spilling from the heap
+ * left all tuples on tape in sorted order.
+ */
+static void
+mergememruns(Tuplesortstate *state)
+{
+	Assert(state->replaceActive);
+	Assert(!state->randomAccess);
+
+	/*
+	 * Write an end-of-run marker on the output tape, since the tail of what
+	 * was conceptually the same heap-wise run (run 0) is now to be merged
+	 * with a possible second heap-wise run, to make a new in-memory run.
+	 */
+	markrunend(state, state->tp_tapenum[state->destTape]);
+
+#ifdef TRACE_SORT
+	if (trace_sort)
+		elog(LOG, "starting quicksort of %d tuples to create single in-memory run: %s",
+			 state->memtupcount, pg_rusage_show(&state->ru_start));
+#endif
+
+	/*
+	 * Quicksort memtuples to merge (part of) heap-wise run 0, the first
+	 * heap-wise run, with (entire) heap-wise run 1, the second.
+	 */
+	tuplesort_sort_memtuples(state);
+	state->current = 0;
+
+#ifdef TRACE_SORT
+	if (trace_sort)
+		elog(LOG, "finished quicksorting %d tuples to create single in-memory run: %s",
+			 state->memtupcount, pg_rusage_show(&state->ru_start));
+#endif
+
+	/*
+	 * Note:  If there was only ever a single heap-wise run, which is
+	 * likely in the event of partially sorted input, we don't do anything
+	 * special.  In particular, we currently don't bother to avoid *all*
+	 * comparisons during the final on-the-fly merge (when we're still
+	 * returning tuples from tape, the lowest/first in-memory tuple will
+	 * be compared against every tape tuple).  In the future, avoiding
+	 * these comparisons might be worthwhile for certain cases
+	 * (non-overlapping ranges within the two runs to be merged should
+	 * still make things fast, though).
+	 */
+	state->result_tape = state->tp_tapenum[state->destTape];
+	/* Must freeze and rewind the finished output tape */
+	LogicalTapeFreeze(state->tapeset, state->result_tape);
+	state->status = TSS_MEMTAPEMERGE;
+}
+
+/*
  * beginmerge - initialize for a merge pass
  *
  * We decrease the counts of real and dummy runs for each tape, and mark
@@ -2615,21 +2994,23 @@ mergeprereadone(Tuplesortstate *state, int srcTape)
 }
 
 /*
- * dumptuples - remove tuples from heap and write to tape
+ * dumptuples - remove tuples from memtuples and write to tape
  *
  * This is used during initial-run building, but not during merging.
  *
- * When alltuples = false, dump only enough tuples to get under the
- * availMem limit (and leave at least one tuple in the heap in any case,
- * since puttuple assumes it always has a tuple to compare to).  We also
- * insist there be at least one free slot in the memtuples[] array.
+ * When alltuples = false and replacement selection is still active, dump
+ * only enough tuples to get under the availMem limit (and leave at least
+ * one tuple in memtuples, since puttuple will then assume it is a heap that
+ * has a tuple to compare to).  We also insist there be at least one free
+ * slot in the memtuples[] array.
  *
- * When alltuples = true, dump everything currently in memory.
+ * When alltuples = true, always batch dump everything currently in memory.
  * (This case is only used at end of input data.)
  *
- * If we empty the heap, close out the current run and return (this should
- * only happen at end of input data).  If we see that the tuple run number
- * at the top of the heap has changed, start a new run.
+ * If, when replacement selection is active, we see that the tuple run
+ * number at the top of the heap has changed, start a new run.  This must be
+ * the first run, because replacement selection is always abandoned for all
+ * further runs.
  */
 static void
 dumptuples(Tuplesortstate *state, bool alltuples)
@@ -2638,22 +3019,40 @@ dumptuples(Tuplesortstate *state, bool alltuples)
 		   (LACKMEM(state) && state->memtupcount > 1) ||
 		   state->memtupcount >= state->memtupsize)
 	{
+		if (state->replaceActive && !alltuples)
+		{
+			/*
+			 * Still holding out for a case favorable to replacement
+			 * selection.  Still incrementally spilling using the heap.
+			 *
+			 * Dump the heap's frontmost entry, and sift up to remove it from
+			 * the heap.
+			 */
+			Assert(state->memtupcount > 1);
+			WRITETUP(state, state->tp_tapenum[state->destTape],
+					 &state->memtuples[0]);
+			tuplesort_heap_siftup(state, true);
+		}
+		else
+		{
+			/*
+			 * Once committed to quicksorting runs, never incrementally
+			 * spill
+			 */
+			dumpbatch(state, alltuples);
+			break;
+		}
+
 		/*
-		 * Dump the heap's frontmost entry, and sift up to remove it from the
-		 * heap.
+		 * If top run number has changed, we've finished the current run
+		 * (this can only be the first run, run 0), and will no longer spill
+		 * incrementally.
 		 */
 		Assert(state->memtupcount > 0);
-		WRITETUP(state, state->tp_tapenum[state->destTape],
-				 &state->memtuples[0]);
-		tuplesort_heap_siftup(state, true);
-
-		/*
-		 * If the heap is empty *or* top run number has changed, we've
-		 * finished the current run.
-		 */
-		if (state->memtupcount == 0 ||
-			state->currentRun != state->memtuples[0].tupindex)
+		if (state->memtuples[0].tupindex != 0)
 		{
+			Assert(state->memtuples[0].tupindex == INT_MAX);
+
 			markrunend(state, state->tp_tapenum[state->destTape]);
 			state->currentRun++;
 			state->tp_runs[state->destTape]++;
@@ -2661,24 +3060,93 @@ dumptuples(Tuplesortstate *state, bool alltuples)
 
 #ifdef TRACE_SORT
 			if (trace_sort)
-				elog(LOG, "finished writing%s run %d to tape %d: %s",
-					 (state->memtupcount == 0) ? " final" : "",
+				elog(LOG, "finished writing heapsorted run %d to tape %d: %s",
 					 state->currentRun, state->destTape,
 					 pg_rusage_show(&state->ru_start));
 #endif
 
 			/*
-			 * Done if heap is empty, else prepare for new run.
+			 * The heap is unlikely to be empty.  Prepare for a new run, and
+			 * give up on replacement selection.
+			 *
+			 * This outcome is probably due to a caller hint that incorrectly
+			 * indicated the end of the input was near when we first
+			 * LACKMEM()'d.  This is not desirable, since a "quicksort with
+			 * spillover" will not occur.
 			 */
-			if (state->memtupcount == 0)
-				break;
-			Assert(state->currentRun == state->memtuples[0].tupindex);
 			selectnewtape(state);
+			state->replaceActive = false;
 		}
 	}
 }
 
 /*
+ * dumpbatch - sort and dump all memtuples, forming one run on tape
+ *
+ * Second or subsequent runs are never heapified by this module (although
+ * heapification still respects run number differences between the first and
+ * second runs), and a heap (replacement selection priority queue) is often
+ * avoided in the first place.
+ */
+static void
+dumpbatch(Tuplesortstate *state, bool alltuples)
+{
+	int			memtupwrite;
+	int			i;
+
+	Assert(state->status == TSS_BUILDRUNS);
+
+	/*
+	 * Final call might require no sorting, in rare cases where we just so
+	 * happen to have previously LACKMEM()'d at the point where exactly all
+	 * remaining tuples are loaded into memory, just before input was
+	 * exhausted.
+	 *
+	 * Do not increment run number, or mark the run as over -- this is a
+	 * no-op.
+	 */
+	if (state->memtupcount == 0)
+		return;
+
+	state->currentRun++;
+
+#ifdef TRACE_SORT
+	if (trace_sort)
+		elog(LOG, "starting quicksort of run %d: %s",
+			 state->currentRun, pg_rusage_show(&state->ru_start));
+#endif
+
+	tuplesort_sort_memtuples(state);
+
+#ifdef TRACE_SORT
+	if (trace_sort)
+		elog(LOG, "finished quicksorting run %d: %s",
+			 state->currentRun, pg_rusage_show(&state->ru_start));
+#endif
+
+	memtupwrite = state->memtupcount;
+	for (i = 0; i < memtupwrite; i++)
+	{
+		WRITETUP(state, state->tp_tapenum[state->destTape],
+				 &state->memtuples[i]);
+		state->memtupcount--;
+	}
+	markrunend(state, state->tp_tapenum[state->destTape]);
+	state->tp_runs[state->destTape]++;
+	state->tp_dummy[state->destTape]--; /* per Alg D step D2 */
+
+#ifdef TRACE_SORT
+	if (trace_sort)
+		elog(LOG, "finished writing run %d to tape %d: %s",
+			 state->currentRun, state->destTape,
+			 pg_rusage_show(&state->ru_start));
+#endif
+
+	if (!alltuples)
+		selectnewtape(state);
+}
+
+/*
  * tuplesort_rescan		- rewind and replay the scan
  */
 void
@@ -2788,7 +3256,8 @@ void
 tuplesort_get_stats(Tuplesortstate *state,
 					const char **sortMethod,
 					const char **spaceType,
-					long *spaceUsed)
+					long *spaceUsed,
+					int *rowsSortedMem)
 {
 	/*
 	 * Note: it might seem we should provide both memory and disk usage for a
@@ -2817,15 +3286,23 @@ tuplesort_get_stats(Tuplesortstate *state,
 				*sortMethod = "top-N heapsort";
 			else
 				*sortMethod = "quicksort";
+			*rowsSortedMem = state->memtupcount;
 			break;
 		case TSS_SORTEDONTAPE:
 			*sortMethod = "external sort";
+			*rowsSortedMem = 0;
+			break;
+		case TSS_MEMTAPEMERGE:
+			*sortMethod = "quicksort with spillover";
+			*rowsSortedMem = state->memtupcount;
 			break;
 		case TSS_FINALMERGE:
 			*sortMethod = "external merge";
+			*rowsSortedMem = 0;
 			break;
 		default:
 			*sortMethod = "still in progress";
+			*rowsSortedMem = -1;
 			break;
 	}
 }
@@ -2836,10 +3313,19 @@ tuplesort_get_stats(Tuplesortstate *state,
  *
  * Compare two SortTuples.  If checkIndex is true, use the tuple index
  * as the front of the sort key; otherwise, no.
+ *
+ * Note that for checkIndex callers, the heap invariant is never maintained
+ * beyond the first run, and so there are no COMPARETUP() calls beyond the
+ * first run.  It is assumed that checkIndex callers are maintaining the
+ * heap invariant for a replacement selection priority queue, but those
+ * callers do not go on to trust the heap to be fully-heapified past the
+ * first run.  Once currentRun isn't the first, memtuples is no longer a
+ * heap at all.
  */
 
 #define HEAPCOMPARE(tup1,tup2) \
-	(checkIndex && ((tup1)->tupindex != (tup2)->tupindex) ? \
+	(checkIndex && ((tup1)->tupindex != (tup2)->tupindex || \
+					(tup1)->tupindex != 0) ? \
 	 ((tup1)->tupindex) - ((tup2)->tupindex) : \
 	 COMPARETUP(state, tup1, tup2))
 
@@ -2938,6 +3424,30 @@ sort_bounded_heap(Tuplesortstate *state)
 }
 
 /*
+ * Sort all memtuples.
+ *
+ * Quicksort is tuplesort's internal sort algorithm.  It is also generally
+ * preferred to replacement selection of runs during external sorts, except
+ * where incrementally spilling (using a heap) is particularly beneficial.
+ */
+static void
+tuplesort_sort_memtuples(Tuplesortstate *state)
+{
+	if (state->memtupcount > 1)
+	{
+		/* Can we use the single-key sort function? */
+		if (state->onlyKey != NULL)
+			qsort_ssup(state->memtuples, state->memtupcount,
+					   state->onlyKey);
+		else
+			qsort_tuple(state->memtuples,
+						state->memtupcount,
+						state->comparetup,
+						state);
+	}
+}
+
+/*
  * Insert a new tuple into an empty or existing heap, maintaining the
  * heap invariant.  Caller is responsible for ensuring there's room.
  *
@@ -2965,6 +3475,17 @@ tuplesort_heap_insert(Tuplesortstate *state, SortTuple *tuple,
 	memtuples = state->memtuples;
 	Assert(state->memtupcount < state->memtupsize);
 
+	/*
+	 * Once incremental heap spilling is abandoned, this routine should not be
+	 * called when loading runs.  memtuples will be an array of tuples in no
+	 * significant order, so calling here is inappropriate.  Even when
+	 * incremental spilling is still in progress, this routine does not handle
+	 * the second run's tuples (those are heapified to a limited extent that
+	 * they are appended, and thus kept away from those tuples in the first
+	 * run).
+	 */
+	Assert(!checkIndex || tupleindex == 0);
+
 	CHECK_FOR_INTERRUPTS();
 
 	/*
@@ -2996,6 +3517,13 @@ tuplesort_heap_siftup(Tuplesortstate *state, bool checkIndex)
 	int			i,
 				n;
 
+	/*
+	 * Once incremental heap spilling is abandoned, this routine should not be
+	 * called when loading runs.  memtuples will be an array of tuples in no
+	 * significant order, so calling here is inappropriate.
+	 */
+	Assert(!checkIndex || state->currentRun == 0);
+
 	if (--state->memtupcount <= 0)
 		return;
 
@@ -3242,7 +3770,7 @@ readtup_heap(Tuplesortstate *state, SortTuple *stup,
 	HeapTupleData htup;
 
 	USEMEM(state, GetMemoryChunkSpace(tuple));
-	/* read in the tuple proper */
+	/* read in the tuple itself */
 	tuple->t_len = tuplen;
 	LogicalTapeReadExact(state->tapeset, tapenum,
 						 tupbody, tupbodylen);
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 97cb859..95acc1d 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -335,7 +335,8 @@ extern bool _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir);
 /* hashsort.c */
 typedef struct HSpool HSpool;	/* opaque struct in hashsort.c */
 
-extern HSpool *_h_spoolinit(Relation heap, Relation index, uint32 num_buckets);
+extern HSpool *_h_spoolinit(Relation heap, Relation index, uint32 num_buckets,
+				 double reltuples);
 extern void _h_spooldestroy(HSpool *hspool);
 extern void _h_spool(HSpool *hspool, ItemPointer self,
 		 Datum *values, bool *isnull);
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 9e48efd..5504b7b 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -743,7 +743,7 @@ extern void BTreeShmemInit(void);
 typedef struct BTSpool BTSpool; /* opaque type known only within nbtsort.c */
 
 extern BTSpool *_bt_spoolinit(Relation heap, Relation index,
-			  bool isunique, bool isdead);
+			  bool isunique, bool isdead, double reltuples);
 extern void _bt_spooldestroy(BTSpool *btspool);
 extern void _bt_spool(BTSpool *btspool, ItemPointer self,
 		  Datum *values, bool *isnull);
diff --git a/src/include/executor/nodeAgg.h b/src/include/executor/nodeAgg.h
index fe3b81a..e6144f2 100644
--- a/src/include/executor/nodeAgg.h
+++ b/src/include/executor/nodeAgg.h
@@ -21,6 +21,8 @@ extern TupleTableSlot *ExecAgg(AggState *node);
 extern void ExecEndAgg(AggState *node);
 extern void ExecReScanAgg(AggState *node);
 
+extern double agg_input_rows(AggState *aggstate);
+
 extern Size hash_agg_entry_size(int numAggs);
 
 extern Datum aggregate_dummy(PG_FUNCTION_ARGS);
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index de6fc56..478a66f 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -62,22 +62,22 @@ extern Tuplesortstate *tuplesort_begin_heap(TupleDesc tupDesc,
 					 int nkeys, AttrNumber *attNums,
 					 Oid *sortOperators, Oid *sortCollations,
 					 bool *nullsFirstFlags,
-					 int workMem, bool randomAccess);
+					 int workMem, double rowCount, bool randomAccess);
 extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
 						Relation indexRel,
-						int workMem, bool randomAccess);
+						int workMem, double rowCount, bool randomAccess);
 extern Tuplesortstate *tuplesort_begin_index_btree(Relation heapRel,
 							Relation indexRel,
 							bool enforceUnique,
-							int workMem, bool randomAccess);
+							int workMem, double rowCount, bool randomAccess);
 extern Tuplesortstate *tuplesort_begin_index_hash(Relation heapRel,
 						   Relation indexRel,
 						   uint32 hash_mask,
-						   int workMem, bool randomAccess);
+						   int workMem, double rowCount, bool randomAccess);
 extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
 					  Oid sortOperator, Oid sortCollation,
 					  bool nullsFirstFlag,
-					  int workMem, bool randomAccess);
+					  int workMem, double rowCount, bool randomAccess);
 
 extern void tuplesort_set_bound(Tuplesortstate *state, int64 bound);
 
@@ -109,7 +109,8 @@ extern void tuplesort_end(Tuplesortstate *state);
 extern void tuplesort_get_stats(Tuplesortstate *state,
 					const char **sortMethod,
 					const char **spaceType,
-					long *spaceUsed);
+					long *spaceUsed,
+					int *rowsSortedMem);
 
 extern int	tuplesort_merge_order(int64 allowedMem);
 
-- 
1.9.1
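To make the TSS_MEMTAPEMERGE fetch rule easier to see in isolation: tape
tuples win ties, the pending tape tuple is cached whenever the in-memory
side wins, and once the tape is exhausted we simply drain the rest of
memtuples.  Below is a throwaway standalone sketch of just that control
flow (plain C, not part of the patch; a second sorted array stands in for
the frozen result tape, and all names are made up for illustration):

/*
 * Throwaway illustration (not from the patch):  merge an already-sorted
 * in-memory array with an already-sorted "tape", preferring the tape side
 * on ties and caching the pending tape value whenever the in-memory side
 * wins.  A sorted array read through an index stands in for the real
 * getlen()/READTUP() calls.
 */
#include <stdio.h>

int
main(void)
{
	int		memtuples[] = {4, 8, 15, 16};	/* quicksorted in-memory run */
	int		tape[] = {1, 2, 9, 10, 23};		/* spilled head of run 0, on "tape" */
	int		memtupcount = 4;
	int		tapecount = 5;
	int		current = 0;		/* memtuples read position */
	int		tapepos = 0;		/* tape read position */
	int		cached = 0;			/* is tape_cache holding a pending value? */
	int		tape_cache = 0;

	for (;;)
	{
		int		t;

		/* fetch the current tape value: cached one if present, else read */
		if (cached)
			t = tape_cache;
		else if (tapepos < tapecount)
			t = tape[tapepos++];
		else
			break;				/* tape exhausted; only memtuples remain */

		if (current < memtupcount && memtuples[current] < t)
		{
			/* memtuple wins strictly; keep the tape value for the next pass */
			tape_cache = t;
			cached = 1;
			printf("%d\n", memtuples[current++]);
		}
		else
		{
			/* tape value wins (ties go to the tape side) */
			cached = 0;
			printf("%d\n", t);
		}
	}

	/* "just_memtuples": drain what is left of the in-memory run */
	while (current < memtupcount)
		printf("%d\n", memtuples[current++]);

	return 0;
}

The real code does the same dance with SortTuple, COMPARETUP() and
getlen()/READTUP(), plus the abbreviated-key kludge noted in the comments.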
