I'll start a new thread for this, since my external sorting patch has
now evolved well past the original "quicksort with spillover"
idea... although not quite in the way I anticipated. It seems like
I've reached a good point to get some feedback.

I attach a patch series featuring a new, more comprehensive approach
to quicksorting runs during external sorts. What I have now still
includes "quicksort with spillover", but it's just a part of a larger
project. I am quite happy with the improvements in performance shown
by my testing, which I go into below.

Controversy
=========

A few weeks ago, I did not anticipate proposing that replacement
selection sort be used far less -- only somewhat less, since I was
only somewhat doubtful about the algorithm at the time. I had
originally planned on continuing to *always* use it for the first
run, both to make "quicksort with spillover" possible (thereby
sometimes avoiding significant I/O by not spilling most tuples), and
to preserve the cases traditionally considered sympathetic to
replacement selection. I thought that second or subsequent runs could
still be quicksorted, but that those traditional sympathetic cases
still had to be catered to. They mostly come down to one important
property of replacement selection: even without a strong
logical/physical correlation, the algorithm tends to produce runs
that are about twice the size of work_mem. (It's also notable that
replacement selection only produces one run with mostly presorted
input, even where input far exceeds work_mem, which is a neat trick.)

I wanted to avoid controversy, but the case against replacement
selection is too strong for me to ignore: despite these upsides, the
algorithm is obsolete, and should usually be avoided.

Replacement selection sort still has a role to play in making
"quicksort with spillover" possible (when a sympathetic case is
*anticipated*), but other than that it seems generally inferior to a
simple hybrid sort-merge strategy on modern hardware. By modern
hardware, I mean anything manufactured in the last 20 years, if not
longer.
We've already seen that the algorithm's use of a heap works badly with
modern CPU caches, but that is just one factor contributing to its
obsolescence.

The big selling point of replacement selection sort in the 20th
century was that it sometimes avoided multi-pass sorts as compared to
a simple sort-merge strategy (remember when tuplesort.c always used 7
tapes? When you need to use 7 actual magnetic tapes, rewinding is
expensive and in general this matters a lot!). We all know that memory
capacity has grown enormously since then, but we must also consider
another factor: over the same period, a simple hybrid sort-merge
strategy's capacity to get the important detail right -- avoiding a
multi-pass sort -- has increased quadratically relative to
work_mem/memory capacity. As an example, testing shows that for a
datum tuplesort that requires about 2300MB of work_mem to be completed
as a simple internal sort, this patch needs only 30MB to complete in a
single pass (see benchmark query below). I've mostly regressed that
particular property of tuplesort (it used to be less than 30MB), but
that's clearly the wrong thing to worry about, for all kinds of
reasons, probably even in the unimportant cases now forced to do
multiple passes.
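
To make the quadratic growth concrete, here is a rough illustration.
The per-tape buffer size below is an assumed figure for the sake of
the example, not a constant taken from tuplesort.c, but the result is
roughly in line with the 2300MB/30MB example above:

/*
 * Illustrative arithmetic only.  With work_mem of M bytes producing
 * runs of roughly M bytes, and each input tape of a merge needing
 * about B bytes of buffer space, a single merge pass can combine
 * roughly M / B runs, so the largest input sortable in one pass grows
 * as M * (M / B):  quadratically in M.
 */
#include <stdio.h>

int
main(void)
{
	const double	tape_buffer = 256.0 * 1024;	/* assumed per-tape buffer */
	const double	work_mem_mb[] = {30, 300, 3000};
	int			i;

	for (i = 0; i < 3; i++)
	{
		double		m = work_mem_mb[i] * 1024.0 * 1024.0;
		double		one_pass_bytes = m * (m / tape_buffer);

		printf("work_mem %5.0fMB => ~%.1fGB sortable with a one-pass merge\n",
			   work_mem_mb[i],
			   one_pass_bytes / (1024.0 * 1024.0 * 1024.0));
	}

	return 0;
}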

Multi-pass sorts
---------------------

I believe, in general, that we should consider a multi-pass sort to be
an inherently suspect thing these days, in the same way that
checkpoints occurring 5 seconds apart are: not actually abnormal, but
something that we should regard with suspicion. Can you really not
afford enough work_mem to do just one pass? Does it really make sense
to add far more I/O and CPU cost in order to avoid that other tiny
memory capacity cost?

In theory, the answer could be "yes", but it seems highly unlikely.
Not only is very little memory required to avoid a multi-pass merge
step, but as described above the amount required grows very slowly
relative to linear growth in input. I propose to add a
checkpoint_warning style warning (with a checkpoint_warning style GUC
to control it). ISTM that these days, multi-pass merges are like
saving $2 on replacing a stairwell light bulb, at the expense of
regularly stumbling down the stairs in the dark. It shouldn't matter
if you have a 50 terabyte decision support database or if you're
paying Heroku a small monthly fee to run a database backing your web
app: simply avoiding multi-pass merges is probably always the most
economical solution, and by a wide margin.
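
The third patch in the series adds exactly this warning; the message
is raised from within mergeruns() as follows (excerpt from that
patch):

			ereport(LOG,
					(errmsg("a multi-pass external merge sort is required "
							"(%ld kB memory used. %d tape maximum. %d levels)",
							memNowUsed / 1024L, state->maxTapes, state->Level + 1),
					 errhint("Consider increasing the configuration parameter \"%s\".",
							 state->querySort ? "work_mem" : "maintenance_work_mem")));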

Note that I am not skeptical of polyphase merging itself, even though
it is generally considered to be a complementary technique to
replacement selection (some less formal writing on external sorting
seemingly fails to draw a sharp distinction). Nothing has changed
there.

Patch, performance
===============

Let's focus on a multi-run sort that does not use "quicksort with
spillover", since that is all new, and is probably the most compelling
case for very large databases with hundreds of gigabytes of data to
sort.

I think that this patch requires a machine with more I/O bandwidth
than my laptop to get a proper sense of the improvement made. I've
been using a tmpfs temp_tablespace for testing, to simulate this. That
may leave me slightly optimistic about I/O costs, but you can usually
get significantly more sequential I/O bandwidth by adding additional
disks, whereas you cannot really buy new hardware to improve the
situation with excessive CPU cache misses.

Benchmark
---------------

-- Setup: 100 million tuple table with high cardinality int4 column
-- (2 billion possible int4 values)
create table big_high_cardinality_int4 as
  select (random() * 2000000000)::int4 s,
  'abcdefghijlmn'::text junk
  from generate_series(1, 100000000);
-- Make cost model hinting accurate:
analyze big_high_cardinality_int4;
checkpoint;

Let's start by comparing an external sort that uses 1/3 the memory of
an internal sort against the master branch.  That's completely unfair
on the patch, of course, but it is a useful indicator of how well
external sorts do overall. Although an external sort surely cannot be
as fast as an internal sort, it might be able to approach an internal
sort's speed when there is plenty of I/O bandwidth. That's a good
thing to aim for, I think.

-- Master (just enough memory for internal sort):
set work_mem = '2300MB';
select count(distinct(s)) from big_high_cardinality_int4;

***** Runtime after stabilization: ~33.6 seconds *****

-- Patch series, but with just over 1/3 the memory:
set work_mem = '800MB';
select count(distinct(s)) from big_high_cardinality_int4;

***** Runtime after stabilization: ~37.1 seconds *****

The patch only takes ~10% more time to execute this query, which seems
very good considering that ~1/3 the work_mem has been put to use.

trace_sort output for patch during execution of this case:

LOG:  begin datum sort: workMem = 819200, randomAccess = f
LOG:  switching to external sort with 2926 tapes: CPU 0.39s/2.66u sec
elapsed 3.06 sec
LOG:  replacement selection avg tuple size 24.00 crossover: 0.85
LOG:  hybrid sort-merge in use from row 34952532 with 100000000.00 total rows
LOG:  finished quicksorting run 1: CPU 0.39s/8.84u sec elapsed 9.24 sec
LOG:  finished writing quicksorted run 1 to tape 0: CPU 0.60s/9.61u
sec elapsed 10.22 sec
LOG:  finished quicksorting run 2: CPU 0.87s/18.61u sec elapsed 19.50 sec
LOG:  finished writing quicksorted run 2 to tape 1: CPU 1.07s/19.38u
sec elapsed 20.46 sec
LOG:  performsort starting: CPU 1.27s/21.79u sec elapsed 23.07 sec
LOG:  finished quicksorting run 3: CPU 1.27s/27.07u sec elapsed 28.35 sec
LOG:  finished writing quicksorted run 3 to tape 2: CPU 1.47s/27.69u
sec elapsed 29.18 sec
LOG:  performsort done (except 3-way final merge): CPU 1.51s/28.54u
sec elapsed 30.07 sec
LOG:  external sort ended, 146625 disk blocks used: CPU 1.76s/35.32u
sec elapsed 37.10 sec

Note that the on-tape runs are small, so their write-out cost is low
relative to CPU costs, making this query a bit sympathetic (consider
the time spent writing batches that trace_sort indicates here).
CREATE INDEX would not compare so
well with an internal sort, for example, especially if it was a
composite index or something. I've sized work_mem here in a deliberate
way, to make sure there are 3 runs of similar size by the time the
merge step is reached, which makes a small difference in the patch's
favor. All told, this seems like a very significant overall
improvement.

Now, consider master's performance with the same work_mem setting (a
fair test with comparable resource usage for master and patch):

-- Master
set work_mem = '800MB';
select count(distinct(s)) from big_high_cardinality_int4;

***** Runtime after stabilization: ~120.9 seconds *****

The patch is ~3.25x faster than master here, which also seems like a
significant improvement. That's pretty close to the improvement
previously seen for good "quicksort with spillover" cases, but
suitable for every external sort case that doesn't use "quicksort with
spillover". In other words, no variety of external sort is not
significantly improved by the patch.

I think it's safe to suppose that there are also big benefits when
multiple concurrent sort operations run on the same system -- for
example, when pg_restore runs with multiple jobs.

Worst case
---------------

Even with a traditionally sympathetic case for replacement selection
sort, the patch beats replacement selection with multiple on-tape
runs. When experimenting here, I did not forget to account for our
qsort()'s behavior in the event of *perfectly* presorted input
("Bubble sort best case" behavior [1]). Other than that, I have a hard
time thinking of an unsympathetic case for the patch, and could not
find any actual regressions with a fair amount of effort.

Abbreviated keys are not used when merging, but that doesn't seem to
be something that notably counts against the new approach (which will
have shorter runs on average). After all, the reason why abbreviated
keys aren't saved on disk for merging is that they're probably not
very useful when merging. They would resolve far fewer comparisons if
they were used during merging, and having somewhat smaller runs does
not result in significantly more non-abbreviated comparisons, even
when sorting random noise strings.

Avoiding replacement selection *altogether*
=================================

Assuming you agree with my conclusions on replacement selection sort
mostly not being worth it, we need to avoid replacement selection
except when it'll probably allow a "quicksort with spillover". In my
mind, that's now the *only* reason to use replacement selection.
Callers pass a hint to tuplesort indicating how many tuples it is
estimated will ultimately be passed before a sort is performed.
(Typically, this comes from a scan plan node's row estimate, or more
directly from the relcache for things like CREATE INDEX.)
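
For example, with the second patch in the series applied, the
executor's Sort node passes its own row estimate straight through
(excerpt from nodeSort.c as modified; the new argument is the
second-to-last one):

		tuplesortstate = tuplesort_begin_heap(tupDesc,
											  plannode->numCols,
											  plannode->sortColIdx,
											  plannode->sortOperators,
											  plannode->collations,
											  plannode->nullsFirst,
											  work_mem,
											  plannode->plan.plan_rows,
											  node->randomAccess);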

Cost model -- details
----------------------------

Second or subsequent runs *never* use replacement selection -- it is
only *considered* for the first run, right before the possible point
of initial heapification within inittapes(). The cost model is
contained within the new function useselection(), which is added by
the second patch in the series; see that patch for full details.

I have a fairly high bar for even using replacement selection for the
first run -- several factors can result in a simple hybrid sort-merge
strategy being used instead of a "quicksort with spillover", because
in general most of the benefit seems to be around CPU cache misses
rather than savings in I/O. Consider my benchmark query once more --
with replacement selection used for the first run (e.g., with just the
first patch in the series applied, or with the
"optimize_avoid_selection" debug GUC set to
"off"), I found that it took over twice as long to execute, even
though the second-or-subsequent (now smaller) runs were quicksorted
just the same, and were all merged just the same.
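
To give a feel for the shape of the decision, here is a deliberately
simplified, hypothetical sketch. The function name and parameters are
mine, and this is not the formula the patch actually uses; the real
decision lives in useselection(), which derives the crossover point
from the average tuple size (the "crossover: 0.85" figure in the
trace_sort output above):

#include <stdbool.h>

/*
 * Only use replacement selection for the first run when the caller's
 * row count hint suggests that the input exceeds what fits in memory
 * by a small enough margin that "quicksort with spillover" could
 * plausibly avoid writing out most tuples.
 */
bool
use_replacement_selection_sketch(double row_count_hint,
								 double rows_that_fit_in_memory,
								 double crossover)
{
	if (row_count_hint <= rows_that_fit_in_memory)
		return false;		/* expect a fully internal sort instead */

	/* worthwhile only when most of the input is expected to fit */
	return rows_that_fit_in_memory >= crossover * row_count_hint;
}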

The numbers should make it obvious why I gave in to the temptation of
adding an ad-hoc, tuplesort-private cost model. At this point, I'd
rather scrap "quicksort with spillover" (and the use of replacement
selection under all possible circumstances) than scrap the idea of a
cost model. That would make more sense, even though it would give up
on the idea of saving most I/O where the work_mem threshold is only
crossed by a small amount.

Future work
=========

I anticipate a number of other things building on the first patch in
the series, some of which are already worked out to some degree.

Asynchronous I/O
-------------------------

This patch leaves open the possibility of using something like
libaio/librt for sorting. That would probably use half of memtuples as
scratch space, while the other half is quicksorted.

Memory prefetching
---------------------------

To test what role memory prefetching is likely to have here, I attach
a custom version of my tuplesort/tuplestore prefetch patch, with
prefetching added to the WRITETUP()-calling code used by "quicksort
with spillover" and by batch dumping of runs. This seems to help
performance measurably. However, I guess it shouldn't really be
considered as part of this patch. It can follow the initial commit of
the big, base patch (or will become part of the base patch if and when
prefetching is committed first).
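
For reference, the hint itself boils down to a thin wrapper around the
GCC intrinsic, requesting a read with no expectation of temporal
locality (excerpt from the attached prefetch patch):

#if defined(USE_MEM_PREFETCH)
#define pg_rfetch(addr)		__builtin_prefetch((addr), 0, 0)
#endif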

cost_sort() changes
--------------------------

I had every intention of making cost_sort() a continuous cost function
as part of this work. This could be justified by "quicksort with
spillover" allowing tuplesort to "blend" from internal to external
sorting as input size is gradually increased. This seemed like
something that would have significant non-obvious benefits in several
other areas. However, I've put off dealing with making any change to
cost_sort() because of concerns about the complexity of overlaying
such changes on top of the tuplesort-private cost model.

I think that this will need to be discussed in a lot more detail. As a
further matter, materialization of sort nodes will probably also
require tweaks to the costing for "quicksort with spillover". Recall
that "quicksort with spillover" can only work for !randomAccess
tuplesort callers.

Run size
------------

This patch continues to have tuplesort determine run size based on the
availability of work_mem only. It does not entirely fix the problem of
having work_mem sizing impact performance in counter-intuitive ways.
In other words, smaller work_mem sizes can still be faster. It does
make that general situation much better, though, because quicksort is
a cache-oblivious algorithm. Smaller work_mem sizes are sometimes a
bit faster, but never dramatically faster.

In general, the whole idea of making run size as big as possible is
bogus, unless that enables or is likely to enable a "quicksort with
spillover". The caller-supplied row count hint I've added may in the
future be extended to determine optimal run size ahead of time, when
it's perfectly clear (leaving aside misestimation) that a fully
internal sort (or "quicksort with spillover") will not occur. This
will result in faster external sorts where additional work_mem cannot
be put to good use. As a side benefit, external sorts will not be
effectively wasting a large amount of memory.

The cost model we eventually come up with to determine optimal run
size ought to balance certain things. Assuming a one-pass merge step,
we should balance the time lost waiting on the first run, and the time
spent quicksorting the last run, against the gradual increase in cost
during the merge step. Maybe the non-use of abbreviated keys during
the merge step should also be considered. Alternatively, the run size
may be determined by a GUC that is typically sized at drive controller
cache size (e.g. 1GB) when any kind of I/O avoidance for the sort
appears impossible.

[1] Commit a3f0b3d6
-- 
Peter Geoghegan
From d861c20bc2dab02d81040c9a40252b2e5ad08ef3 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <peter.geoghega...@gmail.com>
Date: Wed, 29 Jul 2015 15:38:12 -0700
Subject: [PATCH 5/5] Add cursory regression tests for sorting

This is not intended to be a formal patch submission.  Tests are added
that happened to be useful during the development of the "quicksort with
spillover" patch, as the regression tests currently have precisely zero
coverage for any variety of external sort operation.  The tests are
provided as a convenience to reviewers of that patch only.

In the long run, there should be comprehensive smoke-testing of these
cases (probably not in the standard regression tests), but this patch
does not pretend to be any kind of basis for that.

The tests added have a number of obvious problems:

* They take far too long to run to be in the standard regression test
suite, and yet they're run as part of that suite.  With a little effort,
they could be made to run faster with no appreciable loss of coverage,
but that didn't happen.

* They're far from comprehensive.

* The tests require several GB of disk space to run.

* They're not portable.  They might even be extremely non-portable due
to implementation differences across platform pseudo-random number
generators.

The important point is that each query tested gives consistent results.
There is no reason to think that varying work_mem settings will change
which basic approach to sorting each query takes relative to my
original testing (at least assuming a 64-bit platform), which is also
important.
---
 src/test/regress/expected/sorting.out | 309 ++++++++++++++++++++++++++++++++++
 src/test/regress/parallel_schedule    |   1 +
 src/test/regress/serial_schedule      |   1 +
 src/test/regress/sql/sorting.sql      | 115 +++++++++++++
 4 files changed, 426 insertions(+)
 create mode 100644 src/test/regress/expected/sorting.out
 create mode 100644 src/test/regress/sql/sorting.sql

diff --git a/src/test/regress/expected/sorting.out b/src/test/regress/expected/sorting.out
new file mode 100644
index 0000000..6db00b7
--- /dev/null
+++ b/src/test/regress/expected/sorting.out
@@ -0,0 +1,309 @@
+--
+-- sorting tests
+--
+-- Seed PRNG;  this probably isn't portable
+select setseed(1);
+ setseed 
+---------
+ 
+(1 row)
+
+--
+-- int4 test (10 million tuples, medium cardinality)
+--
+create unlogged table int4_sort_tbl as
+  select (random() * 1000000)::int4 s, 'abcdefghijlmn'::text junk
+  from generate_series(1, 10000000);
+--
+-- int4 test (10 million tuples, high cardinality)
+--
+create unlogged table highcard_int4_sort_tbl as
+  select (random() * 100000000)::int4 s, 'abcdefghijlmn'::text junk
+  from generate_series(1, 10000000);
+--
+-- int4 test (10 million tuples, low cardinality)
+--
+create unlogged table lowcard_int4_sort_tbl as
+  select (random() * 100)::int4 s, 'abcdefghijlmn'::text junk
+  from generate_series(1, 10000000);
+--
+-- int4 test (10 million tuples, medium cardinality, correlated)
+--
+create unlogged table int4_sort_tbl_correlated as
+  select (random() * 1000000)::int4 s, 'abcdefghijlmn'::text junk
+  from generate_series(1, 10000000) order by 1 asc;
+--
+-- int4 test (10 million tuples, medium cardinality, inverse correlated)
+--
+create unlogged table int4_sort_tbl_inverse as
+  select (random() * 1000000)::int4 s, 'abcdefghijlmn'::text junk
+  from generate_series(1, 10000000) order by 1 desc;
+-- Results should be consistent:
+set work_mem = '64MB';
+select count(distinct(s)) from int4_sort_tbl;
+ count  
+--------
+ 999949
+(1 row)
+
+select count(distinct(s)) from highcard_int4_sort_tbl;
+  count  
+---------
+ 9515397
+(1 row)
+
+select count(distinct(s)) from lowcard_int4_sort_tbl;
+ count 
+-------
+   101
+(1 row)
+
+select count(distinct(s)) from int4_sort_tbl_correlated;
+ count  
+--------
+ 999963
+(1 row)
+
+select count(distinct(s)) from int4_sort_tbl_inverse;
+ count  
+--------
+ 999952
+(1 row)
+
+set work_mem = '100MB';
+select count(distinct(s)) from int4_sort_tbl;
+ count  
+--------
+ 999949
+(1 row)
+
+select count(distinct(s)) from highcard_int4_sort_tbl;
+  count  
+---------
+ 9515397
+(1 row)
+
+select count(distinct(s)) from lowcard_int4_sort_tbl;
+ count 
+-------
+   101
+(1 row)
+
+select count(distinct(s)) from int4_sort_tbl_correlated;
+ count  
+--------
+ 999963
+(1 row)
+
+select count(distinct(s)) from int4_sort_tbl_inverse;
+ count  
+--------
+ 999952
+(1 row)
+
+set work_mem = '110MB';
+select count(distinct(s)) from int4_sort_tbl;
+ count  
+--------
+ 999949
+(1 row)
+
+select count(distinct(s)) from highcard_int4_sort_tbl;
+  count  
+---------
+ 9515397
+(1 row)
+
+select count(distinct(s)) from lowcard_int4_sort_tbl;
+ count 
+-------
+   101
+(1 row)
+
+select count(distinct(s)) from int4_sort_tbl_correlated;
+ count  
+--------
+ 999963
+(1 row)
+
+select count(distinct(s)) from int4_sort_tbl_inverse;
+ count  
+--------
+ 999952
+(1 row)
+
+set work_mem = '128MB';
+select count(distinct(s)) from int4_sort_tbl;
+ count  
+--------
+ 999949
+(1 row)
+
+select count(distinct(s)) from highcard_int4_sort_tbl;
+  count  
+---------
+ 9515397
+(1 row)
+
+select count(distinct(s)) from lowcard_int4_sort_tbl;
+ count 
+-------
+   101
+(1 row)
+
+select count(distinct(s)) from int4_sort_tbl_correlated;
+ count  
+--------
+ 999963
+(1 row)
+
+select count(distinct(s)) from int4_sort_tbl_inverse;
+ count  
+--------
+ 999952
+(1 row)
+
+set work_mem = '140MB';
+select count(distinct(s)) from int4_sort_tbl;
+ count  
+--------
+ 999949
+(1 row)
+
+select count(distinct(s)) from highcard_int4_sort_tbl;
+  count  
+---------
+ 9515397
+(1 row)
+
+select count(distinct(s)) from lowcard_int4_sort_tbl;
+ count 
+-------
+   101
+(1 row)
+
+select count(distinct(s)) from int4_sort_tbl_correlated;
+ count  
+--------
+ 999963
+(1 row)
+
+select count(distinct(s)) from int4_sort_tbl_inverse;
+ count  
+--------
+ 999952
+(1 row)
+
+set work_mem = '150MB';
+select count(distinct(s)) from int4_sort_tbl;
+ count  
+--------
+ 999949
+(1 row)
+
+select count(distinct(s)) from highcard_int4_sort_tbl;
+  count  
+---------
+ 9515397
+(1 row)
+
+select count(distinct(s)) from lowcard_int4_sort_tbl;
+ count 
+-------
+   101
+(1 row)
+
+select count(distinct(s)) from int4_sort_tbl_correlated;
+ count  
+--------
+ 999963
+(1 row)
+
+select count(distinct(s)) from int4_sort_tbl_inverse;
+ count  
+--------
+ 999952
+(1 row)
+
+-- should be in-memory:
+set work_mem = '512MB';
+select count(distinct(s)) from int4_sort_tbl;
+ count  
+--------
+ 999949
+(1 row)
+
+select count(distinct(s)) from highcard_int4_sort_tbl;
+  count  
+---------
+ 9515397
+(1 row)
+
+select count(distinct(s)) from lowcard_int4_sort_tbl;
+ count 
+-------
+   101
+(1 row)
+
+select count(distinct(s)) from int4_sort_tbl_correlated;
+ count  
+--------
+ 999963
+(1 row)
+
+select count(distinct(s)) from int4_sort_tbl_inverse;
+ count  
+--------
+ 999952
+(1 row)
+
+--
+-- text test (uses abbreviated keys, 10 million tuples)
+--
+select setseed(1);
+ setseed 
+---------
+ 
+(1 row)
+
+create unlogged table text_sort_tbl as
+  select (random() * 100000)::int4::text s
+  from generate_series(1, 10000000);
+-- Start with sort that results in 3-way final merge:
+set work_mem = '190MB';
+select count(distinct(s)) from text_sort_tbl;
+ count  
+--------
+ 100001
+(1 row)
+
+-- Uses optimization where it's marginal:
+set work_mem = '200MB';
+select count(distinct(s)) from text_sort_tbl;
+ count  
+--------
+ 100001
+(1 row)
+
+-- Uses optimization where it's favorable:
+set work_mem = '450MB';
+select count(distinct(s)) from text_sort_tbl;
+ count  
+--------
+ 100001
+(1 row)
+
+-- internal sort:
+set work_mem = '500MB';
+select count(distinct(s)) from text_sort_tbl;
+ count  
+--------
+ 100001
+(1 row)
+
+drop table int4_sort_tbl;
+drop table highcard_int4_sort_tbl;
+drop table lowcard_int4_sort_tbl;
+drop table text_sort_tbl;
+drop table int4_sort_tbl_correlated;
+drop table int4_sort_tbl_inverse;
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 4df15de..7ff656a 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -37,6 +37,7 @@ test: geometry horology regex oidjoins type_sanity opr_sanity
 # ----------
 test: insert
 test: insert_conflict
+test: sorting
 test: create_function_1
 test: create_type
 test: create_table
diff --git a/src/test/regress/serial_schedule b/src/test/regress/serial_schedule
index 15d74d4..ebe7de0 100644
--- a/src/test/regress/serial_schedule
+++ b/src/test/regress/serial_schedule
@@ -51,6 +51,7 @@ test: type_sanity
 test: opr_sanity
 test: insert
 test: insert_conflict
+test: sorting
 test: create_function_1
 test: create_type
 test: create_table
diff --git a/src/test/regress/sql/sorting.sql b/src/test/regress/sql/sorting.sql
new file mode 100644
index 0000000..ae51b97
--- /dev/null
+++ b/src/test/regress/sql/sorting.sql
@@ -0,0 +1,115 @@
+--
+-- sorting tests
+--
+
+-- Seed PRNG;  this probably isn't portable
+select setseed(1);
+
+--
+-- int4 test (10 million tuples, medium cardinality)
+--
+create unlogged table int4_sort_tbl as
+  select (random() * 1000000)::int4 s, 'abcdefghijlmn'::text junk
+  from generate_series(1, 10000000);
+
+--
+-- int4 test (10 million tuples, high cardinality)
+--
+create unlogged table highcard_int4_sort_tbl as
+  select (random() * 100000000)::int4 s, 'abcdefghijlmn'::text junk
+  from generate_series(1, 10000000);
+
+--
+-- int4 test (10 million tuples, low cardinality)
+--
+create unlogged table lowcard_int4_sort_tbl as
+  select (random() * 100)::int4 s, 'abcdefghijlmn'::text junk
+  from generate_series(1, 10000000);
+
+--
+-- int4 test (10 million tuples, medium cardinality, correlated)
+--
+create unlogged table int4_sort_tbl_correlated as
+  select (random() * 1000000)::int4 s, 'abcdefghijlmn'::text junk
+  from generate_series(1, 10000000) order by 1 asc;
+
+--
+-- int4 test (10 million tuples, medium cardinality, inverse correlated)
+--
+create unlogged table int4_sort_tbl_inverse as
+  select (random() * 1000000)::int4 s, 'abcdefghijlmn'::text junk
+  from generate_series(1, 10000000) order by 1 desc;
+
+-- Results should be consistent:
+set work_mem = '64MB';
+select count(distinct(s)) from int4_sort_tbl;
+select count(distinct(s)) from highcard_int4_sort_tbl;
+select count(distinct(s)) from lowcard_int4_sort_tbl;
+select count(distinct(s)) from int4_sort_tbl_correlated;
+select count(distinct(s)) from int4_sort_tbl_inverse;
+set work_mem = '100MB';
+select count(distinct(s)) from int4_sort_tbl;
+select count(distinct(s)) from highcard_int4_sort_tbl;
+select count(distinct(s)) from lowcard_int4_sort_tbl;
+select count(distinct(s)) from int4_sort_tbl_correlated;
+select count(distinct(s)) from int4_sort_tbl_inverse;
+set work_mem = '110MB';
+select count(distinct(s)) from int4_sort_tbl;
+select count(distinct(s)) from highcard_int4_sort_tbl;
+select count(distinct(s)) from lowcard_int4_sort_tbl;
+select count(distinct(s)) from int4_sort_tbl_correlated;
+select count(distinct(s)) from int4_sort_tbl_inverse;
+set work_mem = '128MB';
+select count(distinct(s)) from int4_sort_tbl;
+select count(distinct(s)) from highcard_int4_sort_tbl;
+select count(distinct(s)) from lowcard_int4_sort_tbl;
+select count(distinct(s)) from int4_sort_tbl_correlated;
+select count(distinct(s)) from int4_sort_tbl_inverse;
+set work_mem = '140MB';
+select count(distinct(s)) from int4_sort_tbl;
+select count(distinct(s)) from highcard_int4_sort_tbl;
+select count(distinct(s)) from lowcard_int4_sort_tbl;
+select count(distinct(s)) from int4_sort_tbl_correlated;
+select count(distinct(s)) from int4_sort_tbl_inverse;
+set work_mem = '150MB';
+select count(distinct(s)) from int4_sort_tbl;
+select count(distinct(s)) from highcard_int4_sort_tbl;
+select count(distinct(s)) from lowcard_int4_sort_tbl;
+select count(distinct(s)) from int4_sort_tbl_correlated;
+select count(distinct(s)) from int4_sort_tbl_inverse;
+-- should be in-memory:
+set work_mem = '512MB';
+select count(distinct(s)) from int4_sort_tbl;
+select count(distinct(s)) from highcard_int4_sort_tbl;
+select count(distinct(s)) from lowcard_int4_sort_tbl;
+select count(distinct(s)) from int4_sort_tbl_correlated;
+select count(distinct(s)) from int4_sort_tbl_inverse;
+
+--
+-- text test (uses abbreviated keys, 10 million tuples)
+--
+select setseed(1);
+create unlogged table text_sort_tbl as
+  select (random() * 100000)::int4::text s
+  from generate_series(1, 10000000);
+
+-- Start with sort that results in 3-way final merge:
+set work_mem = '190MB';
+select count(distinct(s)) from text_sort_tbl;
+-- Uses optimization where it's marginal:
+set work_mem = '200MB';
+select count(distinct(s)) from text_sort_tbl;
+-- Uses optimization where it's favorable:
+set work_mem = '450MB';
+select count(distinct(s)) from text_sort_tbl;
+-- internal sort:
+set work_mem = '500MB';
+select count(distinct(s)) from text_sort_tbl;
+
+drop table int4_sort_tbl;
+drop table highcard_int4_sort_tbl;
+drop table lowcard_int4_sort_tbl;
+drop table text_sort_tbl;
+drop table int4_sort_tbl_correlated;
+drop table int4_sort_tbl_inverse;
+
-- 
1.9.1

From f7befa582960c0034b24180c65eadb4ffc4dfe6d Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <peter.geoghega...@gmail.com>
Date: Sun, 12 Jul 2015 13:14:01 -0700
Subject: [PATCH 4/5] Prefetch from memtuples array in tuplesort

This patch is almost the same as a canonical version, which appears
here:  https://commitfest.postgresql.org/6/305/

This version adds some additional tricks specific to new cases for
external sorts.  Of particular interest here is the prefetching of each
"tuple proper" during writing of batches of tuples.  This is not
intended to be reviewed as part of the external sorting work, and is
provided only as a convenience to reviewers who would like to see where
prefetching can help with external sorts, too.

Original canonical version details:

Testing shows that prefetching the "tuple proper" of a slightly later
SortTuple in the memtuples array during each of many sequential,
in-logical-order SortTuple fetches speeds up various sorting-intensive
operations considerably.  For example, B-Tree index builds are
accelerated as leaf pages are created from the memtuples array.
(i.e.  The operation following actually "performing" the sort, but
before a tuplesort_end() call is made as a B-Tree spool is destroyed.)

Similarly, ordered set aggregates (all cases except the datumsort case
with a pass-by-value type), and regular heap tuplesorts benefit to about
the same degree.  The optimization is only used when sorts fit in
memory, though.

Also, prefetch a few places ahead within the analogous "fetching" point
in tuplestore.c.  This appears to offer similar benefits in certain
cases.  For example, queries involving large common table expressions
significantly benefit.
---
 config/c-compiler.m4                | 17 +++++++++++++++++
 configure                           | 30 ++++++++++++++++++++++++++++++
 configure.in                        |  1 +
 src/backend/utils/sort/tuplesort.c  | 36 ++++++++++++++++++++++++++++++++++++
 src/backend/utils/sort/tuplestore.c | 13 +++++++++++++
 src/include/c.h                     | 14 ++++++++++++++
 src/include/pg_config.h.in          |  3 +++
 src/include/pg_config.h.win32       |  3 +++
 src/include/pg_config_manual.h      | 10 ++++++++++
 9 files changed, 127 insertions(+)

diff --git a/config/c-compiler.m4 b/config/c-compiler.m4
index 397e1b0..c730da5 100644
--- a/config/c-compiler.m4
+++ b/config/c-compiler.m4
@@ -253,6 +253,23 @@ fi])# PGAC_C_BUILTIN_UNREACHABLE
 
 
 
+# PGAC_C_BUILTIN_PREFETCH
+# -------------------------
+# Check if the C compiler understands __builtin_prefetch(),
+# and define HAVE__BUILTIN_PREFETCH if so.
+AC_DEFUN([PGAC_C_BUILTIN_PREFETCH],
+[AC_CACHE_CHECK(for __builtin_prefetch, pgac_cv__builtin_prefetch,
+[AC_COMPILE_IFELSE([AC_LANG_PROGRAM([],
+[int i = 0;__builtin_prefetch(&i, 0, 3);])],
+[pgac_cv__builtin_prefetch=yes],
+[pgac_cv__builtin_prefetch=no])])
+if test x"$pgac_cv__builtin_prefetch" = xyes ; then
+AC_DEFINE(HAVE__BUILTIN_PREFETCH, 1,
+          [Define to 1 if your compiler understands __builtin_prefetch.])
+fi])# PGAC_C_BUILTIN_PREFETCH
+
+
+
 # PGAC_C_VA_ARGS
 # --------------
 # Check if the C compiler understands C99-style variadic macros,
diff --git a/configure b/configure
index ebb5cac..a3a413f 100755
--- a/configure
+++ b/configure
@@ -11315,6 +11315,36 @@ if test x"$pgac_cv__builtin_unreachable" = xyes ; then
 $as_echo "#define HAVE__BUILTIN_UNREACHABLE 1" >>confdefs.h
 
 fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_prefetch" >&5
+$as_echo_n "checking for __builtin_prefetch... " >&6; }
+if ${pgac_cv__builtin_prefetch+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+int
+main ()
+{
+int i = 0;__builtin_prefetch(&i, 0, 3);
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_compile "$LINENO"; then :
+  pgac_cv__builtin_prefetch=yes
+else
+  pgac_cv__builtin_prefetch=no
+fi
+rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__builtin_prefetch" >&5
+$as_echo "$pgac_cv__builtin_prefetch" >&6; }
+if test x"$pgac_cv__builtin_prefetch" = xyes ; then
+
+$as_echo "#define HAVE__BUILTIN_PREFETCH 1" >>confdefs.h
+
+fi
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for __VA_ARGS__" >&5
 $as_echo_n "checking for __VA_ARGS__... " >&6; }
 if ${pgac_cv__va_args+:} false; then :
diff --git a/configure.in b/configure.in
index a28f9dd..778dd61 100644
--- a/configure.in
+++ b/configure.in
@@ -1319,6 +1319,7 @@ PGAC_C_TYPES_COMPATIBLE
 PGAC_C_BUILTIN_BSWAP32
 PGAC_C_BUILTIN_CONSTANT_P
 PGAC_C_BUILTIN_UNREACHABLE
+PGAC_C_BUILTIN_PREFETCH
 PGAC_C_VA_ARGS
 PGAC_STRUCT_TIMEZONE
 PGAC_UNION_SEMUN
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index f856cf0..e60f561 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -1796,6 +1796,26 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
 				if (state->current < state->memtupcount)
 				{
 					*stup = state->memtuples[state->current++];
+
+					/*
+					 * Perform memory prefetch of "tuple proper" of the
+					 * SortTuple that's three places ahead of current
+					 * (which is returned to caller).  Testing shows that
+					 * this significantly boosts the performance for
+					 * TSS_SORTEDINMEM "forward" callers by hiding memory
+					 * latency behind their processing of returned tuples.
+					 *
+					 * Don't do this for pass-by-value datum sorts;  even
+					 * though hinting a NULL address does not affect
+					 * correctness, it would have a noticeable overhead
+					 * here.
+					 */
+#ifdef USE_MEM_PREFETCH
+					if (stup->tuple != NULL &&
+						state->current + 2 < state->memtupcount)
+						pg_rfetch(state->memtuples[state->current + 2].tuple);
+#endif
+
 					return true;
 				}
 				state->eof_reached = true;
@@ -2024,6 +2044,17 @@ just_memtuples:
 			if (state->current < state->memtupcount)
 			{
 				*stup = state->memtuples[state->current++];
+
+				/*
+				 * Once this point is reached, rationale for memory
+				 * prefetching is identical to TSS_SORTEDINMEM case.
+				 */
+#ifdef USE_MEM_PREFETCH
+				if (stup->tuple != NULL &&
+					state->current + 2 < state->memtupcount)
+					pg_rfetch(state->memtuples[state->current + 2].tuple);
+#endif
+
 				return true;
 			}
 			state->eof_reached = true;
@@ -3142,6 +3173,11 @@ dumpbatch(Tuplesortstate *state, bool alltuples)
 		WRITETUP(state, state->tp_tapenum[state->destTape],
 				 &state->memtuples[i]);
 		state->memtupcount--;
+
+#ifdef USE_MEM_PREFETCH
+		if (state->memtuples[i].tuple != NULL && i + 2 < memtupwrite)
+			pg_rfetch(state->memtuples[i + 2].tuple);
+#endif
 	}
 	markrunend(state, state->tp_tapenum[state->destTape]);
 	state->tp_runs[state->destTape]++;
diff --git a/src/backend/utils/sort/tuplestore.c b/src/backend/utils/sort/tuplestore.c
index 51f474d..15f956d 100644
--- a/src/backend/utils/sort/tuplestore.c
+++ b/src/backend/utils/sort/tuplestore.c
@@ -902,6 +902,19 @@ tuplestore_gettuple(Tuplestorestate *state, bool forward,
 					return NULL;
 				if (readptr->current < state->memtupcount)
 				{
+					/*
+					 * Perform memory prefetch of tuple that's three places
+					 * ahead of current (which is returned to caller).
+					 * Testing shows that this significantly boosts the
+					 * performance for TSS_INMEM "forward" callers by
+					 * hiding memory latency behind their processing of
+					 * returned tuples.
+					 */
+#ifdef USE_MEM_PREFETCH
+					if (readptr->current + 3 < state->memtupcount)
+						pg_rfetch(state->memtuples[readptr->current + 3]);
+#endif
+
 					/* We have another tuple, so return it */
 					return state->memtuples[readptr->current++];
 				}
diff --git a/src/include/c.h b/src/include/c.h
index b719eb9..67e3063 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -932,6 +932,20 @@ typedef NameData *Name;
 #define pg_unreachable() abort()
 #endif
 
+/*
+ * Prefetch support -- Support memory prefetching hints on some platforms.
+ *
+ * pg_rfetch() is specialized for the case where an array is accessed
+ * sequentially, and we can prefetch a pointer within the next element (or an
+ * even later element) in order to hide memory latency.  This case involves
+ * prefetching addresses with low temporal locality.  Note that it's rather
+ * difficult to get any kind of speedup using pg_rfetch();  any use of the
+ * intrinsic should be carefully tested.  Also note that it's okay to pass it
+ * an invalid or NULL address, although it's best avoided.
+ */
+#if defined(USE_MEM_PREFETCH)
+#define pg_rfetch(addr)		__builtin_prefetch((addr), 0, 0)
+#endif
 
 /* ----------------------------------------------------------------
  *				Section 8:	random stuff
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 9285c62..a8e5683 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -669,6 +669,9 @@
 /* Define to 1 if your compiler understands __builtin_constant_p. */
 #undef HAVE__BUILTIN_CONSTANT_P
 
+/* Define to 1 if your compiler understands __builtin_prefetch. */
+#undef HAVE__BUILTIN_PREFETCH
+
 /* Define to 1 if your compiler understands __builtin_types_compatible_p. */
 #undef HAVE__BUILTIN_TYPES_COMPATIBLE_P
 
diff --git a/src/include/pg_config.h.win32 b/src/include/pg_config.h.win32
index ad61392..a2f6eb3 100644
--- a/src/include/pg_config.h.win32
+++ b/src/include/pg_config.h.win32
@@ -523,6 +523,9 @@
 /* Define to 1 if your compiler understands __builtin_constant_p. */
 /* #undef HAVE__BUILTIN_CONSTANT_P */
 
+/* Define to 1 if your compiler understands __builtin_prefetch. */
+#undef HAVE__BUILTIN_PREFETCH
+
 /* Define to 1 if your compiler understands __builtin_types_compatible_p. */
 /* #undef HAVE__BUILTIN_TYPES_COMPATIBLE_P */
 
diff --git a/src/include/pg_config_manual.h b/src/include/pg_config_manual.h
index e278fa0..4c7b1d5 100644
--- a/src/include/pg_config_manual.h
+++ b/src/include/pg_config_manual.h
@@ -153,6 +153,16 @@
 #endif
 
 /*
+ * USE_MEM_PREFETCH controls whether Postgres will attempt to use memory
+ * prefetching.  Usually the automatic configure tests are sufficient, but
+ * it's conceivable that using prefetching is counter-productive on some
+ * platforms.  If necessary you can remove the #define here.
+ */
+#ifdef HAVE__BUILTIN_PREFETCH
+#define USE_MEM_PREFETCH
+#endif
+
+/*
  * USE_SSL code should be compiled only when compiling with an SSL
  * implementation.  (Currently, only OpenSSL is supported, but we might add
  * more implementations in the future.)
-- 
1.9.1

From 7ef577fb46984fc7e92be54d83355454490c0217 Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <peter.geoghega...@gmail.com>
Date: Sun, 16 Aug 2015 21:17:16 -0700
Subject: [PATCH 3/5] Log requirement for multiple external sort passes

The new log message warns users about a sort requiring multiple passes.
This is in the same spirit as checkpoint_warning.  It seems very
ill-advised to ever attempt a sort that will require multiple passes on
contemporary hardware, since that can greatly increase the amount of I/O
required, and yet can only occur when available memory is a small
fraction of what is required for a fully internal sort.

A new GUC, multipass_warning, controls this log message.  The default
is 'on'.  Also added is a new debug GUC (not available in a standard
build) that controls whether replacement selection can be avoided for
the first run.

During review, this patch may be useful for highlighting how effectively
replacement selection sort prevents multiple passes during the merge
step (relative to a hybrid sort-merge strategy) in practice.
---
 doc/src/sgml/config.sgml           | 22 +++++++++++++++++++++
 src/backend/utils/misc/guc.c       | 29 ++++++++++++++++++++++++---
 src/backend/utils/sort/tuplesort.c | 40 ++++++++++++++++++++++++++++++++++++--
 src/include/utils/guc.h            |  2 ++
 4 files changed, 88 insertions(+), 5 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index e900dccb..3cd94a7 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1556,6 +1556,28 @@ include_dir 'conf.d'
      <title>Disk</title>
 
      <variablelist>
+     <varlistentry id="guc-multipass-warning" xreflabel="multipass_warning">
+      <term><varname>multipass_warning</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>multipass_warning</> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Write a message to the server log if an external sort
+        operation requires multiple passes (which suggests that
+        <varname>work_mem</> or <varname>maintenance_work_mem</> may
+        need to be raised).  Only a small fraction of the memory
+        required for an internal sort is required for an external sort
+        that makes no more than a single pass (typically less than
+        1%).  Since multi-pass sorts are often much slower, it is
+        advisable to avoid them altogether whenever possible.
+        The default setting is <literal>on</>.
+        Only superusers can change this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-temp-file-limit" xreflabel="temp_file_limit">
       <term><varname>temp_file_limit</varname> (<type>integer</type>)
       <indexterm>
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index b3dac51..3302648 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -115,8 +115,9 @@ extern bool synchronize_seqscans;
 #ifdef TRACE_SYNCSCAN
 extern bool trace_syncscan;
 #endif
-#ifdef DEBUG_BOUNDED_SORT
+#ifdef DEBUG_SORT
 extern bool optimize_bounded_sort;
+extern bool optimize_avoid_selection;
 #endif
 
 static int	GUC_check_errcode_value;
@@ -1041,6 +1042,16 @@ static struct config_bool ConfigureNamesBool[] =
 		NULL, NULL, NULL
 	},
 	{
+		{"multipass_warning", PGC_SUSET, LOGGING_WHAT,
+			gettext_noop("Enables warnings if external sorts require more than one pass."),
+			gettext_noop("Write a message to the server log if more than one pass is required "
+						 "for an external sort operation.")
+		},
+		&multipass_warning,
+		true,
+		NULL, NULL, NULL
+	},
+	{
 		{"debug_assertions", PGC_INTERNAL, PRESET_OPTIONS,
 			gettext_noop("Shows whether the running server has assertion checks enabled."),
 			NULL,
@@ -1449,8 +1460,8 @@ static struct config_bool ConfigureNamesBool[] =
 	},
 #endif
 
-#ifdef DEBUG_BOUNDED_SORT
-	/* this is undocumented because not exposed in a standard build */
+#ifdef DEBUG_SORT
+	/* these are undocumented because not exposed in a standard build */
 	{
 		{
 			"optimize_bounded_sort", PGC_USERSET, QUERY_TUNING_METHOD,
@@ -1462,6 +1473,18 @@ static struct config_bool ConfigureNamesBool[] =
 		true,
 		NULL, NULL, NULL
 	},
+
+	{
+		{
+			"optimize_avoid_selection", PGC_USERSET, QUERY_TUNING_METHOD,
+			gettext_noop("Enable avoiding replacement selection using heap sort."),
+			NULL,
+			GUC_NOT_IN_SAMPLE
+		},
+		&optimize_avoid_selection,
+		true,
+		NULL, NULL, NULL
+	},
 #endif
 
 #ifdef WAL_DEBUG
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index 6d766d2..f856cf0 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -160,8 +160,11 @@
 bool		trace_sort = false;
 #endif
 
-#ifdef DEBUG_BOUNDED_SORT
+bool		multipass_warning = true;
+
+#ifdef DEBUG_SORT
 bool		optimize_bounded_sort = true;
+bool		optimize_avoid_selection = true;
 #endif
 
 
@@ -250,6 +253,7 @@ struct Tuplesortstate
 {
 	TupSortStatus status;		/* enumerated value as shown above */
 	int			nKeys;			/* number of columns in sort key */
+	bool		querySort;		/* sort associated with query execution */
 	double		rowNumHint;		/* caller's hint of total # of rows */
 	bool		randomAccess;	/* did caller request random access? */
 	bool		bounded;		/* did caller specify a maximum number of
@@ -697,6 +701,7 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 #endif
 
 	state->nKeys = nkeys;
+	state->querySort = true;
 
 	TRACE_POSTGRESQL_SORT_START(HEAP_SORT,
 								false,	/* no unique check */
@@ -771,6 +776,7 @@ tuplesort_begin_cluster(TupleDesc tupDesc,
 #endif
 
 	state->nKeys = RelationGetNumberOfAttributes(indexRel);
+	state->querySort = false;
 
 	TRACE_POSTGRESQL_SORT_START(CLUSTER_SORT,
 								false,	/* no unique check */
@@ -864,6 +870,7 @@ tuplesort_begin_index_btree(Relation heapRel,
 #endif
 
 	state->nKeys = RelationGetNumberOfAttributes(indexRel);
+	state->querySort = false;
 
 	TRACE_POSTGRESQL_SORT_START(INDEX_SORT,
 								enforceUnique,
@@ -939,6 +946,7 @@ tuplesort_begin_index_hash(Relation heapRel,
 #endif
 
 	state->nKeys = 1;			/* Only one sort column, the hash code */
+	state->querySort = false;
 
 	state->comparetup = comparetup_index_hash;
 	state->copytup = copytup_index;
@@ -976,6 +984,7 @@ tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 #endif
 
 	state->nKeys = 1;			/* always a one-column sort */
+	state->querySort = true;
 
 	TRACE_POSTGRESQL_SORT_START(DATUM_SORT,
 								false,	/* no unique check */
@@ -1042,7 +1051,7 @@ tuplesort_set_bound(Tuplesortstate *state, int64 bound)
 	Assert(state->memtupcount == 0);
 	Assert(!state->bounded);
 
-#ifdef DEBUG_BOUNDED_SORT
+#ifdef DEBUG_SORT
 	/* Honor GUC setting that disables the feature (for easy testing) */
 	if (!optimize_bounded_sort)
 		return;
@@ -2309,6 +2318,12 @@ useselection(Tuplesortstate *state)
 	double		crossover;
 	bool		useSelection;
 
+#ifdef DEBUG_SORT
+	/* Honor GUC setting that disables the feature (for easy testing) */
+	if (!optimize_avoid_selection)
+		return true;
+#endif
+
 	/* For randomAccess callers, "quicksort with spillover" is never used */
 	if (state->randomAccess)
 		return false;
@@ -2523,6 +2538,12 @@ selectnewtape(Tuplesortstate *state)
 static void
 mergeruns(Tuplesortstate *state)
 {
+#ifdef TRACE_SORT
+	bool		multiwarned = !(multipass_warning || trace_sort);
+#else
+	bool		multiwarned = !multipass_warning;
+#endif
+
 	int			tapenum,
 				svTape,
 				svRuns,
@@ -2626,6 +2647,21 @@ mergeruns(Tuplesortstate *state)
 		/* Step D6: decrease level */
 		if (--state->Level == 0)
 			break;
+
+		if (!multiwarned)
+		{
+			int64		memNowUsed = state->allowedMem - state->availMem;
+
+			ereport(LOG,
+					(errmsg("a multi-pass external merge sort is required "
+							"(%ld kB memory used. %d tape maximum. %d levels)",
+							memNowUsed / 1024L, state->maxTapes, state->Level + 1),
+					 errhint("Consider increasing the configuration parameter \"%s\".",
+							 state->querySort ?  "work_mem" : "maintenance_work_mem")));
+
+			multiwarned = true;
+		}
+
 		/* rewind output tape T to use as new input */
 		LogicalTapeRewind(state->tapeset, state->tp_tapenum[state->tapeRange],
 						  false);
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index dc167f9..1e1519a 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -272,6 +272,8 @@ extern int	tcp_keepalives_count;
 extern bool trace_sort;
 #endif
 
+extern bool multipass_warning;
+
 /*
  * Functions exported by guc.c
  */
-- 
1.9.1

From f2486568558d4c2cd3ee59af024c3f450d6ba0fa Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <peter.geoghega...@gmail.com>
Date: Thu, 13 Aug 2015 14:32:32 -0700
Subject: [PATCH 2/5] Further diminish role of replacement selection

Tuplesort callers now provide a total row estimate hint, typically the
optimizer's own estimate.  This is used to determine if replacement
selection will be viable even for the first run.  Testing shows that the
major benefit of replacement selection is only that it may enable a
"quicksort with spillover", which is the sole remaining justification
for going with replacement selection for the first run.  Even the cases
traditionally considered very sympathetic to replacement selection (e.g.
almost sorted input) do not appear to come out ahead on contemporary
hardware, so callers do not provide a physical/logical correlation
hint.  There is surprisingly little reason to try replacement selection
in the event of a strong correlation.

Some of the best cases for a simple hybrid sort-merge strategy can only
be seen when replacement selection isn't even attempted before being
abandoned;  replacement selection's tendency to produce longer runs is a
liability here rather than a benefit.  This change significantly reduces
the frequency that replacement selection will even be attempted
(previously, it was always at least used for the first run).
---
 src/backend/access/hash/hash.c         |   2 +-
 src/backend/access/hash/hashsort.c     |   4 +-
 src/backend/access/nbtree/nbtree.c     |  11 +-
 src/backend/access/nbtree/nbtsort.c    |  10 +-
 src/backend/catalog/index.c            |   1 +
 src/backend/commands/cluster.c         |   4 +-
 src/backend/executor/nodeAgg.c         |  26 ++++-
 src/backend/executor/nodeSort.c        |   1 +
 src/backend/utils/adt/orderedsetaggs.c |  13 ++-
 src/backend/utils/sort/tuplesort.c     | 182 +++++++++++++++++++++++++--------
 src/include/access/hash.h              |   3 +-
 src/include/access/nbtree.h            |   2 +-
 src/include/executor/nodeAgg.h         |   2 +
 src/include/utils/tuplesort.h          |  15 ++-
 14 files changed, 214 insertions(+), 62 deletions(-)

diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 24b06a5..8f71980 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -86,7 +86,7 @@ hashbuild(PG_FUNCTION_ARGS)
 	 * one page.
 	 */
 	if (num_buckets >= (uint32) NBuffers)
-		buildstate.spool = _h_spoolinit(heap, index, num_buckets);
+		buildstate.spool = _h_spoolinit(heap, index, num_buckets, reltuples);
 	else
 		buildstate.spool = NULL;
 
diff --git a/src/backend/access/hash/hashsort.c b/src/backend/access/hash/hashsort.c
index c67c057..5c7e137 100644
--- a/src/backend/access/hash/hashsort.c
+++ b/src/backend/access/hash/hashsort.c
@@ -44,7 +44,8 @@ struct HSpool
  * create and initialize a spool structure
  */
 HSpool *
-_h_spoolinit(Relation heap, Relation index, uint32 num_buckets)
+_h_spoolinit(Relation heap, Relation index, uint32 num_buckets,
+			 double reltuples)
 {
 	HSpool	   *hspool = (HSpool *) palloc0(sizeof(HSpool));
 	uint32		hash_mask;
@@ -71,6 +72,7 @@ _h_spoolinit(Relation heap, Relation index, uint32 num_buckets)
 												   index,
 												   hash_mask,
 												   maintenance_work_mem,
+												   reltuples,
 												   false);
 
 	return hspool;
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index cf4a6dc..0957e0f 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -23,6 +23,7 @@
 #include "access/xlog.h"
 #include "catalog/index.h"
 #include "commands/vacuum.h"
+#include "optimizer/plancat.h"
 #include "storage/indexfsm.h"
 #include "storage/ipc.h"
 #include "storage/lmgr.h"
@@ -85,7 +86,9 @@ btbuild(PG_FUNCTION_ARGS)
 	Relation	index = (Relation) PG_GETARG_POINTER(1);
 	IndexInfo  *indexInfo = (IndexInfo *) PG_GETARG_POINTER(2);
 	IndexBuildResult *result;
+	BlockNumber relpages;
 	double		reltuples;
+	double		allvisfrac;
 	BTBuildState buildstate;
 
 	buildstate.isUnique = indexInfo->ii_Unique;
@@ -100,6 +103,9 @@ btbuild(PG_FUNCTION_ARGS)
 		ResetUsage();
 #endif   /* BTREE_BUILD_STATS */
 
+	/* Estimate the number of rows currently present in the table */
+	estimate_rel_size(heap, NULL, &relpages, &reltuples, &allvisfrac);
+
 	/*
 	 * We expect to be called exactly once for any index relation. If that's
 	 * not the case, big trouble's what we have.
@@ -108,14 +114,15 @@ btbuild(PG_FUNCTION_ARGS)
 		elog(ERROR, "index \"%s\" already contains data",
 			 RelationGetRelationName(index));
 
-	buildstate.spool = _bt_spoolinit(heap, index, indexInfo->ii_Unique, false);
+	buildstate.spool = _bt_spoolinit(heap, index, indexInfo->ii_Unique, false,
+									 reltuples);
 
 	/*
 	 * If building a unique index, put dead tuples in a second spool to keep
 	 * them out of the uniqueness check.
 	 */
 	if (indexInfo->ii_Unique)
-		buildstate.spool2 = _bt_spoolinit(heap, index, false, true);
+		buildstate.spool2 = _bt_spoolinit(heap, index, false, true, reltuples);
 
 	/* do the heap scan */
 	reltuples = IndexBuildHeapScan(heap, index, indexInfo, true,
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index f95f67a..0d4a5ea 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -149,7 +149,8 @@ static void _bt_load(BTWriteState *wstate,
  * create and initialize a spool structure
  */
 BTSpool *
-_bt_spoolinit(Relation heap, Relation index, bool isunique, bool isdead)
+_bt_spoolinit(Relation heap, Relation index, bool isunique, bool isdead,
+			  double reltuples)
 {
 	BTSpool    *btspool = (BTSpool *) palloc0(sizeof(BTSpool));
 	int			btKbytes;
@@ -165,10 +166,15 @@ _bt_spoolinit(Relation heap, Relation index, bool isunique, bool isdead)
 	 * unique index actually requires two BTSpool objects.  We expect that the
 	 * second one (for dead tuples) won't get very full, so we give it only
 	 * work_mem.
+	 *
+	 * reltuples hint does not account for factors like whether or not this is
+	 * a partial index, or if this is second BTSpool object, because it seems
+	 * more conservative to estimate high.
 	 */
 	btKbytes = isdead ? work_mem : maintenance_work_mem;
 	btspool->sortstate = tuplesort_begin_index_btree(heap, index, isunique,
-													 btKbytes, false);
+													 btKbytes, reltuples,
+													 false);
 
 	return btspool;
 }
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index e59b163..88ee81d 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -2835,6 +2835,7 @@ validate_index(Oid heapId, Oid indexId, Snapshot snapshot)
 	state.tuplesort = tuplesort_begin_datum(TIDOID, TIDLessOperator,
 											InvalidOid, false,
 											maintenance_work_mem,
+											ivinfo.num_heap_tuples,
 											false);
 	state.htups = state.itups = state.tups_inserted = 0;
 
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index 7ab4874..23f6459 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -891,7 +891,9 @@ copy_heap_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose,
 	/* Set up sorting if wanted */
 	if (use_sort)
 		tuplesort = tuplesort_begin_cluster(oldTupDesc, OldIndex,
-											maintenance_work_mem, false);
+											maintenance_work_mem,
+											OldHeap->rd_rel->reltuples,
+											false);
 	else
 		tuplesort = NULL;
 
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 2e36855..f580cca 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -520,6 +520,7 @@ initialize_phase(AggState *aggstate, int newphase)
 												  sortnode->collations,
 												  sortnode->nullsFirst,
 												  work_mem,
+												  sortnode->plan.plan_rows,
 												  false);
 	}
 
@@ -588,7 +589,8 @@ initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans,
 									  pertrans->sortOperators[0],
 									  pertrans->sortCollations[0],
 									  pertrans->sortNullsFirst[0],
-									  work_mem, false);
+									  work_mem, agg_input_rows(aggstate),
+									  false);
 		else
 			pertrans->sortstates[aggstate->current_set] =
 				tuplesort_begin_heap(pertrans->evaldesc,
@@ -597,7 +599,8 @@ initialize_aggregate(AggState *aggstate, AggStatePerTrans pertrans,
 									 pertrans->sortOperators,
 									 pertrans->sortCollations,
 									 pertrans->sortNullsFirst,
-									 work_mem, false);
+									 work_mem, agg_input_rows(aggstate),
+									 false);
 	}
 
 	/*
@@ -1439,6 +1442,25 @@ find_hash_columns(AggState *aggstate)
 }
 
 /*
+ * Estimate the number of rows input to the sorter.
+ *
+ * Exported for use by ordered-set aggregates.
+ */
+double
+agg_input_rows(AggState *aggstate)
+{
+	Plan	   *outerNode;
+
+	/*
+	 * Get information about the size of the relation to be sorted (it's the
+	 * "outer" subtree of this node)
+	 */
+	outerNode = outerPlanState(aggstate)->plan;
+
+	return outerNode->plan_rows;
+}
+
+/*
  * Estimate per-hash-table-entry overhead for the planner.
  *
  * Note that the estimate does not include space for pass-by-reference
diff --git a/src/backend/executor/nodeSort.c b/src/backend/executor/nodeSort.c
index af1dccf..e4b1104 100644
--- a/src/backend/executor/nodeSort.c
+++ b/src/backend/executor/nodeSort.c
@@ -89,6 +89,7 @@ ExecSort(SortState *node)
 											  plannode->collations,
 											  plannode->nullsFirst,
 											  work_mem,
+											  plannode->plan.plan_rows,
 											  node->randomAccess);
 		if (node->bounded)
 			tuplesort_set_bound(tuplesortstate, node->bound);
diff --git a/src/backend/utils/adt/orderedsetaggs.c b/src/backend/utils/adt/orderedsetaggs.c
index 39ed85b..b51a945 100644
--- a/src/backend/utils/adt/orderedsetaggs.c
+++ b/src/backend/utils/adt/orderedsetaggs.c
@@ -20,6 +20,7 @@
 #include "catalog/pg_operator.h"
 #include "catalog/pg_type.h"
 #include "executor/executor.h"
+#include "executor/nodeAgg.h"
 #include "miscadmin.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/tlist.h"
@@ -103,6 +104,7 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
 {
 	OSAPerGroupState *osastate;
 	OSAPerQueryState *qstate;
+	AggState		 *aggstate;
 	MemoryContext gcontext;
 	MemoryContext oldcontext;
 
@@ -117,8 +119,11 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
 	/*
 	 * We keep a link to the per-query state in fn_extra; if it's not there,
 	 * create it, and do the per-query setup we need.
+	 *
+	 * aggstate is used to get a total row count hint for tuplesort.
 	 */
 	qstate = (OSAPerQueryState *) fcinfo->flinfo->fn_extra;
+	aggstate = (AggState *) fcinfo->context;
 	if (qstate == NULL)
 	{
 		Aggref	   *aggref;
@@ -276,13 +281,17 @@ ordered_set_startup(FunctionCallInfo fcinfo, bool use_tuples)
 												   qstate->sortOperators,
 												   qstate->sortCollations,
 												   qstate->sortNullsFirsts,
-												   work_mem, false);
+												   work_mem,
+												   agg_input_rows(aggstate),
+												   false);
 	else
 		osastate->sortstate = tuplesort_begin_datum(qstate->sortColType,
 													qstate->sortOperator,
 													qstate->sortCollation,
 													qstate->sortNullsFirst,
-													work_mem, false);
+													work_mem,
+													agg_input_rows(aggstate),
+													false);
 
 	osastate->number_of_rows = 0;
 
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index fc4ac90..6d766d2 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -13,11 +13,13 @@
  * See Knuth, volume 3, for more than you want to know about the external
  * sorting algorithm.  We divide the input into sorted runs using replacement
  * selection, in the form of a priority tree implemented as a heap
- * (essentially his Algorithm 5.2.3H -- although that strategy can be
- * abandoned where it does not appear to help), then merge the runs using
- * polyphase merge, Knuth's Algorithm 5.4.2D.  The logical "tapes" used by
- * Algorithm D are implemented by logtape.c, which avoids space wastage by
- * recycling disk space as soon as each block is read from its "tape".
+ * (essentially his Algorithm 5.2.3H -- although that strategy is often
+ * avoided altogether), then merge the runs using polyphase merge, Knuth's
+ * Algorithm 5.4.2D.  The logical "tapes" used by Algorithm D are
+ * implemented by logtape.c, which avoids space wastage by recycling disk
+ * space as soon as each block is read from its "tape".  Note that a hybrid
+ * sort-merge strategy is usually used in practice, because maintaining a
+ * priority tree/heap is expensive.
  *
  * We do not form the initial runs using Knuth's recommended replacement
  * selection data structure (Algorithm 5.4.1R), because it uses a fixed
@@ -108,10 +110,13 @@
  * If, having maintained a replacement selection priority queue (heap) for
  * the first run it transpires that there will be multiple on-tape runs
  * anyway, we abandon treating memtuples as a heap, and quicksort and write
- * in memtuples-sized batches.  This gives us most of the advantages of
- * always quicksorting and batch dumping runs, which can perform much better
- * than heap sorting and incrementally spilling tuples, without giving up on
- * replacement selection in cases where it remains compelling.
+ * in memtuples-sized batches.  This allows a "quicksort with spillover" to
+ * occur, but that remains about the only truly compelling case for
+ * replacement selection.  Callers provide a hint of the total number of
+ * rows, used to avoid replacement selection when a "quicksort with
+ * spillover" is not anticipated -- see useselection().  A hybrid sort-merge
+ * strategy can be much faster for very large inputs when replacement
+ * selection is never attempted.
  *
  *
  * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
@@ -245,6 +250,7 @@ struct Tuplesortstate
 {
 	TupSortStatus status;		/* enumerated value as shown above */
 	int			nKeys;			/* number of columns in sort key */
+	double		rowNumHint;		/* caller's hint of total # of rows */
 	bool		randomAccess;	/* did caller request random access? */
 	bool		bounded;		/* did caller specify a maximum number of
 								 * tuples to return? */
@@ -313,7 +319,9 @@ struct Tuplesortstate
 	/*
 	 * While building initial runs, this indicates if the replacement
 	 * selection strategy or simple hybrid sort-merge strategy is in use.
-	 * Replacement selection is abandoned after first run.
+	 * Replacement selection may be determined not to be effective ahead of
+	 * time, based on a caller-supplied hint.  Otherwise, it is abandoned
+	 * after the first run.
 	 */
 	bool		replaceActive;
 
@@ -505,9 +513,11 @@ struct Tuplesortstate
 	} while(0)
 
 
-static Tuplesortstate *tuplesort_begin_common(int workMem, bool randomAccess);
+static Tuplesortstate *tuplesort_begin_common(int workMem, double rowNumHint,
+									  bool randomAccess);
 static void puttuple_common(Tuplesortstate *state, SortTuple *tuple);
 static bool consider_abort_common(Tuplesortstate *state);
+static bool useselection(Tuplesortstate *state);
 static void inittapes(Tuplesortstate *state);
 static void selectnewtape(Tuplesortstate *state);
 static void mergeruns(Tuplesortstate *state);
@@ -584,12 +594,14 @@ static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
  * Each variant of tuplesort_begin has a workMem parameter specifying the
  * maximum number of kilobytes of RAM to use before spilling data to disk.
  * (The normal value of this parameter is work_mem, but some callers use
- * other values.)  Each variant also has a randomAccess parameter specifying
- * whether the caller needs non-sequential access to the sort result.
+ * other values.)  Each variant also has a hint parameter giving the total
+ * number of rows to be sorted, and a randomAccess parameter specifying
+ * whether the caller needs non-sequential access to the sort result.  Since
+ * rowNumHint is just a hint, it's acceptable for it to be zero or negative.
  */
 
 static Tuplesortstate *
-tuplesort_begin_common(int workMem, bool randomAccess)
+tuplesort_begin_common(int workMem, double rowNumHint, bool randomAccess)
 {
 	Tuplesortstate *state;
 	MemoryContext sortcontext;
@@ -619,6 +631,7 @@ tuplesort_begin_common(int workMem, bool randomAccess)
 #endif
 
 	state->status = TSS_INITIAL;
+	state->rowNumHint = rowNumHint;
 	state->randomAccess = randomAccess;
 	state->bounded = false;
 	state->boundUsed = false;
@@ -664,9 +677,11 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 					 int nkeys, AttrNumber *attNums,
 					 Oid *sortOperators, Oid *sortCollations,
 					 bool *nullsFirstFlags,
-					 int workMem, bool randomAccess)
+					 int workMem, double rowNumHint,
+					 bool randomAccess)
 {
-	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
+	Tuplesortstate *state = tuplesort_begin_common(workMem, rowNumHint,
+												   randomAccess);
 	MemoryContext oldcontext;
 	int			i;
 
@@ -734,9 +749,11 @@ tuplesort_begin_heap(TupleDesc tupDesc,
 Tuplesortstate *
 tuplesort_begin_cluster(TupleDesc tupDesc,
 						Relation indexRel,
-						int workMem, bool randomAccess)
+						int workMem,
+						double rowNumHint, bool randomAccess)
 {
-	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
+	Tuplesortstate *state = tuplesort_begin_common(workMem, rowNumHint,
+												   randomAccess);
 	ScanKey		indexScanKey;
 	MemoryContext oldcontext;
 	int			i;
@@ -827,9 +844,11 @@ Tuplesortstate *
 tuplesort_begin_index_btree(Relation heapRel,
 							Relation indexRel,
 							bool enforceUnique,
-							int workMem, bool randomAccess)
+							int workMem,
+							double rowNumHint, bool randomAccess)
 {
-	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
+	Tuplesortstate *state = tuplesort_begin_common(workMem, rowNumHint,
+												   randomAccess);
 	ScanKey		indexScanKey;
 	MemoryContext oldcontext;
 	int			i;
@@ -902,9 +921,11 @@ Tuplesortstate *
 tuplesort_begin_index_hash(Relation heapRel,
 						   Relation indexRel,
 						   uint32 hash_mask,
-						   int workMem, bool randomAccess)
+						   int workMem,
+						   double rowNumHint, bool randomAccess)
 {
-	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
+	Tuplesortstate *state = tuplesort_begin_common(workMem, rowNumHint,
+												   randomAccess);
 	MemoryContext oldcontext;
 
 	oldcontext = MemoryContextSwitchTo(state->sortcontext);
@@ -937,9 +958,10 @@ tuplesort_begin_index_hash(Relation heapRel,
 Tuplesortstate *
 tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
 					  bool nullsFirstFlag,
-					  int workMem, bool randomAccess)
+					  int workMem, double rowNumHint, bool randomAccess)
 {
-	Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
+	Tuplesortstate *state = tuplesort_begin_common(workMem, rowNumHint,
+												   randomAccess);
 	MemoryContext oldcontext;
 	int16		typlen;
 	bool		typbyval;
@@ -2270,6 +2292,73 @@ tuplesort_merge_order(int64 allowedMem)
 }
 
 /*
+ * useselection - determine whether replacement selection should be used
+ * for the first run.
+ *
+ * This is called when we have just run out of memory, and must weigh the
+ * costs and benefits of replacement selection for the first run, which can
+ * result in a "quicksort with spillover".  Note that replacement selection
+ * is always abandoned after the first run.
+ */
+static bool
+useselection(Tuplesortstate *state)
+{
+	int64		memNowUsed = state->allowedMem - state->availMem;
+	double		avgTupleSize;
+	int			increments;
+	double		crossover;
+	bool		useSelection;
+
+	/* For randomAccess callers, "quicksort with spillover" is never used */
+	if (state->randomAccess)
+		return false;
+
+	/*
+	 * The crossover point falls somewhere between memtuples holding 40% and
+	 * all-but-one of the total tuples to sort.  This weighs the approximate
+	 * saving in I/O against the generic cost of heap sorting.
+	 */
+	avgTupleSize = (double) memNowUsed / (double) state->memtupsize;
+
+	/*
+	 * Starting from a threshold of 90%, refund 7.5% per 32-byte
+	 * average-size increment.
+	 */
+	increments = MAXALIGN_DOWN((int) avgTupleSize) / 32;
+	crossover = 0.90 - (increments * 0.075);
+
+	/*
+	 * Clamp, making either outcome possible regardless of average size.
+	 *
+	 * 40% is about the minimum point at which "quicksort with spillover"
+	 * can still occur without a logical/physical correlation.
+	 */
+	crossover = Max(0.40, Min(crossover, 0.85));
+
+	/*
+	 * The point where the overhead of maintaining the heap invariant is
+	 * likely to dominate over any saving in I/O is somewhat arbitrarily
+	 * assumed to be the point where memtuples' size exceeds MaxAllocSize
+	 * (note that overall memory consumption may be far greater).  Past this
+	 * point, only the most compelling cases use replacement selection for
+	 * their first run.
+	 */
+	if (sizeof(SortTuple) * state->memtupcount > MaxAllocSize)
+		crossover = avgTupleSize > 32 ? 0.90 : 0.95;
+
+	useSelection = state->memtupcount > state->rowNumHint * crossover;
+
+#ifdef TRACE_SORT
+	if (trace_sort)
+		elog(LOG, "%s in use from row %d with %.2f total rows %.3f crossover",
+			 useSelection ? "replacement selection" : "hybrid sort-merge",
+			 state->memtupcount, state->rowNumHint, crossover);
+#endif
+
+	return useSelection;
+}
+
+/*
  * inittapes - initialize for tape sorting.
  *
  * This is called only if we have found we don't have room to sort in memory.
@@ -2278,7 +2367,6 @@ static void
 inittapes(Tuplesortstate *state)
 {
 	int			maxTapes,
-				ntuples,
 				j;
 	int64		tapeSpace;
 
@@ -2337,32 +2425,38 @@ inittapes(Tuplesortstate *state)
 	state->tp_tapenum = (int *) palloc0(maxTapes * sizeof(int));
 
 	/*
-	 * Give replacement selection a try.  There will be a switch to a simple
-	 * hybrid sort-merge strategy after the first run (iff there is to be a
-	 * second on-tape run).
+	 * Give replacement selection a try when the number of tuples to be
+	 * sorted has a reasonable chance of enabling a "quicksort with spillover".
+	 * There will be a switch to a simple hybrid sort-merge strategy after
+	 * the first run (iff there is to be a second on-tape run).
 	 */
-	state->replaceActive = true;
+	state->replaceActive = useselection(state);
 	state->cached = false;
 	state->just_memtuples = false;
 
-	/*
-	 * Convert the unsorted contents of memtuples[] into a heap. Each tuple is
-	 * marked as belonging to run number zero.
-	 *
-	 * NOTE: we pass false for checkIndex since there's no point in comparing
-	 * indexes in this step, even though we do intend the indexes to be part
-	 * of the sort key...
-	 */
-	ntuples = state->memtupcount;
-	state->memtupcount = 0;		/* make the heap empty */
-	for (j = 0; j < ntuples; j++)
+	if (state->replaceActive)
 	{
-		/* Must copy source tuple to avoid possible overwrite */
-		SortTuple	stup = state->memtuples[j];
+		/*
+		 * Convert the unsorted contents of memtuples[] into a heap. Each
+		 * tuple is marked as belonging to run number zero.
+		 *
+		 * NOTE: we pass false for checkIndex since there's no point in
+		 * comparing indexes in this step, even though we do intend the
+		 * indexes to be part of the sort key...
+		 */
+		int			ntuples = state->memtupcount;
 
-		tuplesort_heap_insert(state, &stup, 0, false);
+		state->memtupcount = 0;		/* make the heap empty */
+
+		for (j = 0; j < ntuples; j++)
+		{
+			/* Must copy source tuple to avoid possible overwrite */
+			SortTuple	stup = state->memtuples[j];
+
+			tuplesort_heap_insert(state, &stup, 0, false);
+		}
+		Assert(state->memtupcount == ntuples);
 	}
-	Assert(state->memtupcount == ntuples);
 
 	state->currentRun = 0;
 
diff --git a/src/include/access/hash.h b/src/include/access/hash.h
index 97cb859..95acc1d 100644
--- a/src/include/access/hash.h
+++ b/src/include/access/hash.h
@@ -335,7 +335,8 @@ extern bool _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir);
 /* hashsort.c */
 typedef struct HSpool HSpool;	/* opaque struct in hashsort.c */
 
-extern HSpool *_h_spoolinit(Relation heap, Relation index, uint32 num_buckets);
+extern HSpool *_h_spoolinit(Relation heap, Relation index, uint32 num_buckets,
+				 double reltuples);
 extern void _h_spooldestroy(HSpool *hspool);
 extern void _h_spool(HSpool *hspool, ItemPointer self,
 		 Datum *values, bool *isnull);
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 9e48efd..5504b7b 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -743,7 +743,7 @@ extern void BTreeShmemInit(void);
 typedef struct BTSpool BTSpool; /* opaque type known only within nbtsort.c */
 
 extern BTSpool *_bt_spoolinit(Relation heap, Relation index,
-			  bool isunique, bool isdead);
+			  bool isunique, bool isdead, double reltuples);
 extern void _bt_spooldestroy(BTSpool *btspool);
 extern void _bt_spool(BTSpool *btspool, ItemPointer self,
 		  Datum *values, bool *isnull);
diff --git a/src/include/executor/nodeAgg.h b/src/include/executor/nodeAgg.h
index fe3b81a..e6144f2 100644
--- a/src/include/executor/nodeAgg.h
+++ b/src/include/executor/nodeAgg.h
@@ -21,6 +21,8 @@ extern TupleTableSlot *ExecAgg(AggState *node);
 extern void ExecEndAgg(AggState *node);
 extern void ExecReScanAgg(AggState *node);
 
+extern double agg_input_rows(AggState *aggstate);
+
 extern Size hash_agg_entry_size(int numAggs);
 
 extern Datum aggregate_dummy(PG_FUNCTION_ARGS);
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index 3679815..11a5fb7 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -62,22 +62,27 @@ extern Tuplesortstate *tuplesort_begin_heap(TupleDesc tupDesc,
 					 int nkeys, AttrNumber *attNums,
 					 Oid *sortOperators, Oid *sortCollations,
 					 bool *nullsFirstFlags,
-					 int workMem, bool randomAccess);
+					 int workMem,
+					 double rowNumHint, bool randomAccess);
 extern Tuplesortstate *tuplesort_begin_cluster(TupleDesc tupDesc,
 						Relation indexRel,
-						int workMem, bool randomAccess);
+						int workMem,
+						double rowNumHint, bool randomAccess);
 extern Tuplesortstate *tuplesort_begin_index_btree(Relation heapRel,
 							Relation indexRel,
 							bool enforceUnique,
-							int workMem, bool randomAccess);
+							int workMem,
+							double rowNumHint, bool randomAccess);
 extern Tuplesortstate *tuplesort_begin_index_hash(Relation heapRel,
 						   Relation indexRel,
 						   uint32 hash_mask,
-						   int workMem, bool randomAccess);
+						   int workMem,
+						   double rowNumHint, bool randomAccess);
 extern Tuplesortstate *tuplesort_begin_datum(Oid datumType,
 					  Oid sortOperator, Oid sortCollation,
 					  bool nullsFirstFlag,
-					  int workMem, bool randomAccess);
+					  int workMem,
+					  double rowNumHint, bool randomAccess);
 
 extern void tuplesort_set_bound(Tuplesortstate *state, int64 bound);
 
-- 
1.9.1
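
To make the crossover heuristic in useselection() (in the patch above) more
concrete, here is a minimal standalone sketch of the arithmetic -- not part
of the patch -- using made-up numbers and simplified stand-ins for the
MAXALIGN_DOWN/Max/Min macros (8-byte MAXALIGN assumed; the MaxAllocSize
special case is left out):

#include <stdio.h>

#define MAXALIGN_DOWN(LEN)	((LEN) & ~((long) 7))	/* assumes 8-byte MAXALIGN */
#define Max(x, y)			((x) > (y) ? (x) : (y))
#define Min(x, y)			((x) < (y) ? (x) : (y))

int
main(void)
{
	double		avgTupleSize = 100.0;	/* hypothetical average tuple size */
	int			memtupcount = 1000000;	/* tuples in memory when work_mem filled */
	double		rowNumHint = 1800000;	/* caller's estimate of total input rows */
	int			increments;
	double		crossover;

	/* Start from a 90% threshold, refund 7.5% per 32-byte average-size increment */
	increments = MAXALIGN_DOWN((int) avgTupleSize) / 32;	/* 96 / 32 = 3 */
	crossover = 0.90 - (increments * 0.075);				/* 0.675 */

	/* Clamp to the 40%..85% range */
	crossover = Max(0.40, Min(crossover, 0.85));

	/* 1000000 <= 1800000 * 0.675, so hybrid sort-merge is chosen here */
	printf("crossover = %.3f, use replacement selection: %s\n",
		   crossover,
		   memtupcount > rowNumHint * crossover ? "yes" : "no");
	return 0;
}

In other words, with 100 byte tuples replacement selection is only attempted
when more than about two thirds of the estimated input already fits in
work_mem; with these made-up numbers it does not, and the hybrid sort-merge
strategy is used.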

From 1f5dd12fa0bf632598ca1e7e890a7ee581af9a9b Mon Sep 17 00:00:00 2001
From: Peter Geoghegan <peter.geoghega...@gmail.com>
Date: Wed, 29 Jul 2015 15:38:12 -0700
Subject: [PATCH 1/5] Quicksort when performing external sorts

Add "quicksort with spillover".  This allows an external sort that is
within about 2x of work_mem to avoid writing out most tuples, and to
quicksort perhaps almost all tuples rather than performing a degenerate
heapsort.  Often, an external sort is now only marginally more expensive
than an internal sort, which is a significant improvement, and sort
performance is made much more predictable.

In addition, have tuplesort give up on replacement selection when it does
not appear promising.  Most of the benefit of replacement selection is
seen where only one (or at most two) runs are ultimately produced when a
simple hybrid sort-merge strategy would have produced more, or where
incremental spilling rather than spilling in batches allows a "quicksort
with spillover" to write out almost no tuples.  Producing only one run is
particularly helpful in the common case where the input is already in
mostly sorted order.

These cases are important, so replacement selection is still relevant.
However, since maintaining runs as a heap has been shown to interact
very badly with the caches of modern CPUs, it doesn't seem worth holding
on to replacement selection when a non-sympathetic case is encountered.
Therefore, when a second or subsequent run actually becomes necessary
(rather than merely appearing necessary at first, something a "quicksort
with spillover" is often able to safely disregard), the second and
subsequent runs are also quicksorted, but dumped in batch.  Testing has
shown this to be much faster in many realistic cases, although there is
no saving in I/O.

The replacement selection run-building heap is maintained inexpensively.
While the heap property is initially maintained, there is no need to
distinguish the tuples that belong to the second run (they are destined
to be quicksorted along with any first-run tuples still in memory --
effectively an all-in-memory merge), and so second-or-subsequent-run
tuples are simply appended to the end of memtuples during tuple copying,
which is cache friendly.  Once the last tuple belonging to the first
(on-tape) run has been dumped incrementally to the first tape, memtuples
ceases to be a heap, and although I/O can no longer be avoided entirely,
everything is still quicksorted (and dumped in batch).
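
For illustration only -- a hypothetical table and made-up numbers, with the
field layout taken from the explain.c change below -- the text-format
EXPLAIN ANALYZE line for such a sort might now read:

    Sort Method: quicksort with spillover  Disk: 5120kB  Rows In Memory: 180000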
---
 src/backend/commands/explain.c     |  13 +-
 src/backend/utils/sort/tuplesort.c | 544 +++++++++++++++++++++++++++++++++----
 src/include/utils/tuplesort.h      |   3 +-
 3 files changed, 496 insertions(+), 64 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 5d06fa4..94b1f77 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2178,20 +2178,27 @@ show_sort_info(SortState *sortstate, ExplainState *es)
 		const char *sortMethod;
 		const char *spaceType;
 		long		spaceUsed;
+		int			rowsSortedMem;
 
-		tuplesort_get_stats(state, &sortMethod, &spaceType, &spaceUsed);
+		tuplesort_get_stats(state, &sortMethod, &spaceType, &spaceUsed,
+							&rowsSortedMem);
 
 		if (es->format == EXPLAIN_FORMAT_TEXT)
 		{
 			appendStringInfoSpaces(es->str, es->indent * 2);
-			appendStringInfo(es->str, "Sort Method: %s  %s: %ldkB\n",
-							 sortMethod, spaceType, spaceUsed);
+			appendStringInfo(es->str,
+							 "Sort Method: %s  %s: %ldkB  Rows In Memory: %d\n",
+							 sortMethod,
+							 spaceType,
+							 spaceUsed,
+							 rowsSortedMem);
 		}
 		else
 		{
 			ExplainPropertyText("Sort Method", sortMethod, es);
 			ExplainPropertyLong("Sort Space Used", spaceUsed, es);
 			ExplainPropertyText("Sort Space Type", spaceType, es);
+			ExplainPropertyInteger("Rows In Memory", rowsSortedMem, es);
 		}
 	}
 }
diff --git a/src/backend/utils/sort/tuplesort.c b/src/backend/utils/sort/tuplesort.c
index d532e87..fc4ac90 100644
--- a/src/backend/utils/sort/tuplesort.c
+++ b/src/backend/utils/sort/tuplesort.c
@@ -13,10 +13,11 @@
  * See Knuth, volume 3, for more than you want to know about the external
  * sorting algorithm.  We divide the input into sorted runs using replacement
  * selection, in the form of a priority tree implemented as a heap
- * (essentially his Algorithm 5.2.3H), then merge the runs using polyphase
- * merge, Knuth's Algorithm 5.4.2D.  The logical "tapes" used by Algorithm D
- * are implemented by logtape.c, which avoids space wastage by recycling
- * disk space as soon as each block is read from its "tape".
+ * (essentially his Algorithm 5.2.3H -- although that strategy can be
+ * abandoned where it does not appear to help), then merge the runs using
+ * polyphase merge, Knuth's Algorithm 5.4.2D.  The logical "tapes" used by
+ * Algorithm D are implemented by logtape.c, which avoids space wastage by
+ * recycling disk space as soon as each block is read from its "tape".
  *
  * We do not form the initial runs using Knuth's recommended replacement
  * selection data structure (Algorithm 5.4.1R), because it uses a fixed
@@ -72,6 +73,15 @@
  * to one run per logical tape.  The final merge is then performed
  * on-the-fly as the caller repeatedly calls tuplesort_getXXX; this
  * saves one cycle of writing all the data out to disk and reading it in.
+ * Also, if only one run has spilled to tape when tuplesort_performsort()
+ * is reached, and if the caller does not require random access, then the
+ * merge step can take place between the still in-memory tuples and the
+ * tuples stored on tape (it does not matter that there may be a second run
+ * in the memtuples array -- only whether a second run has spilled to
+ * tape).  This ensures that spilling to disk only occurs for a number of
+ * tuples approximately equal to the number of tuples read in after
+ * work_mem was exhausted and it became apparent that an external sort is
+ * required.
  *
  * Before Postgres 8.2, we always used a seven-tape polyphase merge, on the
  * grounds that 7 is the "sweet spot" on the tapes-to-passes curve according
@@ -86,6 +96,23 @@
  * we preread from a tape, so as to maintain the locality of access described
  * above.  Nonetheless, with large workMem we can have many tapes.
  *
+ * Before Postgres 9.6, we always used a heap for replacement selection when
+ * building runs.  However, Knuth does not consider the influence of memory
+ * access on overall performance, which is a crucial consideration on modern
+ * machines;  replacement selection is only really of value where a single
+ * run or two runs can be produced, sometimes avoiding a merge step
+ * entirely.  Replacement selection makes this likely when tuples are read
+ * in approximately logical order, even if work_mem is only a small fraction
+ * of the requirement for an internal sort, but large main memory sizes
+ * don't benefit from tiny, incremental spills, even with enormous datasets.
+ * If, having maintained a replacement selection priority queue (heap) for
+ * the first run it transpires that there will be multiple on-tape runs
+ * anyway, we abandon treating memtuples as a heap, and quicksort and write
+ * in memtuples-sized batches.  This gives us most of the advantages of
+ * always quicksorting and batch dumping runs, which can perform much better
+ * than heap sorting and incrementally spilling tuples, without giving up on
+ * replacement selection in cases where it remains compelling.
+ *
  *
  * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
  * Portions Copyright (c) 1994, Regents of the University of California
@@ -160,7 +187,10 @@ bool		optimize_bounded_sort = true;
  * described above.  Accordingly, "tuple" is always used in preference to
  * datum1 as the authoritative value for pass-by-reference cases.
  *
- * While building initial runs, tupindex holds the tuple's run number.  During
+ * While building initial runs, tupindex holds the tuple's run number.
+ * Historically, the run number could meaningfully distinguish many runs, but
+ * currently it only meaningfully distinguishes the first run from any other
+ * run, since replacement selection is abandoned after the first run.  During
  * merge passes, we re-use it to hold the input tape number that each tuple in
  * the heap was read from, or to hold the index of the next tuple pre-read
  * from the same tape in the case of pre-read entries.  tupindex goes unused
@@ -186,6 +216,7 @@ typedef enum
 	TSS_BUILDRUNS,				/* Loading tuples; writing to tape */
 	TSS_SORTEDINMEM,			/* Sort completed entirely in memory */
 	TSS_SORTEDONTAPE,			/* Sort completed, final run is on tape */
+	TSS_MEMTAPEMERGE,			/* Performing memory/tape merge on-the-fly */
 	TSS_FINALMERGE				/* Performing final merge on-the-fly */
 } TupSortStatus;
 
@@ -280,6 +311,13 @@ struct Tuplesortstate
 	bool		growmemtuples;	/* memtuples' growth still underway? */
 
 	/*
+	 * While building initial runs, this indicates if the replacement
+	 * selection strategy or simple hybrid sort-merge strategy is in use.
+	 * Replacement selection is abandoned after first run.
+	 */
+	bool		replaceActive;
+
+	/*
 	 * While building initial runs, this is the current output run number
 	 * (starting at 0).  Afterwards, it is the number of initial runs we made.
 	 */
@@ -327,12 +365,22 @@ struct Tuplesortstate
 	int			activeTapes;	/* # of active input tapes in merge pass */
 
 	/*
+	 * These variables are used after tuplesort_performsort() for the
+	 * TSS_MEMTAPEMERGE case.  This is a special, optimized final on-the-fly
+	 * merge pass involving merging the result tape with memtuples that were
+	 * quicksorted (but never made it out to a tape).
+	 */
+	SortTuple	tape_cache;		/* cached tape tuple from prior call */
+	bool		cached;			/* tape_cache holds pending tape tuple */
+	bool		just_memtuples;	/* merge only fetching from memtuples */
+
+	/*
 	 * These variables are used after completion of sorting to keep track of
 	 * the next tuple to return.  (In the tape case, the tape's current read
 	 * position is also critical state.)
 	 */
 	int			result_tape;	/* actual tape number of finished output */
-	int			current;		/* array index (only used if SORTEDINMEM) */
+	int			current;		/* memtuples array index */
 	bool		eof_reached;	/* reached EOF (needed for cursors) */
 
 	/* markpos_xxx holds marked position for mark and restore */
@@ -464,12 +512,15 @@ static void inittapes(Tuplesortstate *state);
 static void selectnewtape(Tuplesortstate *state);
 static void mergeruns(Tuplesortstate *state);
 static void mergeonerun(Tuplesortstate *state);
+static void mergememruns(Tuplesortstate *state);
 static void beginmerge(Tuplesortstate *state);
 static void mergepreread(Tuplesortstate *state);
 static void mergeprereadone(Tuplesortstate *state, int srcTape);
 static void dumptuples(Tuplesortstate *state, bool alltuples);
+static void dumpbatch(Tuplesortstate *state, bool alltuples);
 static void make_bounded_heap(Tuplesortstate *state);
 static void sort_bounded_heap(Tuplesortstate *state);
+static void tuplesort_quicksort(Tuplesortstate *state);
 static void tuplesort_heap_insert(Tuplesortstate *state, SortTuple *tuple,
 					  int tupleindex, bool checkIndex);
 static void tuplesort_heap_siftup(Tuplesortstate *state, bool checkIndex);
@@ -1486,22 +1537,60 @@ puttuple_common(Tuplesortstate *state, SortTuple *tuple)
 
 			/*
 			 * Insert the tuple into the heap, with run number currentRun if
-			 * it can go into the current run, else run number currentRun+1.
-			 * The tuple can go into the current run if it is >= the first
-			 * not-yet-output tuple.  (Actually, it could go into the current
-			 * run if it is >= the most recently output tuple ... but that
-			 * would require keeping around the tuple we last output, and it's
-			 * simplest to let writetup free each tuple as soon as it's
-			 * written.)
+			 * it can go into the current run, else run number INT_MAX (some
+			 * later run).  The tuple can go into the current run if it is
+			 * >= the first not-yet-output tuple.  (Actually, it could go
+			 * into the current run if it is >= the most recently output
+			 * tuple ... but that would require keeping around the tuple we
+			 * last output, and it's simplest to let writetup free each
+			 * tuple as soon as it's written.)
 			 *
-			 * Note there will always be at least one tuple in the heap at
-			 * this point; see dumptuples.
+			 * Note that this only applies if the currentRun is 0 (prior to
+			 * giving up on heapification).  There is no meaningful
+			 * distinction between any two runs in memory except the first
+			 * and second run.  When the currentRun is not 0, there is no
+			 * guarantee that any tuples are already stored in memory here,
+			 * and if there are any they're in no significant order.
 			 */
-			Assert(state->memtupcount > 0);
-			if (COMPARETUP(state, tuple, &state->memtuples[0]) >= 0)
+			if (state->replaceActive &&
+				COMPARETUP(state, tuple, &state->memtuples[0]) >= 0)
+			{
+				/*
+				 * Unlike classic replacement selection, which this module was
+				 * previously based on, only run 0 is treated as a priority
+				 * queue through heapification.  The second run (run 1) is
+				 * appended indifferently below, and will never be trusted to
+				 * maintain the heap invariant beyond simply not getting in
+				 * the way of spilling run 0 incrementally.  In other words,
+				 * second run tuples may be sifted out of the way of first
+				 * run tuples; COMPARETUP() will never be called for run
+				 * 1 tuples.  However, not even HEAPCOMPARE() will be
+				 * called for a subsequent run's tuples.
+				 */
 				tuplesort_heap_insert(state, tuple, state->currentRun, true);
+			}
 			else
-				tuplesort_heap_insert(state, tuple, state->currentRun + 1, true);
+			{
+				/*
+				 * Note that unlike Knuth, we do not care about the second
+				 * run's tuples when loading runs.  After the first run is
+				 * complete, tuples will not be dumped incrementally at all,
+				 * but as long as the first run (run 0) is current it will
+				 * be maintained.  dumptuples does not trust that the second
+				 * or subsequent runs are heapified (beyond merely not
+				 * getting in the way of the first, fully heapified run,
+				 * which only matters for the second run, run 1).  Anything
+				 * past the first run will be quicksorted.
+				 *
+				 * Past the first run, there is no need to differentiate runs
+				 * in memory (only the first and second runs will ever be
+				 * usefully differentiated).  Use a generic INT_MAX run
+				 * number (just to be tidy).  There should always be room to
+				 * store the incoming tuple.
+				 */
+				tuple->tupindex = INT_MAX;
+				state->memtuples[state->memtupcount++] = *tuple;
+			}
 
 			/*
 			 * If we are over the memory limit, dump tuples till we're under.
@@ -1576,20 +1665,9 @@ tuplesort_performsort(Tuplesortstate *state)
 
 			/*
 			 * We were able to accumulate all the tuples within the allowed
-			 * amount of memory.  Just qsort 'em and we're done.
+			 * amount of memory.  Just quicksort 'em and we're done.
 			 */
-			if (state->memtupcount > 1)
-			{
-				/* Can we use the single-key sort function? */
-				if (state->onlyKey != NULL)
-					qsort_ssup(state->memtuples, state->memtupcount,
-							   state->onlyKey);
-				else
-					qsort_tuple(state->memtuples,
-								state->memtupcount,
-								state->comparetup,
-								state);
-			}
+			tuplesort_quicksort(state);
 			state->current = 0;
 			state->eof_reached = false;
 			state->markpos_offset = 0;
@@ -1616,12 +1694,26 @@ tuplesort_performsort(Tuplesortstate *state)
 
 			/*
 			 * Finish tape-based sort.  First, flush all tuples remaining in
-			 * memory out to tape; then merge until we have a single remaining
-			 * run (or, if !randomAccess, one run per tape). Note that
-			 * mergeruns sets the correct state->status.
+			 * memory out to tape where that's required (when more than one
+			 * run's tuples made it to tape, or when the caller required
+			 * random access).  Then, either merge until we have a single
+			 * remaining run on tape, or merge runs in memory by sorting
+			 * them into one single in-memory run.  Note that
+			 * mergeruns/mergememruns sets the correct state->status.
 			 */
-			dumptuples(state, true);
-			mergeruns(state);
+			if (state->currentRun > 0 || state->randomAccess)
+			{
+				dumptuples(state, true);
+				mergeruns(state);
+			}
+			else
+			{
+				/*
+				 * Only possible for !randomAccess callers, just as with
+				 * tape based on-the-fly merge
+				 */
+				mergememruns(state);
+			}
 			state->eof_reached = false;
 			state->markpos_block = 0L;
 			state->markpos_offset = 0;
@@ -1640,6 +1732,9 @@ tuplesort_performsort(Tuplesortstate *state)
 			elog(LOG, "performsort done (except %d-way final merge): %s",
 				 state->activeTapes,
 				 pg_rusage_show(&state->ru_start));
+		else if (state->status == TSS_MEMTAPEMERGE)
+			elog(LOG, "performsort done (except memory/tape final merge): %s",
+				 pg_rusage_show(&state->ru_start));
 		else
 			elog(LOG, "performsort done: %s",
 				 pg_rusage_show(&state->ru_start));
@@ -1791,6 +1886,118 @@ tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
 			READTUP(state, stup, state->result_tape, tuplen);
 			return true;
 
+		case TSS_MEMTAPEMERGE:
+			Assert(forward);
+			/* For now, assume tuple returned from memory */
+			*should_free = false;
+
+			/*
+			 * Should be at least one memtuple (work_mem should be roughly
+			 * fully consumed)
+			 */
+			Assert(state->memtupcount > 0);
+
+			if (state->eof_reached)
+				return false;
+
+			if (state->just_memtuples)
+				goto just_memtuples;
+
+			/*
+			 * Merge together quicksorted memtuples array, and sorted tape.
+			 *
+			 * When this optimization was initially applied, the array was
+			 * heapified.  Some number of tuples were spilled to disk from the
+			 * top of the heap irregularly, and are read from tape here in
+			 * fully sorted order.  memtuples originally contains 2 runs in
+			 * most cases, though, so we merge it with the on-tape run.
+			 * (Quicksorting effectively merged the 2 in-memory runs into one
+			 * in-memory run already)
+			 *
+			 * Exhaust the supply of tape tuples first.
+			 *
+			 * "stup" is always initially set to the current tape tuple if
+			 * any remain, which may be cached from previous call, or read
+			 * from tape when nothing cached.
+			 */
+			if (state->cached)
+				*stup = state->tape_cache;
+			else if ((tuplen = getlen(state, state->result_tape, true)) != 0)
+				READTUP(state, stup, state->result_tape, tuplen);
+			else
+			{
+				/* Supply of tape tuples was just exhausted */
+				state->just_memtuples = true;
+				goto just_memtuples;
+			}
+
+			/*
+			 * Kludge:  Trigger abbreviated tie-breaker if in-memory tuples
+			 * use abbreviation (writing tuples to tape never preserves
+			 * abbreviated keys).  Do this by assigning in-memory
+			 * abbreviated tuple to tape tuple directly.
+			 *
+			 * It doesn't seem worth generating a new abbreviated key for
+			 * the tape tuple, and this approach is simpler than
+			 * "unabbreviating" the memtuple tuple from a "common" routine
+			 * like this.
+			 *
+			 * In the future, this routine could offer an API that allows
+			 * certain clients (like ordered set aggregate callers) to
+			 * cheaply test *inequality* across adjacent pairs of sorted
+			 * tuples on the basis of simple abbreviated key binary
+			 * inequality.  Another advantage of this approach is that it
+			 * can still work without reporting to clients that abbreviation
+			 * wasn't used.  The tape tuples might only be a small minority
+			 * of all tuples returned.
+			 */
+			if (state->sortKeys != NULL && state->sortKeys->abbrev_converter != NULL)
+				stup->datum1 = state->memtuples[state->current].datum1;
+
+			/*
+			 * Compare current tape tuple to current memtuple.
+			 *
+			 * Since we always start with at least one memtuple, and since tape
+			 * tuples are always returned before equal memtuples, it follows
+			 * that there must be at least one memtuple left to return here.
+			 */
+			Assert(state->current < state->memtupcount);
+
+			if (COMPARETUP(state, stup, &state->memtuples[state->current]) <= 0)
+			{
+				/*
+				 * Tape tuple less than or equal to memtuple array current
+				 * position.  Return it.
+				 */
+				state->cached = false;
+				/* Caller can free tape tuple memory */
+				*should_free = true;
+			}
+			else
+			{
+				/*
+				 * Tape tuple greater than memtuple array's current tuple.
+				 *
+				 * Return current memtuple tuple, and cache tape tuple for
+				 * next call.  It will be returned on next or subsequent
+				 * call.
+				 */
+				state->tape_cache = *stup;
+				state->cached = true;
+				*stup = state->memtuples[state->current++];
+			}
+			return true;
+
+just_memtuples:
+			/* Just return memtuples -- merging done */
+			if (state->current < state->memtupcount)
+			{
+				*stup = state->memtuples[state->current++];
+				return true;
+			}
+			state->eof_reached = true;
+			return false;
+
 		case TSS_FINALMERGE:
 			Assert(forward);
 			*should_free = true;
@@ -2000,6 +2207,7 @@ tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples, bool forward)
 			return false;
 
 		case TSS_SORTEDONTAPE:
+		case TSS_MEMTAPEMERGE:
 		case TSS_FINALMERGE:
 
 			/*
@@ -2129,6 +2337,15 @@ inittapes(Tuplesortstate *state)
 	state->tp_tapenum = (int *) palloc0(maxTapes * sizeof(int));
 
 	/*
+	 * Give replacement selection a try.  There will be a switch to a simple
+	 * hybrid sort-merge strategy after the first run (iff there is to be a
+	 * second on-tape run).
+	 */
+	state->replaceActive = true;
+	state->cached = false;
+	state->just_memtuples = false;
+
+	/*
 	 * Convert the unsorted contents of memtuples[] into a heap. Each tuple is
 	 * marked as belonging to run number zero.
 	 *
@@ -2426,6 +2643,67 @@ mergeonerun(Tuplesortstate *state)
 }
 
 /*
+ * mergememruns -- merge runs in memory into a new in-memory run.
+ *
+ * This allows tuplesort to avoid dumping many tuples in the common case
+ * where the amount of memory required for an internal sort is less than
+ * about 2x work_mem ("quicksort with spillover").  This optimization does
+ * not appear in Knuth's algorithm.
+ *
+ * Merging here actually means quicksorting, without regard to the run
+ * number of each memtuple.  Note that this in-memory merge is distinct from
+ * the final on-the-fly merge step that follows.  This routine merges what
+ * was originally the second part of the first run with what was originally
+ * the entire second run in advance of the on-the-fly merge step (sometimes,
+ * there will only be one run in memory, but sorting is still required).
+ * The final on-the-fly merge occurs between the new all-in-memory run
+ * created by this routine, and what was originally the first part of the
+ * first run, which is already sorted on tape.
+ *
+ * The fact that the memtuples array has already been heapified (within the
+ * first run) is no reason to commit to the path of unnecessarily dumping
+ * and heapsorting input tuples.  Often, memtuples will be much larger than
+ * the final on-tape run, which is where this optimization is most
+ * effective.
+ */
+static void
+mergememruns(Tuplesortstate *state)
+{
+	Assert(state->replaceActive);
+	Assert(!state->randomAccess);
+
+	/*
+	 * It doesn't seem worth being more clever in the relatively rare case
+	 * where there was no second run pending (i.e. no tuple that didn't
+	 * belong to the original/first "currentRun") just to avoid an
+	 * on-the-fly merge step, although that might be possible with care.
+	 */
+	markrunend(state, state->currentRun);
+
+	/*
+	 * The usual path for quicksorting runs (quicksort just before dumping
+	 * all tuples) was avoided by caller, so quicksort to merge.
+	 *
+	 * Note that this may use abbreviated keys, which are no longer
+	 * available for the tuples that spilled to tape.  This is something
+	 * that the final on-the-fly merge step accounts for.
+	 */
+	tuplesort_quicksort(state);
+	state->current = 0;
+
+#ifdef TRACE_SORT
+	if (trace_sort)
+		elog(LOG, "finished quicksort of %d tuples to create single in-memory run: %s",
+			 state->memtupcount, pg_rusage_show(&state->ru_start));
+#endif
+
+	state->result_tape = state->tp_tapenum[state->destTape];
+	/* Must freeze and rewind the finished output tape */
+	LogicalTapeFreeze(state->tapeset, state->result_tape);
+	state->status = TSS_MEMTAPEMERGE;
+}
+
+/*
  * beginmerge - initialize for a merge pass
  *
  * We decrease the counts of real and dummy runs for each tape, and mark
@@ -2604,14 +2882,16 @@ mergeprereadone(Tuplesortstate *state, int srcTape)
 }
 
 /*
- * dumptuples - remove tuples from heap and write to tape
+ * dumptuples - remove tuples from memtuples and write to tape
  *
  * This is used during initial-run building, but not during merging.
  *
  * When alltuples = false, dump only enough tuples to get under the
- * availMem limit (and leave at least one tuple in the heap in any case,
- * since puttuple assumes it always has a tuple to compare to).  We also
- * insist there be at least one free slot in the memtuples[] array.
+ * availMem limit (and leave at least one tuple in memtuples where necessary,
+ * since puttuple sometimes assumes it is a heap that has a tuple to compare
+ * to, and the final on-the-fly in-memory merge requires one in-memory tuple at
+ * a minimum).  We also insist there be at least one free slot in the
+ * memtuples[] array.
  *
  * When alltuples = true, dump everything currently in memory.
  * (This case is only used at end of input data.)
@@ -2627,21 +2907,39 @@ dumptuples(Tuplesortstate *state, bool alltuples)
 		   (LACKMEM(state) && state->memtupcount > 1) ||
 		   state->memtupcount >= state->memtupsize)
 	{
+		if (state->replaceActive && !alltuples)
+		{
+			/*
+			 * Still holding out for a case favorable to replacement selection;
+			 * perhaps there will be a single-run in the event of almost sorted
+			 * input, or perhaps work_mem hasn't been exceeded by too much, and
+			 * a "quicksort with spillover" remains possible.
+			 *
+			 * Dump the heap's frontmost entry, and sift up to remove it from
+			 * the heap.
+			 */
+			Assert(state->memtupcount > 1);
+			WRITETUP(state, state->tp_tapenum[state->destTape],
+					 &state->memtuples[0]);
+			tuplesort_heap_siftup(state, true);
+		}
+		else
+		{
+			/*
+			 * Once committed to quicksorting runs, never incrementally
+			 * spill
+			 */
+			dumpbatch(state, alltuples);
+			break;
+		}
+
 		/*
-		 * Dump the heap's frontmost entry, and sift up to remove it from the
-		 * heap.
+		 * If top run number has changed, we've finished the current run
+		 * (this can only be the first run, run 0), and will no longer spill
+		 * incrementally.
 		 */
 		Assert(state->memtupcount > 0);
-		WRITETUP(state, state->tp_tapenum[state->destTape],
-				 &state->memtuples[0]);
-		tuplesort_heap_siftup(state, true);
-
-		/*
-		 * If the heap is empty *or* top run number has changed, we've
-		 * finished the current run.
-		 */
-		if (state->memtupcount == 0 ||
-			state->currentRun != state->memtuples[0].tupindex)
+		if (state->memtuples[0].tupindex != 0)
 		{
 			markrunend(state, state->tp_tapenum[state->destTape]);
 			state->currentRun++;
@@ -2650,24 +2948,87 @@ dumptuples(Tuplesortstate *state, bool alltuples)
 
 #ifdef TRACE_SORT
 			if (trace_sort)
-				elog(LOG, "finished writing%s run %d to tape %d: %s",
-					 (state->memtupcount == 0) ? " final" : "",
+				elog(LOG, "finished writing heapsorted run %d to tape %d: %s",
 					 state->currentRun, state->destTape,
 					 pg_rusage_show(&state->ru_start));
 #endif
 
 			/*
-			 * Done if heap is empty, else prepare for new run.
+			 * Heap cannot be empty, so prepare for new run and give up on
+			 * replacement selection.
 			 */
-			if (state->memtupcount == 0)
-				break;
-			Assert(state->currentRun == state->memtuples[0].tupindex);
 			selectnewtape(state);
+			/* All future runs will only use dumpbatch/quicksort */
+			state->replaceActive = false;
 		}
 	}
 }
 
 /*
+ * dumpbatch - sort and dump all memtuples, forming one run on tape
+ *
+ * Unlike classic replacement selection sort, second or subsequent runs are
+ * never heapified by this module (although heapification still respects run
+ * number differences between the first and second runs).  This helper
+ * handles the case where replacement selection is abandoned, and all tuples
+ * are quicksorted and dumped in memtuples-sized batches.  This alternative
+ * strategy is a simple hybrid sort-merge strategy, with quicksorting of
+ * memtuples-sized runs.
+ *
+ * In rare cases, this routine may add to an on-tape run already storing
+ * tuples.
+ */
+static void
+dumpbatch(Tuplesortstate *state, bool alltuples)
+{
+	int			memtupwrite;
+	int			i;
+
+	Assert(state->status == TSS_BUILDRUNS);
+
+	/* Final call might be unnecessary */
+	if (state->memtupcount == 0)
+	{
+		Assert(alltuples);
+		return;
+	}
+	tuplesort_quicksort(state);
+	state->currentRun++;
+
+#ifdef TRACE_SORT
+	if (trace_sort)
+		elog(LOG, "finished quicksorting run %d: %s",
+			 state->currentRun, pg_rusage_show(&state->ru_start));
+#endif
+
+	/*
+	 * This should be adapted to perform asynchronous I/O one day, as
+	 * dumping in batch represents a good opportunity to overlap I/O
+	 * and computation.
+	 */
+	memtupwrite = state->memtupcount;
+	for (i = 0; i < memtupwrite; i++)
+	{
+		WRITETUP(state, state->tp_tapenum[state->destTape],
+				 &state->memtuples[i]);
+		state->memtupcount--;
+	}
+	markrunend(state, state->tp_tapenum[state->destTape]);
+	state->tp_runs[state->destTape]++;
+	state->tp_dummy[state->destTape]--; /* per Alg D step D2 */
+
+#ifdef TRACE_SORT
+	if (trace_sort)
+		elog(LOG, "finished writing quicksorted run %d to tape %d: %s",
+			 state->currentRun, state->destTape,
+			 pg_rusage_show(&state->ru_start));
+#endif
+
+	if (!alltuples)
+		selectnewtape(state);
+}
+
+/*
  * tuplesort_rescan		- rewind and replay the scan
  */
 void
@@ -2777,7 +3138,8 @@ void
 tuplesort_get_stats(Tuplesortstate *state,
 					const char **sortMethod,
 					const char **spaceType,
-					long *spaceUsed)
+					long *spaceUsed,
+					int *rowsSortedMem)
 {
 	/*
 	 * Note: it might seem we should provide both memory and disk usage for a
@@ -2806,15 +3168,23 @@ tuplesort_get_stats(Tuplesortstate *state,
 				*sortMethod = "top-N heapsort";
 			else
 				*sortMethod = "quicksort";
+			*rowsSortedMem = state->memtupcount;
 			break;
 		case TSS_SORTEDONTAPE:
 			*sortMethod = "external sort";
+			*rowsSortedMem = 0;
+			break;
+		case TSS_MEMTAPEMERGE:
+			*sortMethod = "quicksort with spillover";
+			*rowsSortedMem = state->memtupcount;
 			break;
 		case TSS_FINALMERGE:
 			*sortMethod = "external merge";
+			*rowsSortedMem = 0;
 			break;
 		default:
 			*sortMethod = "still in progress";
+			*rowsSortedMem = -1;
 			break;
 	}
 }
@@ -2825,10 +3195,19 @@ tuplesort_get_stats(Tuplesortstate *state,
  *
  * Compare two SortTuples.  If checkIndex is true, use the tuple index
  * as the front of the sort key; otherwise, no.
+ *
+ * Note that for checkIndex callers, the heap invariant is never maintained
+ * beyond the first run, and so there are no COMPARETUP() calls beyond the
+ * first run.  It is assumed that checkIndex callers are maintaining the
+ * heap invariant for a replacement selection priority queue, but those
+ * callers do not go on to trust the heap to be fully-heapified past the
+ * first run.  Once currentRun isn't the first, memtuples is no longer a
+ * heap at all.
  */
 
 #define HEAPCOMPARE(tup1,tup2) \
-	(checkIndex && ((tup1)->tupindex != (tup2)->tupindex) ? \
+	(checkIndex && ((tup1)->tupindex != (tup2)->tupindex || \
+					(tup1)->tupindex != 0) ? \
 	 ((tup1)->tupindex) - ((tup2)->tupindex) : \
 	 COMPARETUP(state, tup1, tup2))
 
@@ -2927,6 +3306,33 @@ sort_bounded_heap(Tuplesortstate *state)
 }
 
 /*
+ * Sort all memtuples using quicksort.
+ *
+ * Quicksort is tuplesort's internal sort algorithm.  It is also generally
+ * preferred to replacement selection of runs during external sorts, except
+ * where incrementally spilling may be particularly beneficial.  Quicksort
+ * will generally be much faster than replacement selection's heapsort
+ * because modern CPUs are usually bottlenecked on memory access, and
+ * quicksort is a cache-oblivious algorithm.
+ */
+static void
+tuplesort_quicksort(Tuplesortstate *state)
+{
+	if (state->memtupcount > 1)
+	{
+		/* Can we use the single-key sort function? */
+		if (state->onlyKey != NULL)
+			qsort_ssup(state->memtuples, state->memtupcount,
+					   state->onlyKey);
+		else
+			qsort_tuple(state->memtuples,
+						state->memtupcount,
+						state->comparetup,
+						state);
+	}
+}
+
+/*
  * Insert a new tuple into an empty or existing heap, maintaining the
  * heap invariant.  Caller is responsible for ensuring there's room.
  *
@@ -2954,6 +3360,17 @@ tuplesort_heap_insert(Tuplesortstate *state, SortTuple *tuple,
 	memtuples = state->memtuples;
 	Assert(state->memtupcount < state->memtupsize);
 
+	/*
+	 * Once incremental heap spilling is abandoned, this routine should not be
+	 * called when loading runs.  memtuples will be an array of tuples in no
+	 * significant order, so calling here is inappropriate.  Even when
+	 * incremental spilling is still in progress, this routine does not handle
+	 * the second run's tuples (those are heapified to a limited extent that
+	 * they are appended, and thus kept away from those tuples in the first
+	 * run).
+	 */
+	Assert(!checkIndex || tupleindex == 0);
+
 	CHECK_FOR_INTERRUPTS();
 
 	/*
@@ -2985,6 +3402,13 @@ tuplesort_heap_siftup(Tuplesortstate *state, bool checkIndex)
 	int			i,
 				n;
 
+	/*
+	 * Once incremental heap spilling is abandoned, this routine should not be
+	 * called when loading runs.  memtuples will be an array of tuples in no
+	 * significant order, so calling here is inappropriate.
+	 */
+	Assert(!checkIndex || state->currentRun == 0);
+
 	if (--state->memtupcount <= 0)
 		return;
 
diff --git a/src/include/utils/tuplesort.h b/src/include/utils/tuplesort.h
index de6fc56..3679815 100644
--- a/src/include/utils/tuplesort.h
+++ b/src/include/utils/tuplesort.h
@@ -109,7 +109,8 @@ extern void tuplesort_end(Tuplesortstate *state);
 extern void tuplesort_get_stats(Tuplesortstate *state,
 					const char **sortMethod,
 					const char **spaceType,
-					long *spaceUsed);
+					long *spaceUsed,
+					int *rowsSortedMem);
 
 extern int	tuplesort_merge_order(int64 allowedMem);
 
-- 
1.9.1
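
For anyone skimming rather than applying the patch, here is a minimal
standalone sketch -- not PostgreSQL code; the element type and names are
simplified stand-ins -- of the idea behind the TSS_MEMTAPEMERGE case in
tuplesort_gettuple_common() above: each call returns the next value from
either the sorted on-tape prefix or the quicksorted in-memory array,
caching at most one tape element between calls, with tape elements winning
ties just as in the patch's <= comparison.

#include <stdbool.h>
#include <stdio.h>

typedef struct MergeState
{
	const int  *mem;			/* quicksorted in-memory run */
	int			memcount;
	int			current;		/* next in-memory element to return */

	const int  *tape;			/* stands in for the sorted on-tape run */
	int			tapecount;
	int			tapepos;

	int			tape_cache;		/* cached tape element from a prior call */
	bool		cached;
	bool		just_mem;		/* tape exhausted; return memory only */
} MergeState;

/* Return the next element in sorted order, or false when both are empty */
static bool
merge_next(MergeState *s, int *out)
{
	if (!s->just_mem)
	{
		int			t = 0;

		if (s->cached)
			t = s->tape_cache;
		else if (s->tapepos < s->tapecount)
			t = s->tape[s->tapepos++];
		else
			s->just_mem = true;	/* tape side just ran dry */

		if (!s->just_mem)
		{
			/* Tape element wins ties, as in the patch's <= comparison */
			if (s->current >= s->memcount || t <= s->mem[s->current])
			{
				s->cached = false;
				*out = t;
				return true;
			}
			/* Memory element is smaller; keep the tape element for later */
			s->tape_cache = t;
			s->cached = true;
			*out = s->mem[s->current++];
			return true;
		}
	}

	if (s->current < s->memcount)
	{
		*out = s->mem[s->current++];
		return true;
	}
	return false;
}

int
main(void)
{
	int			mem[] = {2, 3, 7, 8};	/* quicksorted in memory */
	int			tape[] = {1, 4, 5};		/* already sorted on tape */
	MergeState	s = {mem, 4, 0, tape, 3, 0, 0, false, false};
	int			v;

	while (merge_next(&s, &v))
		printf("%d ", v);				/* prints: 1 2 3 4 5 7 8 */
	printf("\n");
	return 0;
}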

