On Fri, Feb 5, 2016 at 9:31 AM, Robert Haas <robertmh...@gmail.com> wrote:
> Peter, please weigh in and let me know if I've gotten anything
> incorrect here or if you think of other concerns afterwards.

Right. Let me give you the executive summary first: having thought
about the matter in detail, I continue to believe that this is a
sensible compromise that weighs everyone's concerns. It is pretty
close to a win-win. I just need you to confirm what I say here in
turn, so we're sure that we understand each other perfectly.

> The basic idea is that we will add a new GUC with a name like
> replacement_sort_mem that will have a default value in the range of
> 20-30MB; or possibly we will hardcode this value, but for purposes of
> this email I'm going to assume it's a GUC.  If the value of work_mem
> or maintenance_work_mem, whichever applies, is smaller than the value
> of replacement_sort_mem, then the latter has no effect.

By "no effect", you must mean that we always use a heap for the entire
first run (albeit for the tail, with a hybrid quicksort/heap
approach), but still use quicksort for every subsequent run, when it's
clearly established that we aren't going to get one huge run. Is that
correct?

It was my understanding, based on your emphasis on producing only a
single run, as well as your recent remarks on this thread about the
first run being special, that you are really only interested in the
presorted case, where one run is produced. That is, you are basically
not interested in preserving the general ability of replacement
selection to double run size in the event of a uniform distribution.
(That particular doubling property of replacement selection is now
technically lost by virtue of using this new hybrid model *anyway*,
although it will still make runs longer in general).
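
To make the doubling property concrete, here is a minimal sketch of
classic replacement selection in Python. It is an illustration only,
not tuplesort.c: the function name, the (run_number, value) heap
entries, and the tuple counts are all assumptions made for the example.

```python
import heapq
import random
from itertools import islice


def replacement_selection(items, heap_size):
    """Split items into sorted runs using classic replacement selection."""
    it = iter(items)
    # Entries are (run_number, value): tuples deferred to the next run sort
    # after everything that still belongs to the current run.
    heap = [(0, v) for v in islice(it, heap_size)]
    heapq.heapify(heap)
    runs = []
    while heap:
        run_no, value = heapq.heappop(heap)
        if run_no == len(runs):
            runs.append([])              # first tuple of a new run
        runs[run_no].append(value)
        for nxt in islice(it, 1):        # read at most one replacement tuple
            # A tuple smaller than the one just written cannot join this run.
            tag = run_no if nxt >= value else run_no + 1
            heapq.heappush(heap, (tag, nxt))
    return runs


if __name__ == "__main__":
    rng = random.Random(0)
    heap_size = 500
    uniform = [rng.random() for _ in range(100_000)]
    middle = replacement_selection(uniform, heap_size)[1:-1]
    # Average run length relative to the heap size; roughly 2 is expected
    # for uniformly distributed input.
    print(sum(map(len, middle)) / len(middle) / heap_size)
    # Ascending input collapses into a single run.
    print(len(replacement_selection(range(100_000), heap_size)))
    # Descending input yields runs no longer than the heap itself.
    print(len(replacement_selection(range(99_999, -1, -1), heap_size)))
```

The first and last runs are left out of the average because they are
atypical: the first starts from an unconditioned heap and the last is
cut short by end of input.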

You don't want to change the behavior of the current patch for the
second or subsequent run; that should remain a quicksort, pure and
simple. Do I have that right?

BTW, parallel sort should probably never use a heap anyway (ISTM that
that will almost certainly be based on external sorts in the end). A
heap is not really compatible with the parallel heap scan model.

> One thing I just thought of (after the call) is that it might be
> better for this GUC to be in units of tuples rather than in units of
> memory; it's not clear to me why the optimal heap size should be
> dependent on the tuple size, so we could have a threshold like 300,000
> tuples or whatever.

I think you're right that a number of tuples is the logical way to
express the heap size (as a GUC unit). I think that the ideal setting
for the GUC is large enough to recognize significant correlations in
input data, which may be clustered, but no larger (at least while
things don't all fit in L1 cache, or maybe L2 cache). We should "go
for broke" with replacement selection -- we don't aim for anything
less than ending up with 1 run by using the heap (merging 2 or 3 runs
rather than 4 or 6 is far less useful, maybe harmful, when one of them
is much larger). Therefore, I don't expect that we'll be practically
disadvantaged by having fewer "hands to juggle" tuples here (we'll
simply almost always have enough in practice -- more on that later).
FWIW I don't think that any benchmark we've seen so far justifies
doing less than "going for broke" with RS, even if you happen to have
a very conservative perspective.

One advantage of a GUC is that you can set it to zero, and always get
a simple hybrid sort-merge strategy if that's desirable. I think that
it might not matter much with multi-gigabyte work_mem settings anyway,
though; you'll just see a small blip. Big (maintenance_)work_mem was
by far my greatest concern in relation to using a heap in general, so
I'm left pretty happy by this plan, I think. Lots of people can afford
a multi-GB maintenance_work_mem these days, and CREATE INDEX is gonna
be the most important case overall, by far.

> 2. If (maintenance_)work_mem fills up completely, we will quicksort
> all of the data we have in memory.  We will then regard the tail end
> of that sorted data, in an amount governed by replacement_sort_mem, as
> a heap, and use it to perform replacement selection until no tuples
> remain for the current run.  Meanwhile, the rest of the sorted data
> remains in memory untouched.  Logically, we're constructing a run of
> tuples which is split between memory and disk: the head of the run
> (what fits in all of (maintenance_)work_mem except for
> replacement_sort_mem) is in memory, and the tail of the run is on
> disk.

I went back and forth on this during our call, but I now think that I
was right that there will need to be changes in order to make the tail
of the run a heap (*not* the quicksorted head), because routines like
tuplesort_heap_siftup() assume that state->memtuples[0] is the head of
the heap. This is currently assumed by the master branch for both the
currentRun/nextRun replacement selection heap, as well as the heap
used for merging. Changing this is probably fairly manageable, though
(probably still not going to use memmove() for this, contrary to my
remarks on the call).
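
As a rough illustration of what "the heap does not start at slot 0"
means, the sketch below keeps a min-heap in the tail of a list and
addresses it through a base offset. The names sift_down, replace_top
and base are hypothetical; this is not how tuplesort_heap_siftup() is
written, only one way the indexing could be generalized.

```python
def sift_down(a, base, n, i=0):
    """Restore min-heap order in the n-slot heap stored at a[base:base+n].

    i is a heap-relative slot; the children of slot i live at relative
    slots 2*i + 1 and 2*i + 2, and base is added only when touching a.
    """
    while True:
        child = 2 * i + 1
        if child >= n:
            return
        if child + 1 < n and a[base + child + 1] < a[base + child]:
            child += 1                   # pick the smaller child
        if a[base + i] <= a[base + child]:
            return
        a[base + i], a[base + child] = a[base + child], a[base + i]
        i = child


def replace_top(a, base, n, new_value):
    """Return the heap minimum and put new_value in its place."""
    top = a[base]
    a[base] = new_value
    sift_down(a, base, n)
    return top


if __name__ == "__main__":
    memtuples = list(range(20))          # fully sorted, as after a quicksort
    base, n = 15, 5                      # last five slots act as the heap
    # A sorted slice is already a valid min-heap, so no build step is needed.
    print(replace_top(memtuples, base, n, 22))   # smallest tuple in the tail
    print(memtuples[:base])                      # sorted head is untouched
```

Only the array accesses change; the parent/child arithmetic stays
heap-relative, which is why the change looks manageable.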

> 3. If we reach the end of input before replacement selection runs out
> of tuples for the current run, and if it finds no tuples for the next
> run prior to that time, then we are done.  All of the tuples form a
> single run and we can return the tuples in memory first followed by
> the tuples on disk.  This case is highly likely to be a huge win over
> what we have today, because (a) some portion of the tuples were sorted
> via quicksort rather than heapsort and that's faster, (b) the tuples
> that were sorted using a heap were sorted using a small heap rather
> than a big one, and (c) we only wrote out the minimal number of tuples
> to tape instead of, as we would have done today, all of them.


> 4. If we reach this step, then replacement selection with a small heap
> wasn't able to sort the input in a single run.  We have a bunch of
> sorted data in memory which is the head of the same run whose tail is
> already on disk; we now spill all of these tuples to disk.  That
> leaves only the heapified tuples in memory.  We just ignore the fact
> that they are a heap and treat them as unsorted.  We repeatedly do the
> following: read tuples until work_mem is full, sort them, and dump the
> result to disk as a run.  When all runs have been created, we merge
> runs just as we do today.

Right, so: having read this far, I'm almost sure that you intend that
replacement selection is only ever used for the first run (we "go for
broke" with RS). Good.
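
Under that reading, steps 2 through 4 can be sketched as follows. This
is a toy model in Python, not the patch: hybrid_runs, work_mem_tuples
and heap_tuples are made-up names, sorted() stands in for quicksort,
and plain lists stand in for tapes. Sizes are counted in tuples for
simplicity.

```python
import heapq
from itertools import islice


def hybrid_runs(tuples, work_mem_tuples, heap_tuples):
    """Build sorted runs: one quicksort/heap hybrid run, then plain sorts."""
    it = iter(tuples)
    # Step 2: memory is full, so sort everything we have.
    mem = sorted(islice(it, work_mem_tuples))
    if not mem:
        return []
    split = max(len(mem) - heap_tuples, 0)
    head = mem[:split]                   # stays in memory, untouched
    # A sorted slice already satisfies the min-heap property. The 0/1 tag
    # separates current-run tuples from tuples deferred to the next run.
    heap = [(0, v) for v in mem[split:]]
    tail = []                            # stands in for the on-disk tail
    while heap and heap[0][0] == 0:
        _, value = heapq.heappop(heap)
        tail.append(value)
        for nxt in islice(it, 1):
            heapq.heappush(heap, (0 if nxt >= value else 1, nxt))
    runs = [head + tail]                 # step 3: possibly the only run
    # Step 4: leftover heap contents are treated as unsorted input, and
    # every later run is filled to work_mem and sorted, nothing more.
    pending = [v for _, v in heap]
    while True:
        pending.extend(islice(it, work_mem_tuples - len(pending)))
        if not pending:
            return runs
        runs.append(sorted(pending))
        pending = []


if __name__ == "__main__":
    presorted = range(5000)
    print(len(hybrid_runs(presorted, 1000, 100)))        # one run
    scrambled = [(i * 7919) % 10007 for i in range(10007)]
    runs = hybrid_runs(scrambled, 1000, 100)
    print(list(heapq.merge(*runs)) == sorted(scrambled))  # merge is correct
```

Passing heap_tuples=0 skips replacement selection entirely in this
model, which corresponds to setting the GUC to zero.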

> This algorithm seems very likely to beat what we do today in
> practically all cases.  The benchmarking Peter and others have already
> done shows that building runs with quicksort rather than replacement
> selection can often win even if the larger number of tapes requires a
> multi-pass merge.  The only cases where it didn't seem to be a clear
> win involved data that was already in sorted order, or very close to
> it.

...*and* where there was an awful lot of data, *and* where there was
very little memory in an absolute sense (e.g. work_mem = 4MB).

> But with this algorithm, presorted input is fine: we'll quicksort
> some of it (which is faster than replacement selection because
> quicksort checks for presorted input) and sort the rest with a *small*
> heap (which is faster than our current approach of sorting it with a
> big heap when the data is already in order).

I'm not going to defend the precheck in our quicksort implementation.
It's unadulterated nonsense. The Bentley & McIlroy (B&M) quicksort
implementation's use of insertion sort does handle presorted input
pretty well, though.

> On top of that, we'll
> only write out the minimal amount of data to disk rather than all of
> it.  So we should still win.  On the other hand, if the data is out of
> order, then we will do only a little bit of replacement selection
> before switching over to building runs by quicksorting, which should
> also win.

Yeah -- we retain much of the benefit of "quicksort with spillover",
too, without any cost model. This is also better than "quicksort with
spillover" in that it limits the size of the heap, and so limits the
extent to which the algorithm can "helpfully" spend ages spilling from
an enormous heap. The new GUC can be explained to users as a kind of
minimum burst capacity for getting a "half internal, half external"
sort, which seems intuitive enough.

> The worst case I was able to think of for this algorithm is an input
> stream that is larger than work_mem and almost sorted: the only
> exception is that the record that should be exactly in the middle is
> all the way at the end.

> We need to not be horrible in that case, but there's
> absolutely no reason to believe that we will be.  We may even be
> faster, but we certainly shouldn't be abysmally slower.


If we take a historical perspective, a 10MB or 30MB heap will still
have a huge "juggling capacity" -- in practice it will almost
certainly store enough tuples to make the "plate spinning circus
trick" of replacement selection make the critical difference to run
size. In effect, the new GUC caps how far out of place a tuple can be
while RS is still able to reorder it into the current run. You
can perhaps construct a "strategically placed banana skin" case to
make this look bad before caching effects start to weigh us down, but
I think you agree that it doesn't matter. "Juggling capacity" has
nothing to do with modern hardware characteristics, except that modern
machines are where the cost of excessive "juggling capacity" really
hurts, so this is simple. It is simple *especially* because we can
throw out the idea of a cost model that cares about caching effects in
particular, but that's just one specific thing.
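
One way to quantify "juggling capacity" is the following sketch,
assuming a toy input where no tuple sits more than k positions away
from its sorted position: a heap of k + 1 tuples is then enough for
replacement selection to emit a single run. The helper names
count_runs and locally_shuffled are invented for this example.

```python
import heapq
import random
from itertools import islice


def count_runs(items, heap_size):
    """Count the runs replacement selection produces for a given heap size."""
    it = iter(items)
    heap = [(0, v) for v in islice(it, heap_size)]
    heapq.heapify(heap)
    runs, last_run = 0, -1
    while heap:
        run_no, value = heapq.heappop(heap)
        if run_no != last_run:
            runs, last_run = runs + 1, run_no
        for nxt in islice(it, 1):
            # Tuples smaller than the last one written wait for the next run.
            tag = run_no if nxt >= value else run_no + 1
            heapq.heappush(heap, (tag, nxt))
    return runs


def locally_shuffled(n, block, seed=0):
    """Return 0..n-1 with each consecutive block of values shuffled.

    No value ends up more than block - 1 positions from its sorted slot.
    """
    rng = random.Random(seed)
    out = []
    for start in range(0, n, block):
        chunk = list(range(start, min(start + block, n)))
        rng.shuffle(chunk)
        out.extend(chunk)
    return out


if __name__ == "__main__":
    data = locally_shuffled(100_000, 50)
    for heap_size in (5, 25, 50, 5000):
        print(heap_size, count_runs(data, heap_size))
```

Once the heap covers the maximum displacement, making it larger buys
nothing for this input: the run count is already 1.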

BTW, you probably know this, but to be clear: When I talk about
correlation, I refer specifically to what would appear within
pg_stats.correlation as 1.0 -- I am not referring to a
pg_stats.correlation of -1.0. The latter case is traditionally
considered a worst case for RS.

Peter Geoghegan
