Re: [HACKERS] Using quicksort for every external sort run

Robert Haas Sun, 07 Feb 2016 09:22:07 -0800

On Sun, Feb 7, 2016 at 11:00 AM, Peter Geoghegan <p...@heroku.com> wrote:
> Right. Let me give you the executive summary first: I continue to
> believe, following thinking about the matter in detail, that this is a
> sensible compromise, that weighs everyone's concerns. It is pretty
> close to a win-win. I just need you to confirm what I say here in
> turn, so we're sure that we understand each other perfectly.


Makes sense to me.

>> The basic idea is that we will add a new GUC with a name like
>> replacement_sort_mem that will have a default value in the range of
>> 20-30MB; or possibly we will hardcode this value, but for purposes of
>> this email I'm going to assume it's a GUC.  If the value of work_mem
>> or maintenance_work_mem, whichever applies, is smaller than the value
>> of replacement_sort_mem, then the latter has no effect.
>
> By "no effect", you must mean that we always use a heap for the entire
> first run (albeit for the tail, with a hybrid quicksort/heap
> approach), but still use quicksort for every subsequent run, when it's
> clearly established that we aren't going to get one huge run. Is that
> correct?

Yes.

> It was my understanding, based on your emphasis on producing only a
> single run, as well as your recent remarks on this thread about the
> first run being special, that you are really only interested in the
> presorted case, where one run is produced. That is, you are basically
> not interested in preserving the general ability of replacement
> selection to double run size in the event of a uniform distribution.
> (That particular doubling property of replacement selection is now
> technically lost by virtue of using this new hybrid model *anyway*,
> although it will still make runs longer in general).
>
> You don't want to change the behavior of the current patch for the
> second or subsequent run; that should remain a quicksort, pure and
> simple. Do I have that right?

Yes.

> BTW, parallel sort should probably never use a heap anyway (ISTM that
> that will almost certainly be based on external sorts in the end). A
> heap is not really compatible with the parallel heap scan model.

I don't think I agree with this part, though I think it's unimportant
as far as the current patch is concerned.  My initial thought is that
parallel sort should work like this:

1. Each worker reads and sorts its input tuples just as it would in
non-parallel mode.

2. If, at the conclusion of the sort, the input tuples are still in
memory (quicksort) or partially in memory (quicksort with spillover),
then write them all to a tape.  If they are on multiple tapes, merge
those to a single tape.  If they are on a single tape, do nothing else
at this step.

3. At this point, we have one sorted tape per worker.  Perform a final
merge pass to get the final result.

The major disadvantage of this is that if the input hasn't been
relatively evenly partitioned across the workers, the work of sorting
will fall disproportionately on those that got more input.  We could,
in the future, make the logic more sophisticated.  For example, if
worker A is still reading the input and dumping sorted runs, worker B
could start merging those runs.  Or worker A could read tuples into a
DSM instead of backend-private memory, and worker B could then sort
them to produce a run.  While such optimizations are clearly
beneficial, I would not try to put them into a first parallel sort
patch.  It's too complicated.
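
For illustration only, the three steps above can be sketched in a few lines of Python. This is not PostgreSQL code: worker_sort and parallel_sort are hypothetical names, plain lists stand in for tapes, and the workers run sequentially here rather than in parallel.

```python
import heapq

def worker_sort(tuples):
    """Steps 1-2: a worker sorts its share and ends with one sorted 'tape'."""
    return sorted(tuples)

def parallel_sort(partitions):
    """Step 3: one final merge pass over one sorted tape per worker."""
    tapes = [worker_sort(p) for p in partitions]
    return list(heapq.merge(*tapes))

# Deliberately uneven partitioning: the result is still correct, but the
# first "worker" does most of the sorting, which is the disadvantage
# described above.
partitions = [[9, 1, 5, 7, 3, 8], [4, 2], [6]]
result = parallel_sort(partitions)
```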

>> One thing I just thought of (after the call) is that it might be
>> better for this GUC to be in units of tuples rather than in units of
>> memory; it's not clear to me why the optimal heap size should be
>> dependent on the tuple size, so we could have a threshold like 300,000
>> tuples or whatever.
>
> I think you're right that a number of tuples is the logical way to
> express the heap size (as a GUC unit). I think that the ideal setting
> for the GUC is large enough to recognize significant correlations in
> input data, which may be clustered, but no larger (at least while
> things don't all fit in L1 cache, or maybe L2 cache). We should "go
> for broke" with replacement selection -- we don't aim for anything
> less than ending up with 1 run by using the heap (merging 2 or 3 runs
> rather than 4 or 6 is far less useful, maybe harmful, when one of them
> is much larger). Therefore, I don't expect that we'll be practically
> disadvantaged by having fewer "hands to juggle" tuples here (we'll
> simply almost always have enough in practice -- more on that later).
> FWIW I don't think that any benchmark we've seen so far justifies
> doing less than "going for broke" with RS, even if you happen to have
> a very conservative perspective.
>
> One advantage of a GUC is that you can set it to zero, and always get
> a simple hybrid sort-merge strategy if that's desirable. I think that
> it might not matter much with multi-gigabyte work_mem settings anyway,
> though; you'll just see a small blip. Big (maintenance_)work_mem was
> by far my greatest concern in relation to using a heap in general, so
> I'm left pretty happy by this plan, I think. Lots of people can afford
> a multi-GB maintenance_work_mem these days, and CREATE INDEX is gonna
> be the most important case overall, by far.

Agreed.  I suspect that a default setting that is relatively small but
not zero will be good for most people, but if some people find
advantage in changing it to a smaller value, or zero, or a larger
value, that's fine with me.

>> 2. If (maintenance_)work_mem fills up completely, we will quicksort
>> all of the data we have in memory.  We will then regard the tail end
>> of that sorted data, in an amount governed by replacement_sort_mem, as
>> a heap, and use it to perform replacement selection until no tuples
>> remain for the current run.  Meanwhile, the rest of the sorted data
>> remains in memory untouched.  Logically, we're constructing a run of
>> tuples which is split between memory and disk: the head of the run
>> (what fits in all of (maintenance_)work_mem except for
>> replacement_sort_mem) is in memory, and the tail of the run is on
>> disk.
>
> I went back and forth on this during our call, but I now think that I
> was right that there will need to be changes in order to make the tail
> of the run a heap (*not* the quicksorted head), because routines like
> tuplesort_heap_siftup() assume that state->memtuples[0] is the head of
> the heap. This is currently assumed by the master branch for both the
> currentRun/nextRun replacement selection heap, as well as the heap
> used for merging. Changing this is probably fairly manageable, though
> (probably still not going to use memmove() for this, contrary to my
> remarks on the call).

OK.  I think if possible we want to try to do this by changing the
Tuplesortstate to identify where the heap is, rather than by using
memmove() to put it where we want it to be.
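
As a rough sketch of what "identify where the heap is" could mean, here is a sift-down that takes a base offset instead of assuming the heap starts at index 0. It is written in Python for brevity and is not the tuplesort.c code; the names sift_down, base and heap_size are made up for this example.

```python
def sift_down(a, base, heap_size, i):
    """Restore min-heap order for the heap stored in a[base:base+heap_size],
    starting from heap-relative index i. Nothing before `base` is touched."""
    while True:
        left = 2 * i + 1
        right = left + 1
        smallest = i
        if left < heap_size and a[base + left] < a[base + smallest]:
            smallest = left
        if right < heap_size and a[base + right] < a[base + smallest]:
            smallest = right
        if smallest == i:
            return
        a[base + i], a[base + smallest] = a[base + smallest], a[base + i]
        i = smallest

# A sorted array: the last 4 entries are treated as the heap. A sorted
# array already satisfies the min-heap property, so no heapify is needed.
memtuples = [1, 2, 3, 4, 10, 20, 30, 40]
heap_size = 4
base = len(memtuples) - heap_size

# Replace the heap root (memtuples[base]) with a new tuple and sift it down.
memtuples[base] = 35
sift_down(memtuples, base, heap_size, 0)
```

The sorted head memtuples[:base] is left exactly where it was, so nothing has to be moved.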

>> 3. If we reach the end of input before replacement selection runs out
>> of tuples for the current run, and if it finds no tuples for the next
>> run prior to that time, then we are done.  All of the tuples form a
>> single run and we can return the tuples in memory first followed by
>> the tuples on disk.  This case is highly likely to be a huge win over
>> what we have today, because (a) some portion of the tuples were sorted
>> via quicksort rather than heapsort and that's faster, (b) the tuples
>> that were sorted using a heap were sorted using a small heap rather
>> than a big one, and (c) we only wrote out the minimal number of tuples
>> to tape instead of, as we would have done today, all of them.
>
> Agreed.

Cool.

>> 4. If we reach this step, then replacement selection with a small heap
>> wasn't able to sort the input in a single run.  We have a bunch of
>> sorted data in memory which is the head of the same run whose tail is
>> already on disk; we now spill all of these tuples to disk.  That
>> leaves only the heapified tuples in memory.  We just ignore the fact
>> that they are a heap and treat them as unsorted.  We repeatedly do the
>> following: read tuples until work_mem is full, sort them, and dump the
>> result to disk as a run.  When all runs have been created, we merge
>> runs just as we do today.
>
> Right, so: having read this far, I'm almost sure that you intend that
> replacement selection is only ever used for the first run (we "go for
> broke" with RS). Good.

Yes, absolutely.
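
To make sure steps 2 through 4 read the same way to everyone, here is a toy model of them in Python. It is a sketch under stated assumptions, not the patch: hybrid_sort_runs is a hypothetical name, work_mem and heap_size are counted in tuples, lists stand in for tapes, and tuples destined for a later run are assumed to occupy the heap's slots as in classic replacement selection.

```python
import heapq
from itertools import islice

_END = object()  # sentinel marking end of input

def hybrid_sort_runs(source, work_mem, heap_size):
    """Return the list of sorted runs produced for `source`."""
    it = iter(source)
    mem = sorted(islice(it, work_mem))        # step 2: sort what fits in memory
    if len(mem) < work_mem:
        return [mem]                          # never filled memory: internal sort
    split = max(len(mem) - heap_size, 0)
    head, heap = mem[:split], mem[split:]     # a sorted list is a valid min-heap
    tail = []                                 # tail of the first run ("on disk")
    pending = []                              # tuples that must go to a later run
    while heap:
        smallest = heapq.heappop(heap)
        tail.append(smallest)
        nxt = next(it, _END)
        if nxt is _END:
            continue                          # input exhausted: drain the heap
        if nxt >= smallest:
            heapq.heappush(heap, nxt)         # still fits in the current run
        else:
            pending.append(nxt)               # too small: belongs to a later run
    runs = [head + tail]                      # step 3 if nothing else follows
    buffer = pending                          # step 4: treat leftovers as unsorted
    while True:
        buffer.extend(islice(it, work_mem - len(buffer)))
        if not buffer:
            break
        runs.append(sorted(buffer))           # every later run is a plain sort
        buffer = []
    return runs

# Presorted input ends up as a single run.
presorted = hybrid_sort_runs(range(100), work_mem=10, heap_size=3)

# Almost sorted, except the middle record arrives last: two runs.
skewed_input = [x for x in range(100) if x != 50] + [50]
skewed = hybrid_sort_runs(skewed_input, work_mem=10, heap_size=3)
merged = list(heapq.merge(*skewed))
```

Setting heap_size to 0 in this model makes every run, including the first, a plain sort of one memory load.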

>> This algorithm seems very likely to beat what we do today in
>> practically all cases.  The benchmarking Peter and others have already
>> done shows that building runs with quicksort rather than replacement
>> selection can often win even if the larger number of tapes requires a
>> multi-pass merge.  The only cases where it didn't seem to be a clear
>> win involved data that was already in sorted order, or very close to
>> it.
>
> ...*and* where there was an awful lot of data, *and* where there was
> very little memory in an absolute sense (e.g. work_mem = 4MB).
>
>> But with this algorithm, presorted input is fine: we'll quicksort
>> some of it (which is faster than replacement selection because
>> quicksort checks for presorted input) and sort the rest with a *small*
>> heap (which is faster than our current approach of sorting it with a
>> big heap when the data is already in order).
>
> I'm not going to defend the precheck in our quicksort implementation.
> It's unadulterated nonsense. The B&M quicksort implementation's use of
> insertion sort does accomplish this pretty well, though.

Let's leave that discussion for another day.

>> On top of that, we'll
>> only write out the minimal amount of data to disk rather than all of
>> it.  So we should still win.  On the other hand, if the data is out of
>> order, then we will do only a little bit of replacement selection
>> before switching over to building runs by quicksorting, which should
>> also win.
>
> Yeah -- we retain much of the benefit of "quicksort with spillover",
> too, without any cost model. This is also better than "quicksort with
> spillover" in that it limits the size of the heap, and so limits the
> extent to which the algorithm can "helpfully" spend ages spilling from
> an enormous heap. The new GUC can be explained to users as a kind of
> minimum burst capacity for getting a "half internal, half external"
> sort, which seems intuitive enough.

Right.  I really like the idea of limiting the heap size - I'm quite
hopeful that will let us hang onto the limited number of cases where
RS is better while giving up on it pretty quickly when it's a loser.
But even better, if you've got a case where RS is a win, limiting the
heap size has an excellent chance of making it a bigger win.  That's
quite appealing, too.

>> The worst case I was able to think of for this algorithm is an input
>> stream that is larger than work_mem and almost sorted: the only
>> exception is that the record that should be exactly in the middle is
>> all the way at the end.
>
>> We need to not be horrible in that case, but there's
>> absolutely no reason to believe that we will be.  We may even be
>> faster, but we certainly shouldn't be abysmally slower.
>
> Agreed.
>
> If we take a historical perspective, a 10MB or 30MB heap will still
> have a huge "juggling capacity" -- in practice it will almost
> certainly store enough tuples to make the "plate spinning circus
> trick" of replacement selection make the critical difference to run
> size. This new GUC is a delta between tuples for RS reordering. You
> can perhaps construct a "strategically placed banana skin" case to
> make this look bad before caching effects start to weigh us down, but
> I think you agree that it doesn't matter. "Juggling capacity" has
> nothing to do with modern hardware characteristics, except that modern
> machines are where the cost of excessive "juggling capacity" really
> hurts, so this is simple. It is simple *especially* because we can
> throw out the idea of a cost model that cares about caching effects in
> particular, but that's just one specific thing.

Yep.  I'm mostly relying on you to be correct about the actual
performance characteristics of replacement selection here.  If the
cutover point when we go from RS to QS to build runs turns out to be
wildly wrong, I plan to look sidelong in your direction.  I don't
think that's going to happen, though.

> BTW, you probably know this, but to be clear: When I talk about
> correlation, I refer specifically to what would appear within
> pg_stats.correlation as 1.0 -- I am not referring to a
> pg_stats.correlation of -1.0. The latter case is traditionally
> considered a worst case for RS.

Makes sense.
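
For anyone who wants to see the difference between the two cases, a toy version of classic replacement selection shows it. This is an illustrative Python sketch, not PostgreSQL code; rs_run_lengths is a hypothetical name and heap_size is counted in tuples.

```python
import heapq
from itertools import islice

_END = object()  # sentinel marking end of input

def rs_run_lengths(source, heap_size):
    """Run lengths produced by classic replacement selection."""
    it = iter(source)
    heap = list(islice(it, heap_size))
    heapq.heapify(heap)
    pending = []      # tuples held back for the next run
    lengths = []
    current = 0
    while heap:
        smallest = heapq.heappop(heap)
        current += 1
        nxt = next(it, _END)
        if nxt is not _END:
            if nxt >= smallest:
                heapq.heappush(heap, nxt)   # can still join the current run
            else:
                pending.append(nxt)         # must wait for the next run
        if not heap:
            # Current run is finished; the held-back tuples start the next one.
            lengths.append(current)
            current = 0
            heap, pending = pending, []
            heapq.heapify(heap)
    return lengths

ascending = rs_run_lengths(range(1000), heap_size=10)           # correlation 1.0
descending = rs_run_lengths(range(999, -1, -1), heap_size=10)   # correlation -1.0
```

With ascending input the model yields one run of 1000 tuples; with descending input every run is exactly as long as the heap, so the heap buys nothing.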

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

