On Tue, Aug 4, 2015 at 1:24 AM, Heikki Linnakangas <hlinn...@iki.fi> wrote:
> Yeah, I don't think there's a big performance difference between the two
> approaches. I'm not wedded to either approach. Whichever approach we use,
> my main point was that it would be better to handle this in the logical
> tape abstraction. In my approach, you would have those "in-memory tails"
> attached to the last two tapes. In Peter's approach, you would have one
> tape that's completely in memory, backed by the array. In either case,
> the tapes would look like normal tapes to most of tuplesort.c. There
> would be no extra TSS state, it would be TSS_SORTEDONTAPE or
> TSS_FINALMERGE as usual.
TBH, it's not clear to me why making the logical tape abstraction manage some fraction of memtuples has any advantage at all. The only case that I can imagine where this could be useful is where a logical tape does asynchronous I/O, and has the benefit of some portion of memtuples (probably half) as scratch space (most I/O probably occurs while tuplesort quicksorts the other half of memtuples). But that has nothing to do with building a better abstraction.

The more expansive patch that I'm currently working on, which covers every external sorting case -- the patch to *also* quicksort runs past the first run, regardless of whether or not we can get away with a "quicksort with spillover" -- is going very well. I haven't had a solid block of time to work on it this week due to other commitments, but it isn't very complicated. Performance benefits are considerable even without saving any I/O. It can be as much as twice as fast for "merge sort" sorts in some cases. So not quite as nice as "quicksort with spillover", but still a significant improvement, considering that writing everything out is inevitable for the cases helped.

As I said before, the upcoming patch has tuplesort give up on memtuples *ever* being a heap after the first run, whatever happens. I just quicksort and dump in batches past the first run. Since I give up on replacement selection only after the first run, I still have most of the upside of replacement selection, but little of the big downside of heap maintenance. This will result in smaller runs on average past the first run. I can give you at least 3 very strong arguments for why this is totally worth it in every case, but I'll wait till I'm asked for them. :-)

One useful saving made in this upcoming multi-tape-run patch is that it never treats any non-current run as part of the heap beyond its run number, even when currentRun is the first (run 0).
So no comparisons occur beyond the first run to maintain the heap invariant *even when the first run is current* -- tuples that belong to the second run are simply appended (we only do an extra comparison to determine that that's the run they belong in). So the second run (run 1) is not trusted to be heapified by dumptuples(), and is quicksorted (either by "quicksort with spillover", along with much of the first run, or on its own, when there are multiple conventional on-tape runs; it doesn't matter which way it is quicksorted). From there on, every run is quicksorted when memtuples fills, and written out entirely in memtuples-sized batches.

> The logical tape abstraction is currently too low-level for that. It's
> just a tape of bytes, and tuplesort.c determines where a tuple begins
> and ends. That needs to be changed so that the logical tape abstraction
> works tuple-at-a-time instead. For example, instead of LogicalTapeRead(N)
> which reads N bytes, you would have LogicalTapeReadTuple(), which reads
> the next tuple, and returns its length and the tuple itself. But that
> would be quite sensible anyway.

Why would it be sensible? I honestly wonder why you want to do things that way. What is the advantage of not having what I call the in-memory "subrun" managed by a logical tape?

It's already nothing like any other type of run in several ways. Aside from being all in-memory, it is often much larger. It's special in that it kind of rejects the preliminary determination that some tuples within memtuples need to be in a second, traditional, on-tape run (because we can just quicksort everything and merge with the existing single on-tape run). Also, we now need tuplesort_gettuple_common() to ask READTUP() what to tell its caller about freeing memory that is allocated within tuplesort.c directly.

The memtuples array is already treated as an array, a heap, the head of each tape that is merged, and maybe one other thing that I forgot about offhand.
The field SortTuple.tupindex has a total of 3 context-dependent meanings. Playing these kinds of games with the memtuples array is very much something that happens already.

More than anything else, I think that the new TSS_MEMTAPEMERGE state is justified as a special case because "quicksort with spillover" is legitimately a special case. Users will want to know how close they were to an internal sort when looking at EXPLAIN ANALYZE and so on. When cost_sort() is fixed to be a continuous function (which I think will be pretty nice for certain other problems), the reader of that code will want to know more about this "quicksort with spillover" special case that can save 99% of I/O for what is still classified as an external sort -- they will look in tuplesort.c for it, not the logical tape code.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers