Re: [HACKERS] Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort"

Peter Geoghegan Thu, 30 Jul 2015 16:03:07 -0700

On Thu, Jul 30, 2015 at 12:00 AM, Heikki Linnakangas <hlinn...@iki.fi> wrote:
> Hmm. You don't really need to merge the in-memory array into the tape, as
> you know that all the tuples in the in-memory must come after the tuples
> already on the tape. You can just return all the tuples from the tape first,
> and then all the tuples from the array.


It's more complicated than it appears, I think. Tuples may be variable
sized. WRITETUP() performs a pfree(), and gives us back a variable
amount of availMem. What if we dumped a single, massive, outlier tuple
out when a caller passes it and it goes to the root of the heap? We'd
dump that massive tuple in one go (this would be an incremental
dumptuples() call, which we still do in the patch), making things
!LACKMEM() again, but by an usually comfortable margin. We read in a
few more regular tuples, but we're done consuming tuples before things
ever get LACKMEM() again (no more dumping needed, at least with this
patch applied).

What prevents the tuple at the top of the in-memory heap at the point
of tuplesort_performsort() (say, one of the ones added to the heap as
our glut of memory was *partially* consumed) being less than the
last/greatest tuple on tape? If the answer is "nothing", a merge step
is clearly required.

This is not a problem when every single tuple is dumped, but that
doesn't happen anymore. I probably should have shown more tests, that
tested HeapTuple sorts (not just datum tuple sorts). I agree that
things at least usually happen as you describe, FWIW.

> I think it'd make sense to structure the code differently, to match the way
> I described this optimization above. Instead of adding a new tuplesort state
> for this, abstract this in the logical tape code. Add a function to attach
> an in-memory "tail" to a tape, and have LogicalTapeRead() read from the tail
> after reading the on-disk tape. The rest of the code wouldn't need to care
> that sometimes part of the tape is actually in memory.

I'll need to think about all of that. Certainly, quicksorting runs in
a more general way seems like a very promising idea, and I know that
this patch does not go far enough. But it also add this idea of not
dumping out most tuples, which seems impossible to generalize further,
so maybe it's a useful special case to start from for that reason (and
also because it's where the pain is currently felt most often).

>> +        * Note that there might actually be 2 runs, but only the
>> +        * contents of one of them went to tape, and so we can
>> +        * safely "pretend" that there is only 1 run (since we're
>> +        * about to give up on the idea of the memtuples array being
>> +        * a heap).  This means that if our sort happened to require
>> +        * random access, the similar "single run" optimization
>> +        * below (which sets TSS_SORTEDONTAPE) might not be used at
>> +        * all.  This is because dumping all tuples out might have
>> +        * forced an otherwise equivalent randomAccess case to
>> +        * acknowledge a second run, which we can avoid.
>
>
> Is that really true? We don't start a second run until we have to, i.e. when
> it's time to dump the first tuple of the second run to tape. So I don't
> think the case you describe above, where you have two runs but only one of
> them has tuples on disk, can actually happen.

I think we're talking about two slightly different things. I agree
that I am avoiding "starting" a second run because I am avoiding
dumping tuples, just as you say (I describe this as avoiding
"acknowledging" a second run). But there could still be SortTuples
that have a tupindex that is > 0 (they could be 1, to be specific).
It's pretty clear from looking at the TSS_BUILDRUNS case within
puttuple_common() that this is true. So, if instead you define
"starting" a tuple as adding a sortTuple with a tupindex that is > 0,
then yes, this comment is true.

The important thing is that since we're not dumping every tuple, it
doesn't matter whether or not a that TSS_BUILDRUNS case within
puttuple_common() ever took the "currentRun + 1" insertion path (which
can easily happen), provided things aren't so skewed that it ends up
on tape even without dumping all tuples (which seems much less
likely). As I've said, this optimization will occur a lot more often
then the existing one run optimization (assuming !randomAccess), as a
nice side benefit of not dumping every tuple. Quicksort does not use
HEAPCOMPARE(), so clearly having multiple runs in that "subrun" is a
non-issue.

Whether or not we say that a second run "started", or that there was
merely the "unfulfilled intent to start a new, second run" is just
semantics. While I certainly care about semantics, my point is that we
agree that this useful "pretend there is only one run" thing happens
(I think). The existing one run optimization only really occurs when
the range of values in the set of tuples is well characterized by the
tuple values observed during initial heapification, which is bad. Or
would be bad, if the existing optimization was good.     :-)

>> Performance
>> ==========
>
>
> Impressive!
>
>> Predictability
>> ==========
>
>
> Even more impressive!

Thanks!

> As an extra optimization, you could delay quicksorting the in-memory array
> until it's time to read the first tuple from it. If the caller reads only
> the top-N tuples from the sort for some reason (other than LIMIT, which we
> already optimize for), that could avoid a lot of work.

Won't comment on that yet, since it's predicated on the merge step
being unnecessary. I need to think about this some more.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Using quicksort and a merge step to significantly improve on tuplesort's single run "external sort"

Reply via email to