Re: [PATCHES] WIP: further sorting speedup

Simon Riggs Mon, 20 Feb 2006 10:45:49 -0800

On Sun, 2006-02-19 at 21:40 -0500, Tom Lane wrote:
> After applying Simon's recent sort patch, I was doing some profiling and
> noticed that sorting spends an unreasonably large fraction of its time
> extracting datums from tuples (heap_getattr or index_getattr).  The
> attached patch does something about this by pulling out the leading sort
> column of a tuple when it is received by the sort code or re-read from a
> "tape".  This increases the space needed by 8 or 12 bytes (depending on
> sizeof(Datum)) per in-memory tuple, but doesn't cost anything as far as
> the on-disk representation goes.  The effort needed to extract the datum
> at this point is well repaid because the tuple will normally undergo
> multiple comparisons while it remains in memory.  In some quick tests
> the patch seemed to make for a significant speedup, on the order of 30%,
> despite increasing the number of runs emitted because of the smaller
> available memory.


Yeh, this is essentially the cache-the-heapgetattr idea. I'd been trying
to get that to perform by caching the fcinfo values, which was a more
complex way and relied on full key extraction. To my chagrin, I had
great difficulty that way, but now I see the benefit of first-key
extraction explains why. The 30% speedup sounds like my original
expectation. Anyway, kudos to you. 

[I'd stopped working on that to give Tim some space]

> The choice to pull out just the leading column, rather than all columns,
> is driven by concerns of (a) code complexity and (b) memory space.
> Having the extra columns pre-extracted wouldn't buy anything anyway
> in the common case where the leading key determines the result of
> a comparison.

I think key extraction is a good idea, for more reasons than just the
heapgetattr. 

For longer heap rows, putting the whole row through the sort is inferior
to extracting all keys plus a pointer to the tuple, according to:

"AlphaSort: A Cache-Sensitive Parallel External
Sort", Nyberg et al, VLDB Journal 4(4): 603-627 (1995)

The above paper makes a good case that full key extraction is a great
idea above a tuple length of 16 bytes, i.e. we don't need to do it for
most CREATE INDEX situations, but it would be very helpful for heap
sorts.

I agree that as long as we are swamped by the cost of heapgetattr, then
it does seem likely that first-key extraction (and keeping it with the
tuple itself) will be a win in most cases over full-key extraction.

Nyberg et al also touch on a further point, which Luke has just
mentioned briefly on list (but we have discussed at further length). Now
that we have dropped the restriction of N=6, giving very large numbers
of runs, this also weakens the argument as to why heapsort is a good
candidate for sort algorithm. The reason for choosing heapsort was that
it gave runs ~2*Size(memory), whereas qsort produces runs only
~Size(memory). But if we have as many runs as we like, then using qsort
is not an issue. Which is good because qsort is faster in practice and
much more importantly, performs better with larger memory: heap sort
seems to suffer from a slow down when you give it *too much* memory. 

Which leaves the conclusion: further tuning of the heapsort mechanism is
*probably* not worthwhile in relation to the run forming stage of
sorting. (The variable nature of the qsort algorithm might seem an
issue, but if we execute it N times for N runs, then we'll get a much
less variable performance from it than we saw on those individual tests
earlier, so the predictability of the heapsort isn't as important a
reason to keep it).

But it seems this patch provides a win that is not dependent upon the
sort algorithm used, so its a win whatever we do.

(We still need the heapsort for the final merge, so I think we still
need to look at tuning of the final merge stage when we have a very
large work_mem setting, or simply limiting the size of the heap used
above a certain point.)

> This is still WIP because it leaks memory intra-query (I need to fix it
> to clean up palloc'd space better).  I thought I'd post it now in case
> anyone wants to try some measurements for their own favorite test cases.

> In particular it would be interesting to see what happens for a
> multi-column sort with lots of duplicated keys in the first column,
> which is the case where the least advantage would be gained.

Hmmm, well it seems clear that there will be an optimum number of keys
to be pre-extracted for any particular sort, though we will not be able
to tell what that is until we are mid-way through the sort. Using
heapsort we wouldn't really have much opportunity to decide to change
the number of columns pre-extracted...if we did use qsort, we could
learn from the last run how many keys to pre-extract.

Incidentally, I do think heapgetattr can be tuned further. When a tuple
has nulls we never use cached offset values. However, we could work out
the firstnullableattno for a table/resultset and keep cached offset
values for all attnum < firstnullableattno. If the tuple contains nulls,
we would just add the bytes for the null flags onto the offset. Non
nullable columns tend to be at the front of rows and also tend to be
targets for sort keys.

Best Regards, Simon Riggs


---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

Re: [PATCHES] WIP: further sorting speedup

Reply via email to