On Tue, Mar 29, 2016 at 12:43 PM, Peter Geoghegan <p...@heroku.com> wrote:
> A complete do-over from Tomas would be best, here. He has already
> acknowledged that the i5 CREATE INDEX results were completely invalid.
The following analysis is all based on the Xeon numbers, which as I've said we should focus on pending a do-over from Tomas. Especially important here is the largest set -- the 10M numbers from results-xeon-10m.ods.

I think that abbreviation distorts things here. We also see distortion from the "padding" cases. Rather a lot of "padding" is used, FWIW. From Tomas' script:

    INSERT INTO numeric_test_padding SELECT a, repeat(md5(a::text), 10) FROM data_float ORDER BY a;

This makes the tests have TOAST overhead.

Some important observations on results-xeon-10m:

* There are almost no regressions for types that don't use abbreviation. There might be one exception, where there is both padding and presorted input -- the 32MB high_cardinality_almost_asc/high_cardinality_sorted/unique_sorted "SELECT * FROM int_test_padding ORDER BY a" cases, which take 26% - 35% longer (those are all basically the same case). But the patch is a big win in the high_cardinality_random, unique_random, and even unique_almost_asc categories, and whenever DESC order was requested, in all categories (I note that there is certainly an emphasis on pre-sorted cases in the choice of categories). Other than that, no regressions for non-abbreviated types.

* No CREATE INDEX case is ever appreciably regressed, even with maintenance_work_mem at 8MB, 1/8 of its default value of 64MB. (Maybe we lose 1% - 3% in the other (results-xeon-1m.ods) cases, where maintenance_work_mem is close to, or actually high enough for, an internal sort.) It's a bit odd that "CREATE INDEX x ON text_test_padding (a)" is about a wash for high_cardinality_almost_asc, but I think that's just because we're badly I/O bound for this presorted case and cannot make up for it with quicksort's "bubble sort best case" precheck for presortedness, so replacement selection does better in a way that might even result in a clean draw. CREATE INDEX looks very good in general.
I think abbreviation might abort in one or two cases for text, but the picture for the patch is still solid.

* "Padding" can really distort low-end cases, which become more about moving big tuples around than about actual sorting. If you really want to see how high_cardinality_almost_asc queries like "SELECT * FROM text_test_padding ORDER BY a" are testing the wrong thing, consider the best and worst case for the master branch with any amount of work_mem. The 10 million tuple high_cardinality_almost_asc case takes 40.16 seconds, 39.95 seconds, 40.98 seconds, 41.28 seconds, and 42.1 seconds for respective work_mem settings of 8MB, 32MB, 128MB, 512MB, and 1024MB. This is a very narrow case: it totally de-emphasizes comparison cost, emphasizes moving tuples around, and involves abbreviation of text with a merge phase that cannot use abbreviation -- a merge phase that only the patch has, because master hits the replacement selection best case. The case is seriously short-changed by the memory batching refund thing in practice. When is *high cardinality text* (not dates or something) ever likely to be found in pre-sorted order for 10 million tuples in the real world? Besides, we just stopped trusting strxfrm(), so the case would probably be a wash now at worst.

* The more plausible padding + presorted + abbreviation case that is sometimes regressed is "SELECT * FROM numeric_test_padding ORDER BY a". But that's regressed a lot less than the aforementioned "SELECT * FROM text_test_padding ORDER BY a" case, and only at the low end. It is sometimes faster where the original case I mentioned is slower.

* Client overhead may distort things in the case of queries like "SELECT * FROM foo ORDER BY bar". This could be worse for the patch, which does relatively more computation during the final on-the-fly merge phase (which is great when you can overlap that with I/O; perhaps not when you get more icache misses from other computation).
Aside from just adding a lot of noise, this could unfairly make the patch look a lot worse than master.

Now, I'm not saying all of this doesn't matter. But these are all fairly narrow, pathological cases, often more about moving big tuples around (in memory and on disk) than about sorting. These regressions are well worth it. I don't think I can do any more than I already have to fix these cases; it may be impossible. It's very difficult to come up with an algorithm that's unambiguously better in every possible case. I bent over backwards to fix the low-end regressions already.

In memory-rich environments with lots of I/O bandwidth, I've seen this patch make CREATE INDEX ~2.5x faster for int4, on a logged table. More importantly, the patch makes setting maintenance_work_mem easy. Users' intuition for how sizing it ought to work now becomes more or less correct: in general, for each individual utility command bound by maintenance_work_mem, more memory is better. That's the primary value in having tuple sorting be cache oblivious for us; the smooth cost function of sorting makes tuning relatively easy, and gives us a plausible path towards managing local memory for sorting and hashing dynamically for the entire system. I see no other way for us to get there.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers