On 03/24/2016 03:00 AM, Peter Geoghegan wrote:
> On Tue, Mar 22, 2016 at 3:35 PM, Tomas Vondra wrote:
>> I've tested the patch you've sent on 2016/3/11, which I believe is the last
>> version. I haven't tuned the replacement_sort_mem at all, because
>> my understanding was that it's not 9.6 material (per your
>> message). So my intent was to test the configuration people are
>> likely to use by default.
> I meant that using replacement selection in a special way with
> CREATE INDEX was not 9.6 material. But replacement_sort_mem is. And
> so, any case with (maintenance_)work_mem <= 16MB will have used a
> heap for the first run.
FWIW, maintenance_work_mem was set to 1GB on the i5 machine and 256MB on
the Xeon. Hmm, maybe that's why we see no difference for CREATE INDEX on
the i5, and an improvement on the Xeon.
> I'm sorry I did not make a point of telling you this. It's my fault.
> The result in any case is that pre-sorted cases will be similar with
> and without the patch, since replacement selection can thereby make
> one long run. But on non-sorted cases, the patch helps less because
> it is in use less -- with not so much data overall, possibly much
> less (which I think explains why the 1M row tests seem so much less
> interesting than the 10M row tests).
Not a big deal - it's easy enough to change the config and repeat the
benchmark. Are there any particular replacement_sort_mem values that you
think would be interesting to configure?
I have to admit I'm a bit afraid we'll introduce a new GUC that only
very few users will know how to set properly, and so most people will
run with the default value or set it to something stupid.
> I worry that at the low end, replacement_sort_mem makes the patch
> have one long run, but still some more other runs, so merging is
> unbalanced. We should consider if the patch can beat the master
> branch at the low end without using a replacement selection heap. It
> would do better in at least some cases in low memory conditions,
> possibly a convincing majority of cases. I had hoped that my recent
> idea (since committed) of resetting memory contexts would help a lot
> with regressions when work_mem is very low, and that particular
> theory isn't really tested here.
Are you saying none of the queries triggers the memory context resets?
What queries would trigger that (to test the theory)?
>> I'm not sure which commit you are referring to. The benchmark was
>> done on a414d96a (from 2016/3/10). However I'd expect that to
>> affect both sets of measurements, although it's possible that it
>> affects the patched version differently.
> You did test the right patches. It just so happens that the master
> branch now has the memory batching stuff, so it doesn't get
> credited with that. I think this is good, though, because we care
> about 9.5 -> 9.6 regressions.
So there's a commit in master (but not in 9.5), adding memory batching,
but it got committed before a414d96a so the benchmark does not measure
its impact (with respect to 9.5). Correct?
But if we care about 9.5 -> 9.6 regressions, then perhaps we should
include that commit into the benchmark, because that's what the users
will see? Or have I misunderstood the second part?
BTW which patch does the memory batching? A quick search through git log
did not return any recent patches mentioning these terms.
> Improvement ratio (master time/patched time) for Xeon 10 million row
> case "SELECT * FROM int_test_padding ORDER BY a DESC":
> For work_mem of 8MB = 0.83, 32MB = 0.62, 128MB = 0.52, 512MB = 0.47,
> 1024MB = 1.00
> So, it gets faster than the master branch as more memory is
> available, but then it goes to 1.00 -- a perfect draw. I think that
> this happened simply because at that point, the sort was an internal
> sort (even though the similar CREATE INDEX case did not go internal at
> the same point). The (internal) 1024MB case is not that much faster
> than the 512MB external case, which is pretty good.
> There are also "near draws", where the ratio is 0.98 or so. I think
> that this is because abbreviation is aborted, which can be a problem
> with synthetic data + text -- you get a very slow sort either way,
That is possible, yes. It's true that the worst regressions are on text,
although there are a few on numeric too (albeit not as significant).
> where most time is spent calling strcoll(), and cache
> characteristics matter much less. Those cases seemingly take much
> longer overall, so this theory makes sense. Unfortunately,
> abbreviated keys for text that is not C locale text were basically
> disabled across the board today due to a glibc problem. :-(
Yeah. Bummer :-(
> Whenever I see that the patch is exactly as fast as the master
> branch, I am skeptical. I am particularly skeptical of all i5
> results (including 10M cases), because the patch seems to be almost
> perfectly matched to the master branch for CREATE INDEX cases (which
> are the best cases for the patch on your Xeon server) -- it's much
> easier to believe that there was a problem during the test, honestly,
> like maintenance_work_mem wasn't set correctly. Those two things are
As I mentioned above, I hadn't realized work_mem does not matter for
CREATE INDEX, and maintenance_work_mem was set to a fixed value for the
whole test. And the two machines used different values for this
particular configuration value - Xeon used just 256MB, while i5 used
1GB. So while on i5 it was just a single chunk, on Xeon there were
multiple batches. Hence the different behavior.
> so different that I have a hard time imagining that they'd ever
> really draw. I mean, it's possible, but it's more likely to be a
> problem with testing. And, queries like "SELECT * FROM
> int_test_padding ORDER BY a DESC" return all rows, which adds noise
> from all the client overhead. In fact, you often see that adding more
No, it doesn't add overhead. The script actually does
COPY (query) TO '/dev/null'
on the server for all queries (except for the CREATE INDEX, obviously),
so there should be pretty much no overhead due to transferring rows to
the client and so on.
> memory helps no case here, so it seems a bit pointless. Maybe they
> should be written like "SELECT * FROM (select * from int_test_padding
> ORDER BY a DESC OFFSET 1e10) ff" instead. And maybe queries like
> "SELECT DISTINCT a FROM int_test ORDER BY a" would be better as
> "SELECT COUNT(DISTINCT a) FROM int_test", in order to test the
> datum/aggregate case. Just suggestions.
I believe the 'copy to /dev/null' achieves the same thing.
> If you really wanted to make the patch look good, a sort with 5GB of
> work_mem is the best way, FWIW. The heap data structure used by the
> master branch tuplesort.c will handle that very badly. You use no
> temp_tablespaces here. I wonder if the patch would do better with
> that. Sorting can actually be quite I/O bound with the patch
> sometimes, where it's usually CPU/memory bound with the heap,
> especially with lots of work_mem. More importantly, it would be more
> informative if the temp_tablespace was not affected by I/O from
I'll consider testing that. However, I don't think there was any
significant I/O on the machines - particularly not on the Xeon, which
has 16GB of RAM. So the temp files should fit into that quite easily.
The i5 machine has only 8GB of RAM, but it has 6 SSD drives in raid0. So
I doubt it was I/O bound.
> I also like seeing a sample of "trace_sort = on" output. I don't
> expect you to carefully collect that in every case, but it can tell
> us a lot about what's really going on when benchmarking.
Sure, I can collect that.
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services