On Tue, Mar 22, 2016 at 3:35 PM, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote: > I've tested the patch you've sent on 2016/3/11, which I believe is the last > version. I haven't tuned the replacement_sort_mem at all. because my > understanding was that it's not a 9.6 material (per your message). So my > intent was to test the configuration people are likely to use by default.
I meant that using replacement selection in a special way with CREATE INDEX was not 9.6 material. But replacement_sort_mem is. And so, any case with the (maintenance)_work_mem <= 16MB will have used a heap for the first run. I'm sorry I did not make a point of telling you this. It's my fault. The result in any case is that pre-sorted cases will be similar with and without the patch, since replacement selection can thereby make one long run. But on non-sorted cases, the patch helps less because it is in use less -- with not so much data overall, possibly much less (which I think explains why the 1M row tests seem so much less interesting than the 10M row tests). I worry that at the low end, replacement_sort_mem makes the patch have one long run, but still some more other runs, so merging is unbalanced. We should consider if the patch can beat the master branch at the low end without using a replacement selection heap. It would do better in at least some cases in low memory conditions, possibly a convincing majority of cases. I had hoped that my recent idea (since committed) of resetting memory contexts would help a lot with regressions when work_mem is very low, and that particular theory isn't really tested here. > I'm not sure which commit are you referring to. The benchmark was done on > a414d96a (from 2016/3/10). However I'd expect that to affect both sets of > measurements, although it's possible that it affects the patched version > differently. You did test the right patches. It just so happens that the master branch now has the memory batching stuff now, so it doesn't get credited with that. I think this is good, though, because we care about 9.5 -> 9.6 regressions. Improvement ratio (master time/patched time) for Xeon 10 million row case "SELECT * FROM int_test_padding ORDER BY a DESC": For work_mem of 8MB = 0.83, 32MB = 0.62, 128MB = 0.52, 512MB = 0.47, 1024MB = 1.00 So, it gets faster than the master branch as more memory is available, but then it goes to 1.00 -- a perfect draw. I think that this happened simply because at that point, the sort was an internal sort (even though similar CREATE INDEX case did not go internal at the same point). The (internal) 1024MB case is not that much faster than the 512MB external case, which is pretty good. There are also "near draws", where the ratio is 0.98 or so. I think that this is because abbreviation is aborted, which can be a problem with synthetic data + text -- you get a very slow sort either way, where most time is spent calling strcoll(), and cache characteristics matter much less. Those cases seemingly take much longer overall, so this theory makes sense. Unfortunately, abbreviated keys for text that is not C locale text was basically disabled across the board today due to a glibc problem. :-( Whenever I see that the patch is exactly as fast as the master branch, I am skeptical. I am particularly skeptical of all i5 results (including 10M cases), because the patch seems to be almost perfectly matched to the master branch for CREATE INDEX cases (which are the best cases for the patch on your Xeon server) -- it's much easier to believe that there was a problem during the test, honestly, like maintenance_work_mem wasn't set correctly. Those two things are so different that I have a hard time imagining that they'd ever really draw. I mean, it's possible, but it's more likely to be a problem with testing. And, queries like "SELECT * FROM int_test_padding ORDER BY a DESC" return all rows, which adds noise from all the client overhead. In fact, you often see that adding more memory helps no case here, so it seem a bit pointless. Maybe they should be written like "SELECT * FROM (select * from int_test_padding ORDER BY a DESC OFFSET 1e10) ff" instead. And maybe queries like "SELECT DISTINCT a FROM int_test ORDER BY a" would be better as "SELECT COUNT(DISTINCT a) FROM int_test", in order to test the datum/aggregate case. Just suggestions. If you really wanted to make the patch look good, a sort with 5GB of work_mem is the best way, FWIW. The heap data structure used by the master branch tuplesort.c will handle that very badly. You use no temp_tablespaces here. I wonder if the patch would do better with that. Sorting can actually be quite I/O bound with the patch sometimes, where it's usually CPU/memory bound with the heap, especially with lots of work_mem. More importantly, it would be more informative if the temp_tablespace was not affected by I/O from Postgres' heap. I also like seeing a sample of "trace_sort = on" output. I don't expect you to carefully collect that in every case, but it can tell us a lot about what's really going on when benchmarking. Thanks -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (firstname.lastname@example.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers