On Tue, Mar 29, 2016 at 12:43 PM, Peter Geoghegan <p...@heroku.com> wrote:
> A complete do-over from Tomas would be best, here. He has already
> acknowledged that the i5 CREATE INDEX results were completely invalid.
The following analysis is all based on the Xeon numbers, which as I've said we should focus on pending a do-over from Tomas. Especially important here is the largest set -- the 10M numbers from results-xeon-10m.ods.

I think that abbreviation distorts things here. We also see distortion from the "padding" cases. Rather a lot of "padding" is used, FWIW. From Tomas' script:

    INSERT INTO numeric_test_padding SELECT a, repeat(md5(a::text), 10) FROM data_float ORDER BY a;

This makes the tests have TOAST overhead.

Some important observations on results-xeon-10m:

* There are almost no regressions for types that don't use abbreviation. There might be one exception, where there is both padding and presorted input -- the 32MB high_cardinality_almost_asc/high_cardinality_sorted/unique_sorted "SELECT * FROM int_test_padding ORDER BY a" cases, which take 26% - 35% longer (those are all basically the same case). But the patch is a big win in the high_cardinality_random, unique_random, and even unique_almost_asc categories, and whenever DESC order was requested, in all categories (I note that there is certainly an emphasis on pre-sorted cases in the choice of categories). Other than that, no regressions for non-abbreviated types.

* No CREATE INDEX case is ever appreciably regressed, even with maintenance_work_mem at 8MB, 1/8 of its default value of 64MB. (Maybe we lose 1% - 3% in the other (results-xeon-1m.ods) cases, where maintenance_work_mem is close to, or actually high enough for, an internal sort.) It's a bit odd that "CREATE INDEX x ON text_test_padding (a)" is about a wash for high_cardinality_almost_asc, but I think that's just because we're badly I/O bound for this presorted case and cannot make up for it with quicksort's "bubble sort best case" precheck for presortedness, so replacement selection does better in a way that might even result in a clean draw. CREATE INDEX looks very good in general.
I think abbreviation might abort in one or two cases for text, but the picture for the patch is still solid.

* "Padding" can really distort low-end cases, which become more about moving big tuples around than about actual sorting. If you really want to see how high_cardinality_almost_asc queries like "SELECT * FROM text_test_padding ORDER BY a" are testing the wrong thing, consider the best and worst case for the master branch with any amount of work_mem. The 10 million tuple high_cardinality_almost_asc case takes 40.16 seconds, 39.95 seconds, 40.98 seconds, 41.28 seconds, and 42.1 seconds for respective work_mem settings of 8MB, 32MB, 128MB, 512MB, and 1024MB. This is a very narrow case: it totally de-emphasizes comparison cost, emphasizes moving tuples around, and involves abbreviation of text with a merge phase that cannot use abbreviation -- a merge phase that only the patch has, because master hits the replacement selection best case. The case is seriously short-changed by the memory batching refund thing in practice. When is *high cardinality text* (not dates or something) ever likely to be found in pre-sorted order for 10 million tuples in the real world? Besides, we just stopped trusting strxfrm(), so the case would probably be a wash now at worst.

* The more plausible padding + presorted + abbreviation case that is sometimes regressed is "SELECT * FROM numeric_test_padding ORDER BY a". But that's regressed a lot less than the aforementioned "SELECT * FROM text_test_padding ORDER BY a" case, and only at the low end. It is sometimes faster where the original case I mentioned is slower.

* Client overhead may distort things in the case of queries like "SELECT * FROM foo ORDER BY bar". This could be worse for the patch, which does relatively more computation during the final on-the-fly merge phase (which is great when you can overlap that with I/O; perhaps not when you get more icache misses from other computation).
Aside from just adding a lot of noise, this could unfairly make the patch look a lot worse than master.

Now, I'm not saying all of this doesn't matter. But these are all fairly narrow, pathological cases, often more about moving big tuples around (in memory and on disk) than about sorting. These regressions are well worth it. I don't think I can do any more than I already have to fix these cases; it may be impossible. It's very difficult to come up with an algorithm that's unambiguously better in every possible case. I bent over backwards to fix the low-end regressions already.

In memory-rich environments with lots of I/O bandwidth, I've seen this patch make CREATE INDEX ~2.5x faster for int4, on a logged table. More importantly, the patch makes setting maintenance_work_mem easy. Users' intuition for how sizing it ought to work now becomes more or less correct: in general, for each individual utility command bound by maintenance_work_mem, more memory is better. That's the primary value in having tuple sorting be cache oblivious for us; the smooth cost function of sorting makes tuning relatively easy, and gives us a plausible path towards managing local memory for sorting and hashing dynamically for the entire system. I see no other way for us to get there.

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers