On Sun, Mar 22, 2015 at 6:48 PM, Andrew Gierth <and...@tao11.riddles.org.uk> wrote: > Your version would have aborted abbrevation on that second query, thus > losing a 3 second speedup. How on earth is that supposed to be > justified? It's not like there's any realistically possible case where > your version performs faster than mine by more than a tiny margin.
As I said, that's really why you won the argument on this particular point. Why are you still going on about it? > Peter> Where I think your argument is stronger is around the cost of > Peter> actually aborting in particular (even though not much work is > Peter> thrown away). Scanning through the memtuples array once more > Peter> certainly isn't free. > > Yes, the cost of actually aborting abbreviation goes up with the number > of nulls. But your version of the code makes that WORSE, by making it > more likely that we will abort unnecessarily. > > If we use the raw value of memtuplecount for anything, it should be to > make it LESS likely that we abort abbreviations (since we'd be paying a > higher cost), not more. But benchmarking doesn't suggest that this is > necessary, at least not for numerics. > > Peter> And you could take the view that it's always worth the risk, > Peter> since it's at most a marginal cost saved if the first ~10k > Peter> tuples are representative, but a large cost incurred when they > Peter> are not. So on reflection, I am inclined to put the check for > Peter> non-null values back in. > > Right result but the wrong reasoning. I think that there is an argument to be made against abbreviation when we simply have a small list of strings to begin with (e.g. 50), as per the glibc strxfrm() docs (which, as I said, may not apply with numeric to the same extent). It didn't end up actually happening that way for the text opclass for various reasons, mostly because the cost model was already complicated enough. Ideally, the number of comparisons per key is at least 10 when we abbreviate with text, which works out at about 1,000 tuples (as costed by cost_sort()). For that reason, leaving aside the risk of aborting when the sampled tuples are not representative (which is a big issue, that does clearly swing things in favor of always disregarding NULLs), having few actual values to sort is in theory a reason to not encode/abbreviate. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers