On Sun, Jul 27, 2014 at 12:00 AM, Peter Geoghegan <p...@heroku.com> wrote:
> I attach a new revision.
I think that I may have missed a trick here. It turns out that it isn't expensive to also hash the original text values, to track their cardinality using HyperLogLog - it's hardly measurable when done at an opportune point just after strxfrm() expansion. I was already tracking the cardinality of the poor man's normalized keys using HyperLogLog. If I track the cardinality of both sets (original values and normalized keys), I can compare the two when deciding whether normalization should be aborted. That comparison turns out to be by far the most important consideration.

This causes the optimization to be applied when sorting millions of tuples with only a tiny number of distinct values (like 4 or 5), without making it any more likely that we fail to abort bad cases in a timely manner. That case is still significantly profitable - over 90% faster in one test - because the optimistic memcmp() still allows us to avoid any strcoll() calls. It looks about the same as using the "C" collation. Not quite the huge boost we see elsewhere, but still well worthwhile.

In general it seems well principled to have the "should I abort normalization?" algorithm mostly weigh how effective a proxy normalized key cardinality is for full key cardinality. If it is a good proxy then nothing else matters. If it's not a very good proxy, that can only be because there are many differences beyond the first 8 bytes. Only then will we weigh the total number of distinct normalized keys, and as the effectiveness of normalized key cardinality as a proxy for overall cardinality falls, our requirement for the overall number of distinct normalized keys shoots up rapidly.

-- 
Peter Geoghegan
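
PS: To make the shape of that abort test concrete, here is a minimal standalone C sketch of the heuristic as I've described it. The function name, the 0.90 proxy threshold, and the 0.05 base requirement are all invented for illustration (the actual constants and bookkeeping in the patch differ); full_card and norm_card stand in for the two HyperLogLog estimates.

#include <stdbool.h>
#include <stdio.h>

/*
 * Sketch of the abort heuristic described above.  All names and
 * thresholds are invented for illustration; this is not the patch's
 * actual code.  full_card and norm_card are HyperLogLog estimates of
 * the number of distinct original values and distinct 8-byte
 * normalized keys seen so far, respectively.
 */
static bool
should_abort_normalization(double full_card, double norm_card,
                           int tuple_count)
{
    /* How faithfully normalized keys stand in for full keys (0..1). */
    double proxy_quality = norm_card / full_card;

    /*
     * Good proxy: nearly every distinct value remains distinct after
     * truncation to the 8-byte normalized key, so nothing else matters
     * -- keep normalizing even with very few distinct values (e.g. 4
     * or 5 over millions of tuples).
     */
    if (proxy_quality > 0.90)           /* threshold assumed for sketch */
        return false;

    /*
     * Weak proxy: many differences lie beyond the first 8 bytes.
     * Demand a share of distinct normalized keys that grows rapidly as
     * the proxy degrades; otherwise the optimistic memcmp() will
     * rarely resolve comparisons and normalization won't pay for
     * itself.
     */
    return (norm_card / tuple_count) < (0.05 / proxy_quality);
}

int
main(void)
{
    /* Millions of tuples, 5 distinct values, perfect proxy: keep going. */
    printf("%d\n", should_abort_normalization(5.0, 5.0, 1000000));
    /* Many distinct values but heavy 8-byte prefix collisions: abort. */
    printf("%d\n", should_abort_normalization(500000.0, 2000.0, 1000000));
    return 0;
}

The point of the division by proxy_quality is that the requirement on distinct normalized keys isn't fixed: it rises sharply as the proxy weakens, which is the behavior described above.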