On Sun, Jul 27, 2014 at 12:00 AM, Peter Geoghegan <p...@heroku.com> wrote:
> I attach a new revision.

I think that I may have missed a trick here.

It turns out that it isn't expensive to also hash original text
values, to track their cardinality using HyperLogLog - it's hardly
measurable when done at an opportune point just after strxfrm()
expansion. I was already tracking the cardinality of poor man's
normalized keys using HyperLogLog. If I track the cardinality of both
sets (original values and normalized keys), I can compare the two when
evaluating if normalization should be aborted. This is by far the most
important consideration.

This causes the optimization to be applied when sorting millions of
tuples with only a tiny number of distinct values (like, 4 or 5),
without making bad cases that we fail to abort in a timely manner any
more likely. This is still significantly profitable - over 90% faster
in one test, because the optimistic memcmp() still allows us to avoid
any strcoll() calls. It looks about the same as using the "C"
collation. Not quite the huge boost we can see, but still well

In general it seems well principled to have the "should I abort
normalization?" algorithm mostly weigh how effective a proxy for full
key cardinality normalized key cardinality is. If it is a good proxy
then nothing else matters. If it's not a very good proxy, that can
only be because there are many differences beyond the first 8 bytes.
Only then will we weigh the total number of distinct normalized keys,
and as the effectiveness of normalized key cardinality as a proxy for
overall cardinality falls, our requirements for the overall number of
distinct normalized keys shoots up rapidly.

Peter Geoghegan

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Reply via email to