On Thu, Jun 12, 2014 at 2:09 PM, Peter Geoghegan <p...@heroku.com> wrote:
> Thanks for looking into this.

Is anyone going to look at this?

I attach a new revision. The only real change to the code is that I
fixed an open item concerning what to do on WIN32 with UTF-8, where
the UTF-16 hacks that we do there cannot be worked around to make the
optimization still work, and yet we're still obligated to set up a
sort support state since there is a cataloged sort support function
(unfortunately I wasn't able to test it on Windows since I no longer
own a Windows machine, but it should be fine). I set up a comparison
shim within varlena.c for that case only. This is a kludge, but I
decided that it was better than preparing the sort support
infrastructure to not get a sane sort support state from all opclasses
just because of some platform-specific issue. I think that's justified
by the fact that it's very unlikely to happen again. text is a
datatype that is unusually tied to the operating system. This does
necessitate having the sort support struct record which sort operator
was originally used to lookup the sort support function, but it seems
reasonable to store that in all instances.

What may be of more interest to reviewers is the revised AC_TRY_RUN
test program that "configure" consults. While the code is unchanged,
there is now a detailed rationale for why it's reasonable to expect
that a significant amount of entropy will be concentrated in the first
8 bytes of a strxfrm() blob, with reference to how those algorithms
are more or less required to behave by the Unicode consortium. The
basic reason is that the algorithm for building binary sort keys
(described by a Unicode standard [1] defining the behavior of Unicode
collation algorithms, and implemented by strxfrm()) specifically works
by storing primary level, secondary level and tertiary level weights
in turn in the returned blob. There may be additional levels, too. As
the standard says:

The primary level (L1) is for the basic sorting of the text, and the
non-primary levels (L2..Ln) are for adjusting string weights for other
linguistic elements in the writing system that are important to users
in ordering, but less important than the order of the basic sorting.

Robert pointed out a case where varying character case of an English
word did not alter the primary level bytes (and thus the poor man's
normalized key was the same). He acknowledged that most of the entropy
of the first 8 bytes of the string went into the first 8 bytes of the
blob/key. This can actually be very useful to the optimization in some
cases. In particular, with most Latin alphabets you'll see the same
pattern when diacritics are used. This means that even though the
original string has (say) an accented character that would take 2
bytes to store in UTF-8, the weight in the primary level is the same
as an equivalent unaccented character (and so only takes one byte to
store at that level, with differences only in subsequent levels).
Whole strxfrm() blobs are the same length regardless of how many
accents appear in otherwise equivalent original Latin string, and you
get exactly the same high concentrations of entropy in the first 8
bytes in pretty much all Latin alphabets (the *absence* of diacritics
is stored in later weight levels too, even with the "en_US.UTF-8"
collation). As the standard says:

By default, the algorithm makes use of three fully-customizable
levels. For the Latin script, these levels correspond roughly to:

alphabetic ordering
diacritic ordering
case ordering.

A final level may be used for tie-breaking between strings not
otherwise distinguished.

In practice a huge majority of comparisons are expected to be resolved
at the primary weight level (in the case of a straight-up sort of
complete, all-unique strxfrm() blobs), which at least in the case of
glibc appear to require (at most) as many bytes to store as an
original UTF-8 string did. Performance of a sort using a
strcoll()-based collation could be *faster* than the "C" location with
this patch if there were plenty of Latin characters with diacritics.
The diacritics would effectively be removed from the poor man's
normalized key representations, and would only be considered when a
tie-breaker is required.

[1] http://www.unicode.org/reports/tr10/
Peter Geoghegan

Attachment: poorman-hll-win32.2014_07_26.patch.gz
Description: GNU Zip compressed data

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Reply via email to