On 28 Jul 2014, at 5:23 PM, Peter Geoghegan wrote:
> On Mon, Jul 28, 2014 at 5:14 PM, Wim Lewis <w...@omnigroup.com> wrote:
>> A quick glance at OSX's strxfrm() suggests they're using an implementation 
>> of strxfrm() from FreeBSD. You can find the source here:
>> http://www.opensource.apple.com/source/Libc/Libc-997.90.3/string/FreeBSD/strxfrm.c
>> (and a really quick glance at the contents of libc on OSX 10.9 reinforces 
>> this--- I don't see any calls into their CoreFoundation unicode string APIs.)
> Something isn't quite accounted for, then. The FreeBSD behavior is to
> append the primary weights only. That makes their returned blobs
> smaller than those you'll see on Linux, but also appears to imply that
> their implementation is substandard (The PostgreSQL port uses ICU on
> FreeBSD for a reason, I suppose). But FreeBSD did not add extra,
> redundant "header bytes" right in the primary level when I tested it,
> but I'm told Mac OS X does.

I don't think OSX actually does. From a look at the source and a simple test 
program, OSX's strxfrm represents its output as a series of 24-bit weights, 
each one encoded into four bytes. Multiple "levels" of weights are separated by 
a weight of 0 (which is encoded as the four bytes "0000").

Robert Haas' message in this thread on 7 April is the first mention of a 
header, but his examples from 12 June don't really demonstrate a header--- 
they're completely consistent with the description above and the published 
source code.

OSX's strxfrm output is very space-inefficient, especially for Latin text, 
where none of the weights seem to be greater than 2^12, meaning that half of 
the bytes in any output are always going to be 0x30. (Testing with, e.g., 
Hangul characters I find some weights that use more of the space.) But as far 
as I can tell it's not completely crazy. :)

What this means is that on OSX, comparing the first 8 bytes of strxfrm output 
will end up comparing the primary weights of the first two characters in the 
original string. Which is the same conclusion Robert Haas came to earlier, 
modulo the "header bytes" interpretation.

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Reply via email to