Re: [HACKERS] Dealing with collation and strcoll/strxfrm/etc

Peter Geoghegan Mon, 28 Mar 2016 11:34:42 -0700

On Mon, Mar 28, 2016 at 7:57 AM, Stephen Frost <[email protected]> wrote:
> If we're going to talk about minimum requirements, I'd like to argue
> that we require whatever system we're using to have versioning (which
> glibc currently lacks, as I understand it...) to avoid the risk that
> indexes will become corrupt when whatever we're using for collation
> changes.  I'm pretty sure that's already bitten us on at least some
> RHEL6 -> RHEL7 migrations in some locales, even forgetting the issues
> with strcoll vs. strxfrm.


I totally agree that anything we adopt should support
versioning. Glibc does have a non-standard versioning scheme, but we
don't use it. Other stdlibs may do versioning another way, or not at
all. In a world in which ICU is the de facto standard for Postgres (i.e.
the actual standard on all major platforms), we mostly just have one
thing to target, which seems like something to aim for.

Collations change from time to time, legitimately. Read from
"Collation order is not fixed", here:

http://unicode.org/reports/tr10/#Stability

The question is only how we deal with this when it happens. One thing
that's attractive about ICU is that it makes this explicit, both for
the logical behavior of a collation and for the stability of
binary sort keys (glibc's versioning seemingly just does the former).
So the equivalent of strxfrm() output has license to change for
technical reasons that are orthogonal to the practical concerns of
end-users about how text sorts in their locale. ICU is clear on what
it takes to make binary sort keys in indexes work. And various major
database systems rely on this being right.
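
To make the versioning idea concrete, here is a minimal sketch of how an
index could remember the collation version it was built with and notice a
later change. Everything in it is an assumption for illustration: the
struct, the function names, and the idea that the version is available as
a short string are all hypothetical. With ICU the value would presumably
be derived from something like ucol_getVersion(), but the sketch does not
call ICU.

```c
#include <stdio.h>
#include <string.h>

#define COLLVERSION_LEN 32   /* hypothetical size of the stored version string */

/* Hypothetical per-index metadata: the collation version seen at build time. */
typedef struct IndexCollationMeta
{
    char collversion[COLLVERSION_LEN];
} IndexCollationMeta;

/* Remember the provider's collation version when the index is built. */
static void
record_collation_version(IndexCollationMeta *meta, const char *current)
{
    /* snprintf always NUL-terminates; an over-long version is truncated,
     * which later shows up as a mismatch, so the check errs on the safe side. */
    snprintf(meta->collversion, sizeof meta->collversion, "%s", current);
}

/* Nonzero if the provider now reports a different version than the one
 * recorded, meaning the stored sort order can no longer be trusted. */
static int
index_needs_rebuild(const IndexCollationMeta *meta, const char *current)
{
    return strcmp(meta->collversion, current) != 0;
}
```

The comparison is deliberately a plain equality test: any difference at
all is treated as a reason to rebuild, since there is no safe way to guess
which version changes leave the ordering intact.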

> Regarding key abbreviation and performance, if we are confident that
> strcoll and strxfrm are at least independently internally consistent
> then we could consider offering an option to choose between them.

I think they just need to match, per the standard. After all,
abbreviation will sometimes require strcoll() tie-breakers.
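
As a rough illustration of why the two must agree, here is a minimal
sketch of an abbreviated-key comparison. It is not the actual Postgres
code; ABBREV_LEN, make_abbrev_key() and abbrev_compare() are hypothetical
names. The key is just the first few bytes of the strxfrm() blob, and
whenever two keys compare equal the result comes from strcoll() instead.

```c
#include <stdlib.h>
#include <string.h>

#define ABBREV_LEN 8   /* hypothetical abbreviated-key width in bytes */

/* Store the first ABBREV_LEN bytes of the strxfrm() blob for s in key,
 * zero padded. Returns 1 on success, 0 if memory could not be allocated. */
static int
make_abbrev_key(const char *s, unsigned char key[ABBREV_LEN])
{
    size_t need = strxfrm(NULL, s, 0) + 1;   /* full blob length plus NUL */
    char  *buf = malloc(need);
    size_t len;

    if (buf == NULL)
        return 0;
    strxfrm(buf, s, need);
    len = strlen(buf);
    memset(key, 0, ABBREV_LEN);
    memcpy(key, buf, len < ABBREV_LEN ? len : ABBREV_LEN);
    free(buf);
    return 1;
}

/* Compare via abbreviated keys; fall back to strcoll() when the keys tie
 * (or when a key could not be built). */
static int
abbrev_compare(const char *a, const char *b)
{
    unsigned char ka[ABBREV_LEN];
    unsigned char kb[ABBREV_LEN];
    int           cmp;

    if (!make_abbrev_key(a, ka) || !make_abbrev_key(b, kb))
        return strcoll(a, b);

    cmp = memcmp(ka, kb, ABBREV_LEN);
    if (cmp != 0)
        return cmp;            /* resolved by the strxfrm() prefix alone */
    return strcoll(a, b);      /* tie-breaker uses the other function */
}
```

Some comparisons are decided by strxfrm() output and others by strcoll(),
so a sort built this way is only well defined if the two functions order
every pair of strings the same way.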

Clearly it would be very naive to imagine that ICU is bug-free.
However, I surmise that there is a large difference in how ICU and glibc
think about things like strxfrm() or strcoll() stability and
consistency. Tom was able to demonstrate that strxfrm() and strcoll()
behaved inconsistently without too much effort, contrary to POSIX, and
in many common cases. I doubt that the glibc maintainers are all that
concerned about it. Certainly, less concerned than they are about the
latest security bug. Whereas if this happened in ICU, it would be a
total failure of the project to fulfill its most basic goals. Our
disaster would also be a disaster for several other major database
systems. ICU carefully and explicitly considers multiple forms of
stability, "deterministic" sort ordering, etc. That *is* a big
difference, and it makes me optimistic that there'd be far fewer
problems.
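
For reference, the kind of inconsistency described above can be checked
for a single pair of strings with something like the following. This is
only a sketch, not Tom's actual test; sign_of(), xfrm_dup() and
collation_consistent() are hypothetical helper names. It uses whatever
locale the program has set (the "C" locale unless setlocale() is called
first).

```c
#include <stdlib.h>
#include <string.h>

/* Reduce a comparison result to -1, 0 or +1. */
static int
sign_of(int v)
{
    return (v > 0) - (v < 0);
}

/* Return a malloc'd strxfrm() blob for s, or NULL if allocation fails. */
static char *
xfrm_dup(const char *s)
{
    size_t need = strxfrm(NULL, s, 0) + 1;
    char  *buf = malloc(need);

    if (buf != NULL)
        strxfrm(buf, s, need);
    return buf;
}

/* Returns 1 if strcoll(a, b) and strcmp() on the strxfrm() blobs agree
 * on the ordering of a and b, 0 if they disagree, -1 on allocation failure. */
static int
collation_consistent(const char *a, const char *b)
{
    char *xa = xfrm_dup(a);
    char *xb = xfrm_dup(b);
    int   result = -1;

    if (xa != NULL && xb != NULL)
        result = (sign_of(strcoll(a, b)) == sign_of(strcmp(xa, xb)));
    free(xa);
    free(xb);
    return result;
}
```

A result of 0 for any pair under a given locale means binary sort keys
and strcoll() cannot safely be mixed for that locale.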

I also think that ICU could be a reasonable basis for case-insensitive
collations, which would let us kill citext, a module that I consider
to be a total kludge. And, we might also be able to lock down WAL
compatibility, which would be generally useful.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
