Re: [HACKERS] strcmp() tie-breaker for identical ICU-collated strings

Peter Geoghegan Fri, 02 Jun 2017 11:23:46 -0700

On Fri, Jun 2, 2017 at 10:34 AM, Amit Khandekar <amitdkhan...@gmail.com> wrote:
> Ok. I was thinking we are doing the tie-breaker because specifically
> strcoll_l() was unexpectedly returning 0 for some cases. Now I get it,
> that we do that to be compatible with texteq().


Both of these explanations are correct, in a way. See commit 656beff.

> Secondly, I was also considering if ICU especially has a way to
> customize an ICU locale by setting some attributes which dictate
> comparison or sorting rules for a set of characters. I mean, if there
> is such customized ICU locale defined in the system, and we use that
> to create PG collation, I thought we might have to strictly follow
> those rules without a tie-breaker, so as to be 100% conformant to ICU.
> I can't come up with an example, or maybe there isn't one, but, say,
> there is a locale which is supposed to sort only by lowest comparison
> strength (de@strength=1 ?? ). In that case, there might be many
> characters considered equal, but PG < operator or > operator would
> still return true for those chars.

In the terminology of the Unicode collation algorithm, PostgreSQL
"forces deterministic comparisons" [1]. There is a lot of information
on the details of that within the UCA spec.
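
As a rough illustration of what that means, here is a minimal sketch in
plain C. It is not PostgreSQL's actual code: it uses the C library's
strcoll() instead of ICU, and the name compare_with_tiebreak() is made
up for this example. When the locale-aware comparison reports a tie,
a bytewise strcmp() decides, so only byte-identical strings ever
compare as equal.

```c
#include <string.h>

/*
 * Hypothetical helper: compare two NUL-terminated strings using the
 * current LC_COLLATE locale, then break ties bytewise.
 */
int
compare_with_tiebreak(const char *a, const char *b)
{
    int result = strcoll(a, b);

    /* The collation sees no difference: fall back to the raw bytes. */
    if (result == 0)
        result = strcmp(a, b);

    return result;
}
```

With this shape, a collation that treats two distinct strings as equal
still yields a nonzero result for them, which keeps the ordering
consistent with a bytewise equality test.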

If we ever wanted to offer a case insensitive collation feature, then
we wouldn't necessarily have to do the equivalent of a full strxfrm()
when hashing, at least with collations controlled by ICU. Perhaps we
could instead use a collator whose UCOL_STRENGTH is only UCOL_PRIMARY
to build binary sort keys, and leave the rest to a ucol_equal() call
(within texteq()) that has the usual UCOL_STRENGTH for the underlying
PostgreSQL collation.
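
To make the shape of that idea concrete, here is a small sketch in
plain C. It does not call ICU at all: ASCII case folding stands in for
a primary-strength sort key, and insensitive_equal() stands in for the
ucol_equal() step. All of the names are hypothetical. The only property
it tries to show is that strings the equality check accepts always
hash to the same value, while the equality check keeps the final say.

```c
#include <ctype.h>
#include <stdint.h>

/* Stand-in for a primary-strength sort key: fold ASCII case per byte. */
static unsigned char
coarse_byte(char c)
{
    return (unsigned char) tolower((unsigned char) c);
}

/* FNV-1a hash over the folded bytes, so "abc" and "ABC" hash alike. */
uint32_t
coarse_hash(const char *s)
{
    uint32_t h = 2166136261u;

    for (; *s != '\0'; s++)
    {
        h ^= coarse_byte(*s);
        h *= 16777619u;
    }
    return h;
}

/* Stand-in for the ucol_equal() step: decides equality after hashing. */
int
insensitive_equal(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0')
    {
        if (coarse_byte(*a) != coarse_byte(*b))
            return 0;
        a++;
        b++;
    }
    /* Equal only if both strings ended at the same position. */
    return *a == *b;
}
```

A hash lookup would use coarse_hash() to pick the bucket and then
insensitive_equal() to confirm each candidate, so the hash key is free
to be coarser than the equality rule but never finer.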

I don't think it would be possible to implement case insensitive
collations by using some pre-existing ICU collation that is case
insensitive. Instead, an implementation might directly vary collation
strength of any given collation to achieve case insensitivity.
PostgreSQL would know that this collation was case insensitive, so
regular collations wouldn't need to change their
behavior/implementation (to use ucol_equal() within texteq(), and so
on).

[1] http://unicode.org/reports/tr10/#Forcing_Deterministic_Comparisons
-- 
Peter Geoghegan

