Re: [HACKERS] strcmp() tie-breaker for identical ICU-collated strings

Amit Khandekar Thu, 08 Jun 2017 23:29:08 -0700

On 2 June 2017 at 23:52, Peter Geoghegan <p...@bowt.ie> wrote:
> On Fri, Jun 2, 2017 at 10:34 AM, Amit Khandekar <amitdkhan...@gmail.com> wrote:
>> Ok. I was thinking we are doing the tie-breaker because specifically
>> strcoll_l() was unexpectedly returning 0 for some cases. Now I get it,
>> that we do that to be compatible with texteq().
>
> Both of these explanations are correct, in a way. See commit 656beff.
>
>> Secondly, I was also considering if ICU especially has a way to
>> customize an ICU locale by setting some attributes which dictate
>> comparison or sorting rules for a set of characters. I mean, if there
>> is such customized ICU locale defined in the system, and we use that
>> to create PG collation, I thought we might have to strictly follow
>> those rules without a tie-breaker, so as to be 100% conformant to ICU.
>> I can't come up with an example, or maybe there isn't one, but, say,
>> there is a locale which is supposed to sort only by lowest comparison
>> strength (de@strength=1 ?? ). In that case, there might be many
>> characters considered equal, but PG < operator or > operator would
>> still return true for those chars.
>
> In the terminology of the Unicode collation algorithm, PostgreSQL
> "forces deterministic comparisons" [1]. There is a lot of information
> on the details of that within the UCA spec.
>
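
To make the "forced deterministic comparison" idea concrete, here is a
minimal sketch, not the actual PostgreSQL code. collate_ignore_case() is a
hypothetical stand-in for a low-strength collator (it is not an ICU or
PostgreSQL function, and it handles ASCII only); deterministic_cmp() is
likewise an illustrative name. The point is only the shape: ask the
collation first, and use strcmp() to break ties.

```c
#include <ctype.h>
#include <string.h>

/*
 * Hypothetical stand-in for a low-strength collator: compares two ASCII
 * strings while ignoring case. It only plays the role of a collation that
 * can report a tie (0) for strings that are not byte-wise identical.
 */
static int
collate_ignore_case(const char *a, const char *b)
{
    while (*a && *b)
    {
        int ca = tolower((unsigned char) *a);
        int cb = tolower((unsigned char) *b);

        if (ca != cb)
            return (ca < cb) ? -1 : 1;
        a++;
        b++;
    }
    if (*a)
        return 1;               /* a is longer, so it sorts after b */
    if (*b)
        return -1;              /* b is longer, so it sorts after a */
    return 0;                   /* tie at this comparison strength */
}

/*
 * Deterministic comparison: trust the collation's ordering, and fall back
 * to a byte-wise strcmp() only when the collation reports a tie.
 */
static int
deterministic_cmp(const char *a, const char *b)
{
    int result = collate_ignore_case(a, b);

    if (result == 0)
        result = strcmp(a, b);
    return result;
}
```

With this sketch, collate_ignore_case("abc", "ABC") reports a tie, while
deterministic_cmp("abc", "ABC") is nonzero, so < or > on that pair still
returns true. Only byte-wise identical strings compare as equal.
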
> If we ever wanted to offer a case insensitive collation feature, then
> we wouldn't necessarily have to do the equivalent of a full strxfrm()
> when hashing, at least with collations controlled by ICU. Perhaps we
> could instead use a collator whose UCOL_STRENGTH is only UCOL_PRIMARY
> to build binary sort keys, and leave the rest to a ucol_equal() call
> (within texteq()) that has the usual UCOL_STRENGTH for the underlying
> PostgreSQL collation.
>
> I don't think it would be possible to implement case insensitive
> collations by using some pre-existing ICU collation that is case
> insensitive. Instead, an implementation might directly vary collation
> strength of any given collation to achieve case insensitivity.
> PostgreSQL would know that this collation was case insensitive, so
> regular collations wouldn't need to change their
> behavior/implementation (to use ucol_equal() within texteq(), and so
> on).
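
A rough sketch of the hashing idea quoted above, under loudly stated
assumptions: it does not call ICU at all. hash_primary() stands in for
hashing a sort key built at UCOL_PRIMARY strength, and equal_ignore_case()
stands in for a ucol_equal() call at the collation's own strength. Both
names are hypothetical, and the apostrophe-as-accent convention exists
only so that an ASCII-only example has something coarser than case to
ignore.

```c
#include <ctype.h>
#include <stdint.h>

/*
 * Hypothetical convention for this sketch only: an apostrophe stands in
 * for an accent on the preceding letter ("e'" plays the role of an
 * accented e).
 */
#define ACCENT_MARK '\''

/*
 * Hash a "primary strength" view of the string: case is folded and accent
 * marks are skipped, so the hash sees base letters only. 32-bit FNV-1a is
 * used purely for brevity.
 */
static uint32_t
hash_primary(const char *s)
{
    uint32_t h = 2166136261u;

    for (; *s; s++)
    {
        if (*s == ACCENT_MARK)
            continue;
        h ^= (uint32_t) tolower((unsigned char) *s);
        h *= 16777619u;
    }
    return h;
}

/*
 * Equality at the collation's own, finer strength: case is ignored, but
 * accent marks are significant.
 */
static int
equal_ignore_case(const char *a, const char *b)
{
    for (; *a && *b; a++, b++)
    {
        if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
            return 0;
    }
    return *a == *b;            /* true only if both strings ended */
}
```

The property that matters is one-directional: any two strings that
equal_ignore_case() accepts also produce the same hash_primary() value, so
the hash may be coarser than equality but never finer. Here "Resume" and
"RESUME" hash alike and compare equal, while "resume'" and "resume" share
a hash and are then told apart by the equality check.
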


Ah ok. Understood, thanks.


Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


