Re: [Firebird-devel] Case (and accent) insensitive ICU collations in multiple columns

nonomura Fri, 19 Oct 2018 04:37:47 -0700

Adriano,

I've read your comment again, carefully, and understood what you were
thinking of as "the root of the problem".


But I could not understand how it relates to the case that I reported.

The ICU library generates sort keys for those three collations and
Firebird uses them. Firebird generates sort keys for collations other
than ICU's. Is this correct?


You were talking about the sort key generation in Firebird, right?
I was talking about ICU's.

That's why we were talking past each other, maybe?

Or do you mean that the sort key generated through the ICU library is
reorganized in Firebird? If so, please tell me how.

Anyway, before jumping to a conclusion, please refute this:

Therefore the problem seems to lie in one of the following, A or B:



Regards,
Hiro


(2018/10/19 0:59), Adriano dos Santos Fernandes wrote:

I really don't understand why you put so much emphasis on
case-/accent-insensitivity and say UNICODE is OK, as we can easily
demonstrate the root of the multi-level, multi-segment problem
with it.

Yes, there are inconsistencies with insensitive collations with/without
a unique index, but the root of the problem (non-interleaved keys) is
better demonstrated with a case-/accent-sensitive collation.


-- Natural order we expect. Case is less important than base letter.

SQL> create table t1 (c1 varchar(2) character set utf8 collate unicode);
SQL> insert into t1 values ('a1');
SQL> insert into t1 values ('a2');
SQL> insert into t1 values ('A1');
SQL> insert into t1 values ('A2');
SQL>
SQL>
SQL> select * from t1 order by c1;

C1
========
a1
A1
a2
A2


-- But when the string is split across multiple columns, case becomes
-- more important than we naturally expect.

SQL> create table t2 (c1 varchar(1) character set utf8 collate unicode,
n1 integer);
SQL> insert into t2 values ('a', 1);
SQL> insert into t2 values ('a', 2);
SQL> insert into t2 values ('A', 1);
SQL> insert into t2 values ('A', 2);
SQL>
SQL> select * from t2 order by c1, n1;

C1               N1
====== ============
a                 1
a                 2
A                 1
A                 2
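
The difference between the two result sets above can be reproduced with a toy model. This is only a sketch: `toy_sort_key` is a hypothetical helper, and its weights are invented for illustration rather than taken from ICU or Firebird. It assumes a two-level key in which all base-letter weights come first and all case weights come last.

```python
# Toy model of multi-level sort keys. The weights are invented for
# illustration and are NOT ICU's real collation elements.

def toy_sort_key(text):
    """Return (primary, tertiary): base letters first, case bits last."""
    primary = tuple(ch.lower() for ch in text)
    tertiary = tuple(1 if ch.isupper() else 0 for ch in text)
    return (primary, tertiary)

rows = ["a1", "a2", "A1", "A2"]

# One key for the whole string: case is consulted only after every base
# character has compared equal.
whole = sorted(rows, key=toy_sort_key)

# One key per column, concatenated: the case level of the first column is
# consulted before the second column is looked at.
split = sorted(rows, key=lambda s: (toy_sort_key(s[0]), int(s[1])))

print(whole)  # ['a1', 'A1', 'a2', 'A2']
print(split)  # ['a1', 'a2', 'A1', 'A2']
```

In this model the per-column keys are simply concatenated, so the case level of the first segment outranks the whole second segment, which matches the ordering seen for t2.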


Adriano


On 18/10/2018 14:37, nonomura wrote:

Hi there,

I would like to ask someone who is able to check the relevant part of
the source code for sorting with UTF8 + ICU collations, and to fix it
if a bug is found.

The symptom, and what I tested and confirmed, are described in comments
on CORE-5940. The following is a summary:

1. UNICODE_CI and UNICODE_CI_AI wrongly behave as if collate UNICODE
had been requested when ordering by multiple columns.
2. With collate UNICODE, the results were fine in every combination.
3. If the ordering columns matched a unique index, the sort was done
correctly in every combination for UNICODE_CI and UNICODE_CI_AI as well.
4. Firebird uses ICU's sort keys for collate UNICODE, UNICODE_CI and
UNICODE_CI_AI.
5. The sort key is composed of three buckets: Body, Case, Accent.
6. The sort key generated by the ICU library for UNICODE_CI_AI contains
only the body part, and for UNICODE_CI the body and the case.
7. With collate UNICODE_CI and UNICODE_CI_AI, Firebird evidently saves
and compares all three buckets, which is wrong, if the ordering columns
do not match a unique index.

Therefore the problem seems to lie in one of the following, A or B:

A) When sorting, the compare function must select the buckets to
compare according to the properties of the collation.
B) If the saved sort key is supposed to be compared in full by that
function regardless of the collation's properties, then those
properties must be taken into account when the sort key is created.
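
As a rough illustration of A and B, here is a toy sketch for the UNICODE_CI_AI case only. It is not Firebird's or ICU's code: `toy_buckets` is a hypothetical helper, the bucket contents are invented, and only the bucket names (body, case, accent) follow the list above.

```python
import unicodedata

def toy_buckets(text):
    """Return (body, case, accent) buckets for text; weights are invented."""
    decomposed = unicodedata.normalize("NFD", text)
    base = [c for c in decomposed if not unicodedata.combining(c)]
    body = tuple(c.lower() for c in base)
    case = tuple(c.isupper() for c in base)
    accent = tuple(c for c in decomposed if unicodedata.combining(c))
    return (body, case, accent)

words = ["b", "A", "\u00e1", "a"]  # "\u00e1" is a with acute accent

full_keys = {w: toy_buckets(w) for w in words}

# Reported symptom: full keys are stored and compared in full, so the
# variants of "a" are ordered by case and accent.
symptom = sorted(words, key=lambda w: full_keys[w])

# Option A: keep the full key, but compare only the body bucket.
option_a = sorted(words, key=lambda w: full_keys[w][:1])

# Option B: store only the body bucket, then compare the whole stored key.
body_keys = {w: toy_buckets(w)[:1] for w in words}
option_b = sorted(words, key=lambda w: body_keys[w])

print(symptom)
print(option_a == option_b)
```

In this model A and B give the same result: the three variants of "a" compare equal and keep their input order, while comparing the full key separates them.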

Thanks in advance.
Regards,
Hiro



Firebird-Devel mailing list, web interface at
https://lists.sourceforge.net/lists/listinfo/firebird-devel
