[jira] [Commented] (DERBY-6607) Derby is using territory/collation for equality, not just ordering (incorrectly?)

Knut Anders Hatlen (JIRA) Wed, 11 Jun 2014 02:06:18 -0700

    [ 
https://issues.apache.org/jira/browse/DERBY-6607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14027562#comment-14027562
 ]


Knut Anders Hatlen commented on DERBY-6607:
-------------------------------------------

I believe Derby is behaving correctly by using collation for equality. 
SQL:2011, part 2, 9.11 Equality operations, syntax rule 4 says: "Let VS be the 
set of declared types of the operands of an equality operation. If VS comprises 
character string types, then the Syntax Rules of Subclause 9.15, “Collation 
determination”, are applied with VS as TYPESET; let the collation to be used in 
the equality operation be the COLL returned from the application of those 
Syntax Rules."

{quote}
In terms of collation (ordering), Japanese consider these to be equal.  So, in 
the following Java code, the call to 'compare()' would return 0:
{code:java}
Collator collator = Collator.getInstance(Locale.JAPAN);
collator.setStrength(Collator.PRIMARY);
return collator.compare("ケーキ", "けーき");
{code}
{quote}

FWIW, the above code returns -1 in my environment (Debian 7.5, JDK 8u5). It 
might be that the Japanese signs get garbled when I copy them from the bug 
report (they look okay to me, though). Unicode escape codes might be more 
reliable.

{quote}
The following SQL should delete both cats:
{code:sql}
DELETE FROM tags WHERE tag='cAT'
{code}

But from the Japanese perspective, the following code would _erroneously_ 
delete both cakes:
{code:sql}
DELETE FROM tags WHERE tag='ケーキ'
{code}
{quote}

In my environment, it seems to behave the way you want:

{noformat}
ij version 10.11
ij> connect 
'jdbc:derby:memory:db;create=true;territory=ja_JP;collation=TERRITORY_BASED:PRIMARY';
ij> create table tags(tag varchar(10));
0 rows inserted/updated/deleted
ij> insert into tags values 'Cat', 'cat', 'ケーキ', 'けーき';
4 rows inserted/updated/deleted
ij> delete from tags where tag = 'cAT';
2 rows inserted/updated/deleted
ij> delete from tags where tag='ケーキ';
1 row inserted/updated/deleted
ij> select * from tags;
TAG       
----------
けーき       

1 row selected
{noformat}

{quote}
They consider the two expressions of the word cake distinct, but consider the 
two cats as equivalent.  The Collator considers them all equivalent.  It is as 
if {{String.equals()}} should be used if the _lvalue_ _or_ _rvalue_ of an = 
operator is Japanese, and use {{Collator.equals()}} if the _lvalue_ _and_ 
_rvalue_ are "ascii-betical".
{quote}

Maybe you could achieve that by creating a custom collator which treats 
case-differences in Latin letters as secondary differences and 
hiragana/katakana as primary differences, or something of that sort. 
http://db.apache.org/derby/docs/10.10/devguide/tdevdvlpcustomcollation.html 
shows an example of how to create a custom collator.

> Derby is using territory/collation for equality, not just ordering 
> (incorrectly?)
> ---------------------------------------------------------------------------------
>
>                 Key: DERBY-6607
>                 URL: https://issues.apache.org/jira/browse/DERBY-6607
>             Project: Derby
>          Issue Type: Bug
>          Components: Localization
>    Affects Versions: 10.10.2.0
>            Reporter: Brett Wooldridge
>
> We have a database where we wish case-insensitivity, and therefore it was 
> created with collation=TERRITORY_BASED:PRIMARY.  We have customers in both 
> the United States (en_US) and in Japan (ja_JP).
> We have an issue in Japan.  Japanese has three character sets: hiragana, 
> katakana, and kanji.  Hiragana is a phonetic alphabet with 46 letters.  
> Katakana is an identical phonetic alphabet with 46 letters, written using 
> different character forms, and used for foreign words (words adopted from 
> other languages into Japanese).
> Here is the word 'cake' written in katakana: ケーキ (ke- ki)
> Here is the word 'cake' written in hiragana: けーき  (ke- ki)
> In terms of collation (ordering), Japanese consider these to be equal.  So, 
> in the following Java code, the call to 'compare()' would return 0:
> {code:java}
> Collator collator = Collator.getInstance(Locale.JAPAN);
> collator.setStrength(Collator.PRIMARY);
> return collator.compare("ケーキ", "けーき");
> {code}
> And therein lies the issue.  With respect to _ordering_ they are indeed 
> equivalent, however Japanese would consider them district  (non-equivalent) 
> values.
> When a table is declared with a UNIQUE constraint on a column, or a PRIMARY 
> KEY column, if 'ケーキ' exists in the table, Derby will throw a unique 
> constraint violation upon an attempt to insert 'けーき'.
> We need collation=TERRITORY_BASED:PRIMARY or TERRITORY_BASED:SECONDARY for 
> case-insensitivity _and_ at the same time need these values to be treated as 
> unique.  It is as if {{String.equals()}} should be used if the _lvalue_ or 
> _rvalue_ of an = operator is Japanese, but should use {{Collator.equals()}} 
> if both the _lvalue_ and _rvalue_ are "ascii-betical".  The same for 
> constraint checking.
> Is it "correct" that Derby use the collation when determining value 
> equivalency vs. ordering equivalency?
> At the same time, I understand that this is tricky.  Japanese has no 
> "upper-case" and "lower-case" for hiragana, katakana, or kanji, however they 
> do use "romanji" (roman characters) which are essentially ASCII, which is 
> case-sensitive.  Collation is merely used for ordering.  So when  
> TERRITORY_BASED:PRIMARY/SECONDARY is used, for Japanese, 'cat' and 'CAT' 
> would be equivalent but 'ケーキ' and 'けーき' _would not be_.  Unfortunately, there 
> is only one Collator and it will identify _both_ of these as equivalent.
> Taking the example further, imagine a database with 
> collation=TERRITORY_BASED:SECONDARY, and _tags_ table without a unique 
> constraint, but containing the following values:
> {code:java}
> Tag
> -----------------------
> Cat
> cat
> ケーキ
> けーき
> {code}
> The following SQL should delete both cats:
> {code:sql}
> DELETE FROM tags WHERE tag='cAT'
> {code}
> But from the Japanese perspective, the following code would _erroneously_ 
> delete both cakes:
> {code:sql}
> DELETE FROM tags WHERE tag='ケーキ'
> {code}
> They consider the two expressions of the word cake distinct, but consider the 
> two cats as equivalent.  The Collator considers them all equivalent.  It is 
> as if {{String.equals()}} should be used if the _lvalue_ _or_ _rvalue_ of an 
> = operator is Japanese, and use {{Collator.equals()}} if the _lvalue_ _and_ 
> _rvalue_ are "ascii-betical".



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (DERBY-6607) Derby is using territory/collation for equality, not just ordering (incorrectly?)

Reply via email to