Hi, I am currently working on DERBY-2967. This issue involves determining the behavior of LIKE operation for character string types in a territory based database. I plan on implementing the findings of DERBY-2967 but wanted to provide a high-level behavior description here to see if anyone has any feedback on the proposed behavior.
Unicode has a concept of Contraction where a user might perceive more than one character as a single character in a given language. One eg of this would be 'AA' in Norwegian locale. Although, this Contraction is made of 2 Unicode characters, 'A' and 'A', a Norwegian user perceives them as a single character. In addition, in Norwegian, the collation elements for 'AA' are identical to collation elements for 'Å'. So, the question is what should SQL operation 'AA' like 'Å' return? Also, is that behavior same as SQL operation 'AA' = 'Å' ? Say there is a table t1 with 2 rows in it. Now taking an eg based on table t1 for LIKE select * from t1 where 'AA' like 'Å' The query above will return no rows because as per SQL standard, LIKE does not treat Contractions as one single character. Instead, we work with one Unicode character at a time in LIKE. So, for the given eg, the first Unicdoe character in 'AA' which is 'A' does not match with 'Å' and hence no rows would be returned. Now taking an eg based on table t1 for = select * from t1 where 'AA' = 'Å' The query above would return 2 rows because as per SQL standard, in = operation, we take the entire character string and use the locale to generate the collation elements for that string. And then we compare the collation elements for the 2 sides of the = operation. Since the 'AA' and 'Å' both result into identical collation elements, the where clause ends up being TRUE and hence we return 2 rows. Expanding further on LIKE operation, the metacharacter _ will match exactly one Unicode character(and not Contractions). For eg select * from t1 where 'AA' like '_' will return 0 rows because _ consumes only the first A in 'AA' and we still have one A left which is not consumed by the pattern. In order for the WHERE clause to be true, the SQL should look like select * from t1 where 'AA' like '__'. Same rule applies for metacharacter %. It will treat 'AA' in LIKE as 2 distinct characters. For eg, select * from t1 where 'AAB' like '%AB' will return 2 rows because % consumed only the first A (and not the Contraction AA). Another eg select * from t1 where 'BAAC' like 'BA%C' will return 2 rows because % consumed the 2nd A. I hope this mail helps understand the behavior of LIKE in a territory based database. I plan to work on implementing this behavior in next few days but please let me know if anyone has any questions that I can help clarify. thanks, Mamta
