Hi,

I am currently working on DERBY-2967. This issue involves determining the
behavior of LIKE operation for character string types in a territory based
database. I plan on implementing the findings of DERBY-2967 but wanted to
provide a high-level behavior description here to see if anyone has any
feedback on the proposed behavior.

Unicode has a concept of Contraction where a user might perceive more than
one character as a single character in a given language. One eg of this
would be 'AA' in Norwegian locale. Although, this Contraction is made of
2 Unicode characters, 'A' and 'A', a Norwegian user perceives them as
a single character. In addition, in Norwegian, the collation elements for
'AA' are identical to collation elements for 'Å'. So, the question is what
should SQL operation 'AA' like 'Å'  return? Also, is that behavior same
as SQL operation 'AA' = 'Å' ?

Say there is a table t1 with 2 rows in it.

Now taking an eg based on table t1 for LIKE
select * from t1 where 'AA' like 'Å'
The query above will return no rows because as per SQL standard, LIKE does
not treat Contractions as one single character. Instead, we work with one
Unicode character at a time in LIKE. So, for the given eg, the first Unicdoe
character in 'AA' which is 'A' does not match with 'Å' and hence no rows
would be returned.

Now taking an eg based on table t1 for =
select * from t1 where 'AA' = 'Å'
The query above would return 2 rows because as per SQL standard, in =
operation, we take the entire character string and use the locale to
generate the collation elements for that string. And then we compare the
collation elements for the 2 sides of the = operation. Since the 'AA' and
'Å' both result into identical collation elements, the where clause ends up
being TRUE and hence we return 2 rows.

Expanding further on LIKE operation, the metacharacter _ will match exactly
one Unicode character(and not Contractions). For eg select * from t1 where
'AA' like '_' will return 0 rows because _ consumes only the first A in 'AA'
and we still have one A left which is not consumed by the pattern. In order
for the WHERE clause to be true, the SQL should look like select * from t1
where 'AA' like '__'. Same rule applies for metacharacter %. It will treat
'AA' in LIKE as 2 distinct characters. For eg, select * from t1 where 'AAB'
like '%AB' will return 2 rows because % consumed only the first A (and not
the Contraction AA). Another eg select * from t1 where 'BAAC' like 'BA%C'
will return 2 rows because % consumed the 2nd A.

I hope this mail helps understand the behavior of LIKE in a territory based
database. I plan to work on implementing this behavior in next few days but
please let me know if anyone has any questions that I can help clarify.

thanks,
Mamta

Reply via email to