[jira] Updated: (DERBY-2967) Single character does not match high value unicode character with collation TERRITORY_BASED

Mamta A. Satoor (JIRA) Tue, 02 Oct 2007 13:26:21 -0700

     [ 
https://issues.apache.org/jira/browse/DERBY-2967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Mamta A. Satoor updated DERBY-2967:
-----------------------------------

    Attachment: DERBY2967_offset_based_stat_Oct02_07.txt
                DERBY2967_offset_based_diff_Oct02_07.txt

Attaching patch DERBY2967_offset_based_diff_Oct02_07.txt which fixes DERBY-2967 
for both IBM's jdks and Sun's jdks. The fix involves using 
CollationElementIterator to look through collation elements of a character 
string as needed rather than fetching them all into an array before hand. With 
the later approach, we may end up getting all the collation elements for a 
really large string when we will be looking at say only first few of those 
collation elements because we have found a mismatch between value and pattern.

The majority of the changes have gone into iapi.types.Like class. This class 
has two different implementation of SQL LIKE clause. One implementation is for 
UCS_BASIC character strings and the other one is for territory based character 
strings. The territory based implementation of LIKE clause has been changed in 
this patch. From a top level, this method does one to one comparison of 
collation elements for string that is being searched and the pattern that is 
being searched in that string. This one-to-one comparison is done for 
non-metacharacters in the pattern. As soon as we find a metacharacter in the 
pattern string, we do special processing depending on what kind of 
metacharacter we are dealing with.
1)Taking the simplest case of escape character as the metacharacter. If the 
user has provided an escape character, then as soon as we encounter the escape 
character in the pattern string, we check if the character following it is a 
metacharacter or
not. If not, we throw an exception. If the character indeed is a metacharacter, 
we convert that metacharacter into it's collation element(s) and look for the 
exact match of those collation element(s) into the CollationElementIterator for 
the value string.
2)Taking the next simplest case of _. When we find a _ in the pattern string, 
we advance in the CollationElementIterator for the value string by one 
character. This is where it gets tricky ie what is defined as one character in 
a particular locale. In Norwegian locale, 'b' is a single character and so are 
'aa'. Both of these character strings translate into single collation element. 
But in Norwegian locale, evne though '\uFA2D' is one character, it translates 
into 2 collation elements. This advancing by one character and eating the right 
number of collation elements is implemented in a new method in Like.java class 
and the new method's name is advnaceByOne(CollationElementIterator).
3)Moving on to metacharacter %. A % can be satisfied by 0 to any number of 
characters in the value string. In order to determine how many characters will 
satisfy %, we have to start with 0 character and keep eating more and more 
characters until we find a match for the remaining pattern string and value 
string or until we reach end of value string but still non % characters left in 
pattern string. In the later case, we know we have ended up with a mismatch and 
we return a FALSE from the method.

The javadoc(from Like.java's like(CollationElementIterator, int, String, int, 
String, RuleBasedCollator)) describes the implementation in little more detail 
and I have copied that here for reference. Note that in the following method, 
valueIterator is the CollationElementIterator for the character string that we 
are doing a search into.


****************beginning of javadoc*****************
         * This method will be called for character string types with territory
         * based collation to see if valueIterator matches the passed pattern
         * string. The logic of the method is described in following steps (I
         * have tried to put the step identifiers in the code below for 
reference).
         * A)If pattern string or value Iterator is null, then this method will
         *   return null because the results of LIKE can't be established in 
such
         *   a situation.
         * B)Start the loop
         *   a)Check the lengths of valueIterator and pattern string to see if 
it
         *     is time to return with TRUE/FALSE. The exact details of these 
checks 
         *     can be found in the javadoc method of 
         *       checkLengths(CollationElementIterator, String, int)
         *   b)Start looking at pattern where the pointer is pointing and keep
         *     going until you find end of pattern or you find one of the 
         *     metacharacters ie %, * or optional escape character. 
         *   c)Get a CollationElementIterator for the non-metacharacters found 
in 
         *     step b(using the Collator passed to this method. The same 
Collator
         *     was used to construct valueIterator).
         *   d)Make sure that collation elements found in step c) match the 
         *     collation elements in valueIterator. A mismatch would require us 
to 
         *     return FALSE from this method.
         *   e)Check again the lengths of valueIterator and pattern string to 
see 
         *     if it is time toreturn with TRUE/FALSE. The exact details of 
these  
         *     checks can be found in the javadoc method of 
         *       checkLengths(CollationElementIterator, String, int)
         *   f)If we are still here and didn't return from this method as a 
result
         *     of step Be) then it means that we have to deal with the 
         *     metacharacter found in step Bc). Check what metacharacter is the 
         *     offset in pattern pointing to
         *     1)If user has provided escape character and pattern is pointing 
to
         *       one right now, then convert the next character in pattern to 
it's 
         *       collation element(s) and compare those collation elements to 
         *       elements in valueIterator. If they do not match, we need to 
return 
         *       FALSE. 
         *     2)If it is not escape character, then check if it is a '%'. If 
it is,
         *       then increment the pointer in the pattern string by 1 and then 
         *       follow the involved algorithm below
         *       First check if we have reached the end of pattern. If yes, 
then we 
         *       can simply return from this method with TRUE return value. If 
we 
         *       have not reached end of pattern, then check if rest of the 
         *       characters in pattern are all '%'. If yes, then we can simply 
         *       return from this method wil TRUE return value. If not all %, 
then
         *       take rest of the pattern string and see if it matches rest of 
the 
         *       valueIterator(will be implemented by recursively calling this 
         *       method). If no match, then do the step Ba). If it is not time 
to
         *       return because of step Ba), then advance the pointer in 
         *       valueIterator by one character and see if the valueIterator 
now 
         *       matches the rest of the pattern string. Keep going until we 
find 
         *       the match or mismatch.
         *     3)If it is not escape character or %, then check if it is a _. 
If 
         *       yes, then skip all the collation elements in valueIterator 
         *       corresponding to the next character.
         *   g)Go back to step B).
****************end of javadoc*****************

I have changed SQLChar and WorkHorseForCollatorDatatypes to call this new 
method for territory based character string types in Like rather than the old 
implementation (I have removed the old implementation code from Like.java)

I have added a new test in CollationTest. The existing tests for LIKE in 
CollationTest2 have been very handy during my testing of the code changes.

In addition to the above tests, I have run full collation test under Norwegian 
territory. Following is the analysis of some of the test failures and fixes to 
subset of them.

1)There are few existing tests that use character string 'aa' in their testing. 
These existing test were written to run in UCS_BASIC collation and hence 'aa' 
didn't cause a problem. But now when we run the full collation tests with say 
Norwegian territory based locale, the character string 'aa' exhibit different 
behavior(because in Norwegian, it is treated as one character) and hence give 
the false impression of test failures. To avoid having to scan these false 
failures every time one runs full
collation tests with Norwegian territory, I have changed the test data in some 
tests. This has bring the test failures far lower when all the tests are run 
with territory based locale rather than UCS_BASIC.

2)Few other failures in full collation test will be the error message string 
mismatch when running the tests in UCS_BASIC 
vs territory based. For eg, LOB.out has error messages like following
ERROR 42818: Comparisons between 'CHAR (UCS_BASIC)' and 'CLOB (UCS_BASIC)' are 
not supported. Types must be comparable. String types must also have matching 
collation. If collation does not match, a possible solution is to cast operands 
to force them to the default collation (e.g. select tablename from 
sys.systables where CAST(tablename as VARCHAR(128)) = 'T1')
This error message will have TERRITORY BASED rather than UCS_BASIC in it's text 
when the test is run in territory based. So even though it might look like a 
test failure when LOB test in run in territory based locale, it is really not a 
test failure. I think there are 2 tests which will show diffs for this reason. 
They are LOB.sql and implicitConversoins.

3)The diff in orderby.sql is also genuine because with UCS_BASIC database, 
"Canada" sorts before "anaconda" but with 
TERRITORY_BASED database, "anaconda" sorts before "Canada". 

4)The diffs in Nist tests dml068 and dml079 look genuine too because data is 
getting sorted in different order with TERRITORY_BASED database and UCS_BASIC 
database.

5)DataSourceTest checks the format of Connection.toString and in case of a 
database created with terriotry based collation, the format of database string 
is alphabets/alphabets where as the test is looking for just alphabets for the 
database name.

6)The diff in InListMultiProbeTest is expected because in a territory based 
database _ sorts before '1' whereas in a UCS_BASIC database, _ sorts after '1'. 
eg create and load a table using following script in both kinds of databases.
create table dellater(c1 char(1));
insert into dellater values('1'),('_'),('2');
Now, the result of following query differs in 2 kinds of databases
select * from dellater where c1 >= '_' order by c1;
TERRITORY BASED db
C1
----
_
1
2
3 rows selected

UCS_BASIC db
C1
----
_
1 row selected

7)Other than the ones mentioned above, there are few tests failing with assert 
exception but I do not think they are related to
changes that are in this patch. Kathey, I think you have run full collation 
tests in the past. I wonder if you had seen then 
assert failures even before applying this current patch of mine.


> Single character does not match high value unicode character with collation 
> TERRITORY_BASED
> -------------------------------------------------------------------------------------------
>
>                 Key: DERBY-2967
>                 URL: https://issues.apache.org/jira/browse/DERBY-2967
>             Project: Derby
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 10.4.0.0
>            Reporter: Kathey Marsden
>            Assignee: Mamta A. Satoor
>         Attachments: DERBY2967_offset_based_diff_Oct02_07.txt, 
> DERBY2967_offset_based_stat_Oct02_07.txt, fullcoll.out, 
> patch2_setOffset_fullcoll.out, patch2_with_setOffset_diff_Sep2007.txt, 
> patch2_with_setOffset_stat_Sep2007.txt, step1_iteratorbased_Sep1507_diff.txt, 
> step1_iteratorbased_Sep1507_stat.txt, temp_diff.txt, temp_stat.txt, 
> TestFrench.java, TestNorway.java
>
>
> With TERRITORY_BASED collation '_' does not match  the character \uFA2D.  It 
> is the same for english or norwegian. FOR collation UCS_BASIC it matches 
> fine.  Could you tell me if this is a bug?
> Here is a program to reproduce.
> import java.sql.*;
> public class HighCharacter {
>    public static void main(String args[]) throws Exception
>    {
>    System.out.println("\n Territory no_NO");
>    Class.forName("org.apache.derby.jdbc.EmbeddedDriver");
>    Connection conn = 
> DriverManager.getConnection("jdbc:derby:nordb;create=true;territory=no_NO;collation=TERRITORY_BASED");
>    testLikeWithHighestValidCharacter(conn);
>    conn.close();
>    System.out.println("\n Territory en_US");
>    conn = 
> DriverManager.getConnection("jdbc:derby:endb;create=true;territory=en_US;collation=TERRITORY_BASED");
>    testLikeWithHighestValidCharacter(conn);
>    conn.close();
>    System.out.println("\n Collation USC_BASIC");
>    conn = DriverManager.getConnection("jdbc:derby:basicdb;create=true");
>    testLikeWithHighestValidCharacter(conn);
>    }
> public static  void testLikeWithHighestValidCharacter(Connection conn) throws 
> SQLException {
>    Statement stmt = conn.createStatement();
>    try {
>    stmt.executeUpdate("drop table t1");
>    }catch (SQLException se)
>    {// drop failure ok.
>    }
>    stmt.executeUpdate("create table t1(c11 int)");
>    stmt.executeUpdate("insert into t1 values 1");
>  
>    // \uFA2D - the highest valid character according to
>    // Character.isDefined() of JDK 1.4;
>    PreparedStatement ps =
>    conn.prepareStatement("select 1 from t1 where '\uFA2D' like ?");
>      String[] match = { "%", "_", "\uFA2D" };
>    for (int i = 0; i < match.length; i++) {
>    System.out.println("select 1 from t1 where '\\uFA2D' like " + match[i]);
>    ps.setString(1, match[i]);
>    ResultSet rs = ps.executeQuery();
>    if( rs.next() && rs.getString(1).equals("1"))
>        System.out.println("PASS");
>    else          System.out.println("FAIL: no match");
>    rs.close();
>    }
>   }
> }
> Mamta made some comments on this issue in the following thread:
> http://www.nabble.com/Single-character-does-not-match-high-value-unicode-character-with-collation-TERRITORY_BASED.-Is-this-a-bug-tf4118767.html

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (DERBY-2967) Single character does not match high value unicode character with collation TERRITORY_BASED

Reply via email to