[
https://issues.apache.org/jira/browse/DERBY-2967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mamta A. Satoor updated DERBY-2967:
-----------------------------------
Attachment: DERBY2967_offset_based_stat_Oct02_07.txt
DERBY2967_offset_based_diff_Oct02_07.txt
Attaching patch DERBY2967_offset_based_diff_Oct02_07.txt which fixes DERBY-2967
for both IBM's jdks and Sun's jdks. The fix involves using
CollationElementIterator to look through collation elements of a character
string as needed rather than fetching them all into an array before hand. With
the later approach, we may end up getting all the collation elements for a
really large string when we will be looking at say only first few of those
collation elements because we have found a mismatch between value and pattern.
The majority of the changes have gone into iapi.types.Like class. This class
has two different implementation of SQL LIKE clause. One implementation is for
UCS_BASIC character strings and the other one is for territory based character
strings. The territory based implementation of LIKE clause has been changed in
this patch. From a top level, this method does one to one comparison of
collation elements for string that is being searched and the pattern that is
being searched in that string. This one-to-one comparison is done for
non-metacharacters in the pattern. As soon as we find a metacharacter in the
pattern string, we do special processing depending on what kind of
metacharacter we are dealing with.
1)Taking the simplest case of escape character as the metacharacter. If the
user has provided an escape character, then as soon as we encounter the escape
character in the pattern string, we check if the character following it is a
metacharacter or
not. If not, we throw an exception. If the character indeed is a metacharacter,
we convert that metacharacter into it's collation element(s) and look for the
exact match of those collation element(s) into the CollationElementIterator for
the value string.
2)Taking the next simplest case of _. When we find a _ in the pattern string,
we advance in the CollationElementIterator for the value string by one
character. This is where it gets tricky ie what is defined as one character in
a particular locale. In Norwegian locale, 'b' is a single character and so are
'aa'. Both of these character strings translate into single collation element.
But in Norwegian locale, evne though '\uFA2D' is one character, it translates
into 2 collation elements. This advancing by one character and eating the right
number of collation elements is implemented in a new method in Like.java class
and the new method's name is advnaceByOne(CollationElementIterator).
3)Moving on to metacharacter %. A % can be satisfied by 0 to any number of
characters in the value string. In order to determine how many characters will
satisfy %, we have to start with 0 character and keep eating more and more
characters until we find a match for the remaining pattern string and value
string or until we reach end of value string but still non % characters left in
pattern string. In the later case, we know we have ended up with a mismatch and
we return a FALSE from the method.
The javadoc(from Like.java's like(CollationElementIterator, int, String, int,
String, RuleBasedCollator)) describes the implementation in little more detail
and I have copied that here for reference. Note that in the following method,
valueIterator is the CollationElementIterator for the character string that we
are doing a search into.
****************beginning of javadoc*****************
* This method will be called for character string types with territory
* based collation to see if valueIterator matches the passed pattern
* string. The logic of the method is described in following steps (I
* have tried to put the step identifiers in the code below for
reference).
* A)If pattern string or value Iterator is null, then this method will
* return null because the results of LIKE can't be established in
such
* a situation.
* B)Start the loop
* a)Check the lengths of valueIterator and pattern string to see if
it
* is time to return with TRUE/FALSE. The exact details of these
checks
* can be found in the javadoc method of
* checkLengths(CollationElementIterator, String, int)
* b)Start looking at pattern where the pointer is pointing and keep
* going until you find end of pattern or you find one of the
* metacharacters ie %, * or optional escape character.
* c)Get a CollationElementIterator for the non-metacharacters found
in
* step b(using the Collator passed to this method. The same
Collator
* was used to construct valueIterator).
* d)Make sure that collation elements found in step c) match the
* collation elements in valueIterator. A mismatch would require us
to
* return FALSE from this method.
* e)Check again the lengths of valueIterator and pattern string to
see
* if it is time toreturn with TRUE/FALSE. The exact details of
these
* checks can be found in the javadoc method of
* checkLengths(CollationElementIterator, String, int)
* f)If we are still here and didn't return from this method as a
result
* of step Be) then it means that we have to deal with the
* metacharacter found in step Bc). Check what metacharacter is the
* offset in pattern pointing to
* 1)If user has provided escape character and pattern is pointing
to
* one right now, then convert the next character in pattern to
it's
* collation element(s) and compare those collation elements to
* elements in valueIterator. If they do not match, we need to
return
* FALSE.
* 2)If it is not escape character, then check if it is a '%'. If
it is,
* then increment the pointer in the pattern string by 1 and then
* follow the involved algorithm below
* First check if we have reached the end of pattern. If yes,
then we
* can simply return from this method with TRUE return value. If
we
* have not reached end of pattern, then check if rest of the
* characters in pattern are all '%'. If yes, then we can simply
* return from this method wil TRUE return value. If not all %,
then
* take rest of the pattern string and see if it matches rest of
the
* valueIterator(will be implemented by recursively calling this
* method). If no match, then do the step Ba). If it is not time
to
* return because of step Ba), then advance the pointer in
* valueIterator by one character and see if the valueIterator
now
* matches the rest of the pattern string. Keep going until we
find
* the match or mismatch.
* 3)If it is not escape character or %, then check if it is a _.
If
* yes, then skip all the collation elements in valueIterator
* corresponding to the next character.
* g)Go back to step B).
****************end of javadoc*****************
I have changed SQLChar and WorkHorseForCollatorDatatypes to call this new
method for territory based character string types in Like rather than the old
implementation (I have removed the old implementation code from Like.java)
I have added a new test in CollationTest. The existing tests for LIKE in
CollationTest2 have been very handy during my testing of the code changes.
In addition to the above tests, I have run full collation test under Norwegian
territory. Following is the analysis of some of the test failures and fixes to
subset of them.
1)There are few existing tests that use character string 'aa' in their testing.
These existing test were written to run in UCS_BASIC collation and hence 'aa'
didn't cause a problem. But now when we run the full collation tests with say
Norwegian territory based locale, the character string 'aa' exhibit different
behavior(because in Norwegian, it is treated as one character) and hence give
the false impression of test failures. To avoid having to scan these false
failures every time one runs full
collation tests with Norwegian territory, I have changed the test data in some
tests. This has bring the test failures far lower when all the tests are run
with territory based locale rather than UCS_BASIC.
2)Few other failures in full collation test will be the error message string
mismatch when running the tests in UCS_BASIC
vs territory based. For eg, LOB.out has error messages like following
ERROR 42818: Comparisons between 'CHAR (UCS_BASIC)' and 'CLOB (UCS_BASIC)' are
not supported. Types must be comparable. String types must also have matching
collation. If collation does not match, a possible solution is to cast operands
to force them to the default collation (e.g. select tablename from
sys.systables where CAST(tablename as VARCHAR(128)) = 'T1')
This error message will have TERRITORY BASED rather than UCS_BASIC in it's text
when the test is run in territory based. So even though it might look like a
test failure when LOB test in run in territory based locale, it is really not a
test failure. I think there are 2 tests which will show diffs for this reason.
They are LOB.sql and implicitConversoins.
3)The diff in orderby.sql is also genuine because with UCS_BASIC database,
"Canada" sorts before "anaconda" but with
TERRITORY_BASED database, "anaconda" sorts before "Canada".
4)The diffs in Nist tests dml068 and dml079 look genuine too because data is
getting sorted in different order with TERRITORY_BASED database and UCS_BASIC
database.
5)DataSourceTest checks the format of Connection.toString and in case of a
database created with terriotry based collation, the format of database string
is alphabets/alphabets where as the test is looking for just alphabets for the
database name.
6)The diff in InListMultiProbeTest is expected because in a territory based
database _ sorts before '1' whereas in a UCS_BASIC database, _ sorts after '1'.
eg create and load a table using following script in both kinds of databases.
create table dellater(c1 char(1));
insert into dellater values('1'),('_'),('2');
Now, the result of following query differs in 2 kinds of databases
select * from dellater where c1 >= '_' order by c1;
TERRITORY BASED db
C1
----
_
1
2
3 rows selected
UCS_BASIC db
C1
----
_
1 row selected
7)Other than the ones mentioned above, there are few tests failing with assert
exception but I do not think they are related to
changes that are in this patch. Kathey, I think you have run full collation
tests in the past. I wonder if you had seen then
assert failures even before applying this current patch of mine.
> Single character does not match high value unicode character with collation
> TERRITORY_BASED
> -------------------------------------------------------------------------------------------
>
> Key: DERBY-2967
> URL: https://issues.apache.org/jira/browse/DERBY-2967
> Project: Derby
> Issue Type: Bug
> Components: SQL
> Affects Versions: 10.4.0.0
> Reporter: Kathey Marsden
> Assignee: Mamta A. Satoor
> Attachments: DERBY2967_offset_based_diff_Oct02_07.txt,
> DERBY2967_offset_based_stat_Oct02_07.txt, fullcoll.out,
> patch2_setOffset_fullcoll.out, patch2_with_setOffset_diff_Sep2007.txt,
> patch2_with_setOffset_stat_Sep2007.txt, step1_iteratorbased_Sep1507_diff.txt,
> step1_iteratorbased_Sep1507_stat.txt, temp_diff.txt, temp_stat.txt,
> TestFrench.java, TestNorway.java
>
>
> With TERRITORY_BASED collation '_' does not match the character \uFA2D. It
> is the same for english or norwegian. FOR collation UCS_BASIC it matches
> fine. Could you tell me if this is a bug?
> Here is a program to reproduce.
> import java.sql.*;
> public class HighCharacter {
> public static void main(String args[]) throws Exception
> {
> System.out.println("\n Territory no_NO");
> Class.forName("org.apache.derby.jdbc.EmbeddedDriver");
> Connection conn =
> DriverManager.getConnection("jdbc:derby:nordb;create=true;territory=no_NO;collation=TERRITORY_BASED");
> testLikeWithHighestValidCharacter(conn);
> conn.close();
> System.out.println("\n Territory en_US");
> conn =
> DriverManager.getConnection("jdbc:derby:endb;create=true;territory=en_US;collation=TERRITORY_BASED");
> testLikeWithHighestValidCharacter(conn);
> conn.close();
> System.out.println("\n Collation USC_BASIC");
> conn = DriverManager.getConnection("jdbc:derby:basicdb;create=true");
> testLikeWithHighestValidCharacter(conn);
> }
> public static void testLikeWithHighestValidCharacter(Connection conn) throws
> SQLException {
> Statement stmt = conn.createStatement();
> try {
> stmt.executeUpdate("drop table t1");
> }catch (SQLException se)
> {// drop failure ok.
> }
> stmt.executeUpdate("create table t1(c11 int)");
> stmt.executeUpdate("insert into t1 values 1");
>
> // \uFA2D - the highest valid character according to
> // Character.isDefined() of JDK 1.4;
> PreparedStatement ps =
> conn.prepareStatement("select 1 from t1 where '\uFA2D' like ?");
> String[] match = { "%", "_", "\uFA2D" };
> for (int i = 0; i < match.length; i++) {
> System.out.println("select 1 from t1 where '\\uFA2D' like " + match[i]);
> ps.setString(1, match[i]);
> ResultSet rs = ps.executeQuery();
> if( rs.next() && rs.getString(1).equals("1"))
> System.out.println("PASS");
> else System.out.println("FAIL: no match");
> rs.close();
> }
> }
> }
> Mamta made some comments on this issue in the following thread:
> http://www.nabble.com/Single-character-does-not-match-high-value-unicode-character-with-collation-TERRITORY_BASED.-Is-this-a-bug-tf4118767.html
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.