[jira] Created: (LUCENE-2798) Randomize collation testing

Steven Rowe (JIRA) Sat, 04 Dec 2010 08:59:37 -0800

Randomize collation testing
---------------------------

                 Key: LUCENE-2798
                 URL: https://issues.apache.org/jira/browse/LUCENE-2798
             Project: Lucene - Java
          Issue Type: Test
          Components: contrib/*
    Affects Versions: 3.1, 4.0
            Reporter: Steven Rowe
            Priority: Minor
             Fix For: 3.1, 4.0



Robert Muir noted on the #lucene IRC channel today that Lucene's indexed 
collation key tests are currently fragile (for example, they had to be 
revisited when Robert upgraded the ICU dependency in LUCENE-2797 because of 
Unicode 6.0 collation changes) and that their coverage is trivial (only 5 
locales are tested, and no collator options are exercised).  This affects both 
the JDK implementation in {{modules/analysis/common/}} and the ICU 
implementation under {{modules/icu/}}.

The key thing to test is that the order of the indexed terms is the same as 
that provided by the Collator itself.  Instead of the current set of static 
tests, this could be achieved via indexing randomly generated terms' collation 
keys (and collator options) and then comparing the index terms' order to the 
order provided by the Collator over the original terms.
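
As a rough illustration of the comparison described above, here is a minimal 
JDK-only sketch.  It does not touch a Lucene index: the unsigned byte order of 
{{CollationKey.toByteArray()}} stands in for the indexed term order, which is 
an assumption, and a real test would read the terms back from an index.  The 
class name, method names, and the small alphabet are all hypothetical.

```java
import java.text.Collator;
import java.util.Locale;
import java.util.Random;

public class RandomCollationCheck {

    /** Compares two byte arrays as unsigned bytes; a proper prefix sorts first. */
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }

    /** Builds a random term of 1..maxLength characters drawn from the alphabet. */
    static String randomTerm(Random random, String alphabet, int maxLength) {
        int length = 1 + random.nextInt(maxLength);
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(alphabet.charAt(random.nextInt(alphabet.length())));
        }
        return sb.toString();
    }

    /**
     * Counts random term pairs for which the collation key byte order
     * disagrees with Collator.compare.  Pairs that the Collator considers
     * equal are expected to have equal keys, so ties need no tiebreak here.
     */
    static int countMismatches(Collator collator, long seed, int pairs) {
        Random random = new Random(seed);
        String alphabet = "abcdeABCDE"; // hypothetical, deliberately small
        int mismatches = 0;
        for (int i = 0; i < pairs; i++) {
            String a = randomTerm(random, alphabet, 8);
            String b = randomTerm(random, alphabet, 8);
            int expected = Integer.signum(collator.compare(a, b));
            int actual = Integer.signum(compareUnsigned(
                collator.getCollationKey(a).toByteArray(),
                collator.getCollationKey(b).toByteArray()));
            if (expected != actual) {
                mismatches++;
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(Collator.PRIMARY); // one collator option to vary
        System.out.println(countMismatches(collator, 42L, 1000));
    }
}
```

Fixing the seed keeps a failing run reproducible; a randomized test would 
also vary the locale, strength, and decomposition mode per run.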

Since different terms may produce the same collation key, however, the order of 
indexed terms is inherently unstable.  When performing runtime collation, the 
Collator addresses the sort stability issue by adding a secondary sort over the 
normalized original terms.  In order to directly compare Collator's sort with 
Lucene's collation key sort, a secondary sort will need to be applied to 
Lucene's indexed terms as well. Robert has suggested indexing the original 
terms in addition to their collation keys, then using a Sort over the original 
terms as the secondary sort.

Another complication: Lucene 3.X uses Java's UTF-16 term comparison, and trunk 
uses UTF-8 order, so the implemented secondary sort will need to respect that.

From #lucene:
{quote}
rmuir__: so i think we have to on 3.x, sort the 'expected list' with 
Collator.compare, if thats equal, then as a tiebreak use String.compareTo
rmuir__: and in the index sort on the collated field, followed by the original 
term
rmuir__: in 4.x we do the same thing, but dont use String.compareTo as the 
tiebreak for the expected list
rmuir__: instead compare codepoints (iterating character.codepointAt, or 
comparing .getBytes("UTF-8"))
{quote}
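
The tiebreak rule from the quote above could be sketched roughly as follows 
for the 'expected list'.  This is only an illustration using the JDK 
Collator; the class name, {{compareCodePoints}}, {{expectedOrder}}, and the 
{{utf16Tiebreak}} flag are hypothetical names, not existing Lucene API.

```java
import java.text.Collator;
import java.util.Comparator;

public class ExpectedOrder {

    /** Compares two strings by Unicode code point, which matches UTF-8 byte order. */
    static int compareCodePoints(String a, String b) {
        int i = 0;
        int j = 0;
        while (i < a.length() && j < b.length()) {
            int ca = a.codePointAt(i);
            int cb = b.codePointAt(j);
            if (ca != cb) {
                return Integer.compare(ca, cb);
            }
            i += Character.charCount(ca);
            j += Character.charCount(cb);
        }
        // One string is a prefix of the other (or both are exhausted).
        return Integer.compare(a.length() - i, b.length() - j);
    }

    /**
     * Orders terms with Collator.compare and breaks ties on the original term:
     * String.compareTo (UTF-16 order) when utf16Tiebreak is true, as suggested
     * for 3.x, otherwise code point order, as suggested for 4.x.
     */
    static Comparator<String> expectedOrder(Collator collator, boolean utf16Tiebreak) {
        return (a, b) -> {
            int c = collator.compare(a, b);
            if (c != 0) {
                return c;
            }
            return utf16Tiebreak ? a.compareTo(b) : compareCodePoints(a, b);
        };
    }
}
```

The two tiebreaks only disagree when supplementary characters are involved: 
U+FF5E sorts after U+1D400 under {{String.compareTo}}, because the surrogate 
code units of U+1D400 are smaller than 0xFF5E, but before it in code point 
order.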


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
