[ 
https://issues.apache.org/jira/browse/LUCENE-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13018353#comment-13018353
 ] 

Steven Rowe commented on LUCENE-2798:
-------------------------------------

bq. it may be the use of _TestUtil.randomUnicodeString here.

It may, but the first seed listed above produces this mismatch (strings are 
printed as arrays of code points):

{noformat}
java.lang.AssertionError: -----------
Indexed string #45: [141]
 Sorted string #45: [141]
-----------
Indexed string #46: [32]
 Sorted string #46: [28, 777]
-----------
Indexed string #47: [28, 777]
 Sorted string #47: [32]

Collator strength: SECONDARY  Collator decomposition: CANONICAL_DECOMPOSITION
{noformat}

#46 and #47 include neither supplementary chars nor problematic BMP chars.

I wrote a test including just [32] and [28,777] as indexed strings, and the 
same mismatch occurs for random locales, regardless of collator decomposition, 
and for all collator strengths except PRIMARY.
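
For anyone who wants to poke at these two strings outside of Lucene, here is a minimal sketch. It is not the test I wrote: the class and method names are made up for illustration, the code points are assumed to be decimal (as in the printout above), and Locale.ENGLISH is an arbitrary stand-in for a random locale. It only prints how the JDK Collator orders the pair directly and via its collation keys at each strength; it does not index anything.

```java
import java.text.Collator;
import java.util.Locale;

public class CollationRepro {
    /** Builds a String from decimal code points, e.g. fromCodePoints(28, 777). */
    static String fromCodePoints(int... cps) {
        return new String(cps, 0, cps.length);
    }

    /** Sign of Collator.compare(a, b). */
    static int compareSign(Collator c, String a, String b) {
        return Integer.signum(c.compare(a, b));
    }

    /** Sign of the comparison between the two strings' collation keys. */
    static int keySign(Collator c, String a, String b) {
        return Integer.signum(c.getCollationKey(a).compareTo(c.getCollationKey(b)));
    }

    public static void main(String[] args) {
        String a = fromCodePoints(32);       // indexed string #46 above
        String b = fromCodePoints(28, 777);  // indexed string #47 above

        int[] strengths = {
            Collator.PRIMARY, Collator.SECONDARY, Collator.TERTIARY, Collator.IDENTICAL
        };
        String[] names = {"PRIMARY", "SECONDARY", "TERTIARY", "IDENTICAL"};

        // Assumption: a fixed locale and decomposition, chosen only for illustration.
        Collator c = Collator.getInstance(Locale.ENGLISH);
        c.setDecomposition(Collator.CANONICAL_DECOMPOSITION);

        for (int i = 0; i < strengths.length; i++) {
            c.setStrength(strengths[i]);
            System.out.println(names[i]
                + ": compare=" + compareSign(c, a, b)
                + " keys=" + keySign(c, a, b));
        }
    }
}
```

If the two columns ever disagree for a given strength, the problem is in the collator itself rather than in how the keys are indexed; if they always agree, the indexing or sorting side is the more likely suspect.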


> Randomize indexed collation key testing
> ---------------------------------------
>
>                 Key: LUCENE-2798
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2798
>             Project: Lucene - Java
>          Issue Type: Test
>          Components: Analysis
>    Affects Versions: 3.1, 4.0
>            Reporter: Steven Rowe
>            Assignee: Steven Rowe
>            Priority: Minor
>             Fix For: 4.0
>
>         Attachments: LUCENE-2798.patch
>
>
> Robert Muir noted on the #lucene IRC channel today that Lucene's indexed 
> collation key testing is currently fragile (for example, the tests had to be 
> revisited when Robert upgraded the ICU dependency in LUCENE-2797 because of 
> Unicode 6.0 collation changes) and that coverage is trivial (only 5 locales 
> are tested, and no collator options are exercised).  This affects both the 
> JDK implementation in {{modules/analysis/common/}} and the ICU 
> implementation under {{modules/icu/}}.
>
> The key thing to test is that the order of the indexed terms is the same as 
> that provided by the Collator itself.  Instead of the current set of static 
> tests, this could be achieved by indexing the collation keys of randomly 
> generated terms (with randomly generated collator options) and then 
> comparing the order of the index terms to the order provided by the Collator 
> over the original terms.
>
> Since different terms may produce the same collation key, however, the order 
> of indexed terms is inherently unstable.  When performing runtime collation, 
> the Collator addresses the sort stability issue by adding a secondary sort 
> over the normalized original terms.  In order to directly compare Collator's 
> sort with Lucene's collation key sort, a secondary sort will need to be 
> applied to Lucene's indexed terms as well. Robert has suggested indexing the 
> original terms in addition to their collation keys, then using a Sort over 
> the original terms as the secondary sort.
>
> Another complication: Lucene 3.X uses Java's UTF-16 term comparison, and 
> trunk uses UTF-8 order, so the implemented secondary sort will need to 
> respect that.
>
> From #lucene:
> {quote}
> rmuir__: so i think we have to on 3.x, sort the 'expected list' with 
> Collator.compare, if thats equal, then as a tiebreak use String.compareTo
> rmuir__: and in the index sort on the collated field, followed by the 
> original term
> rmuir__: in 4.x we do the same thing, but dont use String.compareTo as the 
> tiebreak for the expected list
> rmuir__: instead compare codepoints (iterating character.codepointAt, or 
> comparing .getBytes("UTF-8"))
> {quote}
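
The tiebreak scheme from the IRC excerpt above could be sketched roughly as follows. This is an illustration only, not code from the attached patch: the class and method names are hypothetical, and the code point comparison assumes well-formed strings (no unpaired surrogates), in which case code point order and UTF-8 byte order agree.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

public class ExpectedOrder {
    /** Compares two strings by Unicode code point rather than by UTF-16 code unit. */
    static int compareCodePoints(String a, String b) {
        int i = 0, j = 0;
        while (i < a.length() && j < b.length()) {
            int ca = a.codePointAt(i), cb = b.codePointAt(j);
            if (ca != cb) return Integer.compare(ca, cb);
            i += Character.charCount(ca);
            j += Character.charCount(cb);
        }
        // One string is a prefix of the other (or they are equal).
        return Integer.compare(a.length() - i, b.length() - j);
    }

    /** 3.x-style expected order: Collator first, then String.compareTo (UTF-16 order). */
    static Comparator<String> utf16Tiebreak(Collator c) {
        return (a, b) -> {
            int r = c.compare(a, b);
            return r != 0 ? r : a.compareTo(b);
        };
    }

    /** 4.x-style expected order: Collator first, then code point order. */
    static Comparator<String> codePointTiebreak(Collator c) {
        return (a, b) -> {
            int r = c.compare(a, b);
            return r != 0 ? r : compareCodePoints(a, b);
        };
    }

    public static void main(String[] args) {
        // Assumption: locale and strength are arbitrary choices for the demo.
        Collator c = Collator.getInstance(Locale.ENGLISH);
        c.setStrength(Collator.PRIMARY);

        List<String> terms = new ArrayList<>(Arrays.asList("b", "A", "a", "B"));
        terms.sort(utf16Tiebreak(c));
        System.out.println("UTF-16 tiebreak:     " + terms);
        terms.sort(codePointTiebreak(c));
        System.out.println("code point tiebreak: " + terms);
    }
}
```

The two tiebreaks only differ when a supplementary character is compared against a BMP character at or above U+E000: in UTF-16 the surrogate pair sorts first, while in code point order the BMP character does.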

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
