[
https://issues.apache.org/jira/browse/LUCENE-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13018374#comment-13018374
]
Steven Rowe commented on LUCENE-2798:
-------------------------------------
bq. Without looking too hard (are these hex values?)
No, it's just the output from Arrays.toString(int[]), which outputs decimal.
bq. in your debugging it would be useful to print the sort key as well.
Agreed. Here's the output:
{quote}
java.lang.AssertionError: -----------
Indexed string #0: [32]
Indexed collation key: [0, 0, 0, 119, 0, 0]
Sorted string #0: [28, 777]
Sorted collation key: [0, 0, 0, -101, 0, 0]
-----------
Indexed string #1: [28, 777]
Indexed collation key: [0, 0, 0, -101, 0, 0]
Sorted string #1: [32]
Sorted collation key: [0, 0, 0, 119, 0, 0]
Collator strength: SECONDARY Collator decomposition: NO_DECOMPOSITION
{quote}
(Again, the collation keys' byte arrays are printed with Arrays.toString(),
which is not ideal: Java bytes are signed, so values above 127 print as
negative numbers.)
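For reading the keys as unsigned values, a minimal sketch follows. The class and method names (UnsignedKeyDump, toUnsignedString) are hypothetical and not part of the attached patch, and the locale is chosen only for illustration; the strength and decomposition settings mirror the ones reported in the output above.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Locale;

// Hypothetical helper, not part of LUCENE-2798.patch.
class UnsignedKeyDump {
    // Formats each byte as an unsigned decimal value in the range 0..255.
    static String toUnsignedString(byte[] bytes) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) {
                sb.append(", ");
            }
            // Masking with 0xFF widens the signed byte to its unsigned value.
            sb.append(bytes[i] & 0xFF);
        }
        return sb.append(']').toString();
    }

    public static void main(String[] args) {
        // Assumption: Locale.ENGLISH is used here only as an example locale.
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(Collator.SECONDARY);
        collator.setDecomposition(Collator.NO_DECOMPOSITION);

        // Print the key for a single space (code point 32) as unsigned bytes.
        CollationKey key = collator.getCollationKey(" ");
        System.out.println(toUnsignedString(key.toByteArray()));

        // The signed byte -101 from the output above corresponds to 155.
        System.out.println(toUnsignedString(new byte[] {0, 0, 0, -101, 0, 0}));
    }
}
```

With this formatting, the key printed above as [0, 0, 0, -101, 0, 0] reads as [0, 0, 0, 155, 0, 0].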
bq. Are the sort keys the same?
No.
> Randomize indexed collation key testing
> ---------------------------------------
>
> Key: LUCENE-2798
> URL: https://issues.apache.org/jira/browse/LUCENE-2798
> Project: Lucene - Java
> Issue Type: Test
> Components: Analysis
> Affects Versions: 3.1, 4.0
> Reporter: Steven Rowe
> Assignee: Steven Rowe
> Priority: Minor
> Fix For: 4.0
>
> Attachments: LUCENE-2798.patch
>
>
> Robert Muir noted on the #lucene IRC channel today that Lucene's indexed
> collation key tests are currently fragile (for example, they had to be
> revisited when Robert upgraded the ICU dependency in LUCENE-2797 because of
> Unicode 6.0 collation changes) and that their coverage is trivial (only 5
> locales are tested, and no collator options are exercised). This affects
> both the JDK implementation in {{modules/analysis/common/}} and the ICU
> implementation under {{modules/icu/}}.
> The key thing to test is that the order of the indexed terms is the same as
> that provided by the Collator itself. Instead of the current set of static
> tests, this could be achieved via indexing randomly generated terms'
> collation keys (and collator options) and then comparing the index terms'
> order to the order provided by the Collator over the original terms.
> Since different terms may produce the same collation key, however, the order
> of indexed terms is inherently unstable. When performing runtime collation,
> the Collator addresses the sort stability issue by adding a secondary sort
> over the normalized original terms. In order to directly compare Collator's
> sort with Lucene's collation key sort, a secondary sort will need to be
> applied to Lucene's indexed terms as well. Robert has suggested indexing the
> original terms in addition to their collation keys, then using a Sort over
> the original terms as the secondary sort.
> Another complication: Lucene 3.X uses Java's UTF-16 term comparison, and
> trunk uses UTF-8 order, so the implemented secondary sort will need to
> respect that.
> From #lucene:
> {quote}
> rmuir__: so i think we have to on 3.x, sort the 'expected list' with
> Collator.compare, if thats equal, then as a tiebreak use String.compareTo
> rmuir__: and in the index sort on the collated field, followed by the
> original term
> rmuir__: in 4.x we do the same thing, but dont use String.compareTo as the
> tiebreak for the expected list
> rmuir__: instead compare codepoints (iterating character.codepointAt, or
> comparing .getBytes("UTF-8"))
> {quote}
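The two tiebreaks described in the IRC excerpt above could be sketched roughly as follows. This is an illustration only, not the code in the attached patch: the class and method names (ExpectedOrder, expected3x, expected4x, compareCodePoints) are hypothetical. It assumes well-formed strings (no unpaired surrogates), for which code point order matches UTF-8 byte order.

```java
import java.text.Collator;
import java.util.Comparator;

// Hypothetical sketch of the "expected list" ordering; not from the patch.
class ExpectedOrder {
    // Compares two strings by Unicode code point. For well-formed strings
    // this gives the same order as comparing their UTF-8 bytes.
    static int compareCodePoints(String a, String b) {
        int i = 0;
        int j = 0;
        while (i < a.length() && j < b.length()) {
            int ca = a.codePointAt(i);
            int cb = b.codePointAt(j);
            if (ca != cb) {
                return Integer.compare(ca, cb);
            }
            i += Character.charCount(ca);
            j += Character.charCount(cb);
        }
        // One string is a prefix of the other: the shorter one sorts first.
        return Integer.compare(a.length() - i, b.length() - j);
    }

    // 3.x: Collator order, then String.compareTo (UTF-16 code unit order).
    static Comparator<String> expected3x(Collator collator) {
        return (a, b) -> {
            int cmp = collator.compare(a, b);
            return cmp != 0 ? cmp : a.compareTo(b);
        };
    }

    // 4.x (trunk): Collator order, then code point order (UTF-8 order).
    static Comparator<String> expected4x(Collator collator) {
        return (a, b) -> {
            int cmp = collator.compare(a, b);
            return cmp != 0 ? cmp : compareCodePoints(a, b);
        };
    }
}
```

The two tiebreaks differ only when supplementary characters are involved. For example, U+FF5E sorts after U+1D538 under String.compareTo, because the latter's leading surrogate \uD835 is less than \uFF5E, but before it in code point order.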
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira