[ 
https://issues.apache.org/jira/browse/LUCENE-1279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630806#action_12630806
 ] 

Steven Rowe commented on LUCENE-1279:
-------------------------------------

{quote}
from the Collator javadocs:
bq. When sorting a list of Strings however, it is generally necessary to 
compare each String multiple times. In this case, CollationKeys provide better 
performance. The CollationKey class converts a String to a series of bits that 
can be compared bitwise against other CollationKeys. A CollationKey is created 
by a Collator object for a given String. 

I don't think we need to implement this now, but I wonder if there is a 
performance difference if we created the CollationKey for comparison. The big 
question is whether the construction of that for each term outweighs the 
savings by repeated comparisons to lower and upper.
{quote}
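
To make the trade-off in the quote concrete, here is a minimal sketch of the two ways a single term could be tested against a range. The class and method names are hypothetical and are not taken from any Lucene patch; only the java.text calls are real.

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Locale;

// Hypothetical sketch; class and method names are not from Lucene.
class CollationRangeSketch {

    // Option A: ask the Collator to compare the term against each bound directly.
    static boolean inRangeByCompare(Collator collator, String term,
                                    String lower, String upper) {
        return collator.compare(term, lower) >= 0
            && collator.compare(term, upper) <= 0;
    }

    // Option B: the keys for the bounds are built once per query, but a new
    // CollationKey still has to be built for every term that is checked.
    static boolean inRangeByKey(Collator collator, String term,
                                CollationKey lowerKey, CollationKey upperKey) {
        CollationKey termKey = collator.getCollationKey(term);
        return termKey.compareTo(lowerKey) >= 0
            && termKey.compareTo(upperKey) <= 0;
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        CollationKey lowerKey = collator.getCollationKey("apple");
        CollationKey upperKey = collator.getCollationKey("cherry");
        for (String term : new String[] {"Banana", "date"}) {
            boolean byCodePoint =
                term.compareTo("apple") >= 0 && term.compareTo("cherry") <= 0;
            System.out.println(term
                + " compare=" + inRangeByCompare(collator, term, "apple", "cherry")
                + " key=" + inRangeByKey(collator, term, lowerKey, upperKey)
                + " codePoint=" + byCodePoint);
        }
    }
}
```

With an English Collator, "Banana" falls inside the range [apple, cherry], while plain code-point comparison puts it before "apple" because 'B' sorts before 'a'.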

I think the problem is that every single index term has to be converted to a 
CollationKey for every single (range) search.  In an earlier comment on this 
issue, Hoss said:

bq. 4) when i first saw the thread that spawned this issue, my first reaction 
was to wonder if it would make sense to start allowing a Collator to be 
specified when indexing, and to use the raw bytes from the CollationKey as the 
indexed value - I haven't thought it through very hard, but i wonder if that 
would be feasible (it seems like it would certainly be faster at query time, since 
it would allow more traditional term skipping).

I'm working on a utility class to store arbitrary binary data in sortable, 
indexable Strings, so that CollationKeys can be stored in the index.  IMHO, 
though, this issue should still go forward.
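
As a rough illustration of the idea only, and not the utility class mentioned above, the sketch below hex-encodes the bytes of a CollationKey. Two lowercase hex digits per byte is an assumption chosen for simplicity: it doubles the size, but it preserves unsigned byte order under plain String comparison. All names here are hypothetical.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical sketch; a real encoding would likely be far more compact.
class SortableKeySketch {

    // Encodes each byte as two lowercase hex digits. Since '0'..'9' sort
    // before 'a'..'f', String.compareTo on the result matches an unsigned,
    // most-significant-byte-first comparison of the input arrays.
    static String toSortableString(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            int v = b & 0xFF;
            sb.append(Character.forDigit(v >>> 4, 16));
            sb.append(Character.forDigit(v & 0x0F, 16));
        }
        return sb.toString();
    }

    // Builds the value that would be indexed in place of the raw term.
    static String indexableTerm(Collator collator, String term) {
        return toSortableString(collator.getCollationKey(term).toByteArray());
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        String a = indexableTerm(collator, "apple");
        String b = indexableTerm(collator, "Banana");
        // The sign of the String comparison should agree with the Collator.
        System.out.println("encoded: " + Integer.signum(b.compareTo(a)));
        System.out.println("collator: " + Integer.signum(collator.compare("Banana", "apple")));
    }
}
```

If terms were indexed this way, a range query could encode lower and upper once and then rely on ordinary term order, which is the query-time saving Hoss describes.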

bq. One more question, and it probably shows my lack of knowledge here, but 
would it be possible to enumerate the various codepoints where there are 
problems and just handle these separately, somehow? Basically, how pervasive is 
the problem? Would we perhaps be better off having a check to see if one of 
these bad codepoints falls in the range of lower/upper and then handle it 
separately?

Different languages, even when they share the same character repertoire, can 
define different orderings.  Also, I believe some orderings are 
context-dependent: you can't always compare character by character.  So 
handling this inside Lucene would duplicate much of what the Collator already 
does.
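
A small sketch of the context-dependence point: the tailoring below treats "ch" as a single letter sorting after "c", loosely modeled on traditional Spanish ordering. The rule string is a simplified assumption for illustration and is not a complete Spanish collation; the class name is hypothetical.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Locale;

// Hypothetical sketch; the added rule is illustrative, not a full locale.
class ContextualOrderSketch {

    // Returns an English collator extended so that "ch" is one collation
    // element placed after "c" and before "d".
    static RuleBasedCollator withChContraction() {
        RuleBasedCollator english =
            (RuleBasedCollator) Collator.getInstance(Locale.ENGLISH);
        try {
            return new RuleBasedCollator(
                english.getRules() + "& C < ch, cH, Ch, CH");
        } catch (ParseException e) {
            throw new IllegalStateException("rule string failed to parse", e);
        }
    }

    public static void main(String[] args) {
        Collator english = Collator.getInstance(Locale.ENGLISH);
        Collator tailored = withChContraction();
        // Same two strings, same characters, different relative order.
        System.out.println("english:  " + Integer.signum(english.compare("chico", "cubo")));
        System.out.println("tailored: " + Integer.signum(tailored.compare("chico", "cubo")));
    }
}
```

Under the tailored rules "cubo" sorts before "chico", because the pair "ch" is weighed as a unit rather than as 'c' followed by 'h'. A per-code-point check cannot express that.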

> RangeQuery and RangeFilter should use collation to check for range inclusion
> ----------------------------------------------------------------------------
>
>                 Key: LUCENE-1279
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1279
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.3.1
>            Reporter: Steven Rowe
>            Assignee: Grant Ingersoll
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1279.patch, LUCENE-1279.patch, LUCENE-1279.patch, 
> LUCENE-1279.patch
>
>
> See [this java-user 
> discussion|http://www.nabble.com/lucene-farsi-problem-td16977096.html] of 
> problems caused by Unicode code-point comparison, instead of collation, in 
> RangeQuery.
> RangeQuery could take in a Locale via a setter, which could be used with a 
> java.text.Collator and/or CollationKeys, to handle ranges for languages 
> which have alphabet orderings different from those in Unicode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

