[jira] Commented: (LUCENE-1279) RangeQuery and RangeFilter should use collation to check for range inclusion

Steven Rowe (JIRA) Tue, 06 May 2008 06:11:29 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-1279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12594574#action_12594574 ]


Steven Rowe commented on LUCENE-1279:
-------------------------------------

bq. 1) you should be able to at least start the enumerator by skipping to a term 
consisting of the lowerTermField and the termText of "" ... even if the 
Collation of the term text is random, you still know which field you want.

I thought I did that - from the patch:

{code:java}
    TermEnum enumerator = reader.terms(new Term(getField(), ""));
    ...
  public String getField() {
    return (lowerTerm != null ? lowerTerm.field() : upperTerm.field());
  }
{code}

bq. 2) why can a collator only be specified by a Locale, why not just let 
people specify the Collator they want directly?

In the java-user thread that spawned this issue, I mentioned that this would be 
necessary for custom Collators.  I used Locale because it's simpler to specify, 
but you're right, directly specifying a Collator makes more sense.
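
To illustrate why a Locale alone is not enough, here is a minimal JDK-only sketch. The class name {{CollatorChoice}} and the toy rule string are hypothetical and are not part of the patch; the point is only that a {{RuleBasedCollator}} built from custom rules cannot be obtained through {{Collator.getInstance(Locale)}}, so an API that accepts a Collator directly covers both cases.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Locale;

// Hypothetical illustration class; not part of the LUCENE-1279 patch.
class CollatorChoice {

    // Locale-based lookup: limited to the collators the JDK ships.
    static Collator fromLocale(Locale locale) {
        return Collator.getInstance(locale);
    }

    // Custom rules: a toy ordering (c < b < a) that no Locale provides.
    static Collator customOrder() throws ParseException {
        return new RuleBasedCollator("< c < b < a");
    }

    public static void main(String[] args) throws ParseException {
        Collator english = fromLocale(Locale.ENGLISH);
        Collator custom = customOrder();
        // The two collators disagree on the relative order of "a" and "c".
        System.out.println("english: a before c? " + (english.compare("a", "c") < 0));
        System.out.println("custom:  c before a? " + (custom.compare("c", "a") < 0));
    }
}
```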

bq. 3) instead of adding a new public CollatingRangeQuery, would it make more 
sense to add an optional Collator to RangeQuery (and RangeFilter) which 
triggers a different code path when non null? (from a performance standpoint it 
would basically be one conditional check at the beginning of the rewrite method.)

This was my original thought, but since the performance impact could be large 
compared to a standard RangeQuery, I thought it made more sense to put it where 
it couldn't be used accidentally :).  I can redo it to integrate with the 
existing classes, though.
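
A rough sketch of what the integrated version's inclusion test might look like, assuming an inclusive range and a nullable Collator. {{RangeCheck}} and {{inRange}} are hypothetical names used only for illustration; this is not the patch code and does not use any Lucene API.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical illustration class; not part of the LUCENE-1279 patch.
class RangeCheck {

    // Inclusive range test. A null collator selects the standard path,
    // String.compareTo (UTF-16 code unit order); a non-null collator
    // selects the collation-aware path.
    static boolean inRange(String term, String lower, String upper, Collator collator) {
        if (collator == null) {
            return term.compareTo(lower) >= 0 && term.compareTo(upper) <= 0;
        }
        return collator.compare(term, lower) >= 0 && collator.compare(term, upper) <= 0;
    }

    public static void main(String[] args) {
        String eAcute = "\u00e9"; // e with acute accent
        System.out.println("no collator:     " + inRange(eAcute, "a", "z", null));
        System.out.println("French collator: "
                + inRange(eAcute, "a", "z", Collator.getInstance(Locale.FRENCH)));
    }
}
```

The single null check is cheap, but the collating path is where the cost lies: index terms are not stored in collation order, so every term in the field has to be tested rather than only those between the two endpoints.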

bq. 4) when i first saw the thread that spawned this issue, my first reaction 
was to wonder if it would make sense to start allowing a Collator to be 
specified when indexing, and to use the raw bytes from the CollationKey as the 
indexed value - I haven't thought it through very hard, but i wonder if that 
would be feasible (it seems like it would certainly be faster at query time, 
since it would allow more traditional term skipping).

I thought of something similar, but wow, this would be large.  It would require 
that the exact Collator used to generate the index terms also be used to 
generate CollationKeys for RangeQuery's/Filter's -- the Collator's rules would 
have to be stored in the index.  Also, how would binary CollationKey 
(de-)serialization fit into the String (de-)serialization currently in place 
for index terms?
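
For reference, a small JDK-only sketch of the property this idea relies on: the byte arrays returned by {{CollationKey.toByteArray()}} are documented to compare, most significant byte first, the same way the keys themselves do. {{CollationKeyBytes}}, {{keyBytes}} and {{compareUnsigned}} are hypothetical names; nothing here touches index serialization, which is the open question above.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical illustration class; not part of the LUCENE-1279 patch.
class CollationKeyBytes {

    // Raw sort-key bytes for the given text under the given collator.
    static byte[] keyBytes(Collator collator, String text) {
        return collator.getCollationKey(text).toByteArray();
    }

    // Lexicographic comparison treating each byte as unsigned.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        Collator french = Collator.getInstance(Locale.FRENCH);
        String eAcute = "\u00e9";
        int byBytes = compareUnsigned(keyBytes(french, eAcute), keyBytes(french, "z"));
        int byCollator = french.compare(eAcute, "z");
        System.out.println("same sign: "
                + (Integer.signum(byBytes) == Integer.signum(byCollator)));
    }
}
```

The keys are only comparable when they come from the same Collator, which is why the Collator's rules would have to travel with the index.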

My guess is that the functionality provided here is most useful for fields with 
a small number of terms, especially in the case of RangeQuery, which does not 
guard against exceeding the BooleanQuery clause limit.  Given this (IMHO most 
likely) scenario, the performance optimization you're talking about (and the 
attendant code complexity) probably isn't warranted.


> RangeQuery and RangeFilter should use collation to check for range inclusion
> ----------------------------------------------------------------------------
>
>                 Key: LUCENE-1279
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1279
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.3.1
>            Reporter: Steven Rowe
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1279.patch
>
>
> See [this java-user 
> discussion|http://www.nabble.com/lucene-farsi-problem-td16977096.html] of 
> problems caused by Unicode code-point comparison, instead of collation, in 
> RangeQuery.
> RangeQuery could take in a Locale via a setter, which could be used with a 
> java.text.Collator and/or CollationKey's, to handle ranges for languages 
> which have alphabet orderings different from those in Unicode.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


