[jira] Commented: (LUCENE-2369) Locale-based sort by field with low memory overhead

Robert Muir (JIRA) Wed, 01 Sep 2010 06:50:22 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905026#action_12905026
 ]


Robert Muir commented on LUCENE-2369:
-------------------------------------

bq. Do they or do they not need to be loaded into heap in order to be used for 
sorted search?

They are just regular terms! You can do a TermQuery on them, sort them as 
byte[], etc. 
It's just that the bytes use 'collation encoding' instead of 'utf-8 encoding'.
This is why I want to factor out the whole 'locale' thing from the issue: since 
sorting is agnostic to what's in the byte[], it's unrelated, and it would simplify 
the issue to just discuss that.
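
To illustrate the point about 'collation encoding', here is a minimal sketch using only the JDK's java.text.Collator, not any Lucene API. The class and method names (CollationKeyBytes, encode, compareBytes, sortByKeyBytes) are hypothetical and chosen for illustration. It assumes the documented property of CollationKey.toByteArray() that comparing the byte arrays, most significant byte first, gives the same result as comparing the keys.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class CollationKeyBytes {

    /** Encodes a term as collation-key bytes instead of UTF-8 bytes. */
    static byte[] encode(Collator collator, String term) {
        return collator.getCollationKey(term).toByteArray();
    }

    /** Unsigned lexicographic byte comparison; it knows nothing about locales. */
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    /** Sorts terms purely by their encoded bytes. */
    static List<String> sortByKeyBytes(Collator collator, List<String> terms) {
        List<String> sorted = new ArrayList<>(terms);
        sorted.sort((x, y) -> compareBytes(encode(collator, x), encode(collator, y)));
        return sorted;
    }

    public static void main(String[] args) {
        Collator english = Collator.getInstance(Locale.ENGLISH);
        List<String> terms = Arrays.asList("cherry", "Banana", "apple");
        // The locale is consulted only while encoding; the sort itself sees bytes.
        System.out.println(sortByKeyBytes(english, terms));
    }
}
```

The collator is only involved when the bytes are produced. Everything after that is an ordinary byte[] comparison, which is why the sorting discussion can be separated from the locale discussion.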

bq. Easy now. The whole runtime-vs-index-time issue is something that I don't 
care much about at this point. Pre-sorting can be done both at index and search 
time. Let's just say that we do it at index-time and go from there.

Well, the thing is, it's something I care a lot about. The problems are:
* Users who develop localized applications tend to use methods with 
Locale/Collator parameters if they are available: it's best practice.
* In the case of Lucene, it is not best practice, but a silly trap (as you get 
horrible performance).
* However, users are used to the concept of collation keys with respect to 
indexing (e.g. when building a database index).
* The APIs here are wrong anyway: they shouldn't take a Locale but a Collator. 
There is no way to set strength or any other options, and there's no way to 
supply a Collator I made myself (e.g. from RuleBasedCollator).
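
The last point can be sketched with the JDK's java.text classes. An API that accepts only a Locale can only ever obtain the default collator for that locale, while one that accepts a Collator also admits a collator with a different strength or with custom rules. The class CollatorOptions and its methods, including sortTerms, are hypothetical names for illustration, not Lucene API, and the rule string is a toy example.

```java
import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class CollatorOptions {

    /** A collator that ignores case and accent differences (primary strength). */
    static Collator primaryStrength(Locale locale) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(Collator.PRIMARY);
        return collator;
    }

    /** A hand-made collator with toy rules: c sorts before b, b before a. */
    static Collator customRules() {
        try {
            return new RuleBasedCollator("< c < b < a");
        } catch (ParseException e) {
            throw new IllegalStateException("invalid collation rules", e);
        }
    }

    /** Hypothetical API shape: takes a Collator, so any of the above can be passed. */
    static List<String> sortTerms(List<String> terms, Collator collator) {
        List<String> sorted = new ArrayList<>(terms);
        sorted.sort(collator); // Collator implements Comparator<Object>
        return sorted;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("a", "b", "c");
        System.out.println(sortTerms(terms, Collator.getInstance(Locale.ENGLISH)));
        System.out.println(sortTerms(terms, customRules()));
        System.out.println(primaryStrength(Locale.ENGLISH).compare("abc", "ABC"));
    }
}
```

Neither the primary-strength collator nor the rule-based one can be expressed through a parameter of type Locale alone.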


> Locale-based sort by field with low memory overhead
> ---------------------------------------------------
>
>                 Key: LUCENE-2369
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2369
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Search
>            Reporter: Toke Eskildsen
>            Priority: Minor
>
> The current implementation of locale-based sort in Lucene uses the FieldCache 
> which keeps all sort terms in memory. Besides the huge memory overhead, 
> searching requires comparison of terms with collator.compare every time, 
> making searches with millions of hits fairly expensive.
> This proposed alternative implementation is to create a packed list of 
> pre-sorted ordinals for the sort terms and a map from document-IDs to entries 
> in the sorted ordinals list. This results in very low memory overhead and 
> faster sorted searches, at the cost of increased startup-time. As the 
> ordinals can be resolved to terms after the sorting has been performed, this 
> approach supports fillFields=true.
> This issue is related to https://issues.apache.org/jira/browse/LUCENE-2335 
> which contains previous discussions on the subject.
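
The idea in the issue description can be sketched roughly as follows. This is not the code attached to LUCENE-2369: it uses plain arrays where the proposal uses packed structures, assumes exactly one sort term per document, and every name in it (SortedOrdinals, docToOrd, sortHits, termFor) is hypothetical. The collator is consulted only while the structure is built. Sorting hits afterwards compares ints, and terms are resolved from ordinals only once sorting is done.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class SortedOrdinals {
    final String[] sortedTerms; // ordinal -> term, in collator order
    final int[] docToOrd;       // document ID -> ordinal of its sort term

    /** Startup cost: one collator-based sort over the distinct terms. */
    SortedOrdinals(String[] termByDoc, Collator collator) {
        sortedTerms = Arrays.stream(termByDoc).distinct().toArray(String[]::new);
        Arrays.sort(sortedTerms, collator);

        Map<String, Integer> ordOf = new HashMap<>();
        for (int ord = 0; ord < sortedTerms.length; ord++) {
            ordOf.put(sortedTerms[ord], ord);
        }
        docToOrd = new int[termByDoc.length];
        for (int doc = 0; doc < termByDoc.length; doc++) {
            docToOrd[doc] = ordOf.get(termByDoc[doc]);
        }
    }

    /** Search time: hits are ordered by comparing ints, never by the collator. */
    int[] sortHits(int[] hits) {
        return Arrays.stream(hits).boxed()
                .sorted(Comparator.comparingInt((Integer doc) -> docToOrd[doc]))
                .mapToInt(Integer::intValue)
                .toArray();
    }

    /** Resolves a document's sort term after sorting (cf. fillFields=true). */
    String termFor(int doc) {
        return sortedTerms[docToOrd[doc]];
    }

    public static void main(String[] args) {
        String[] termByDoc = {"cherry", "apple", "Banana", "apple"};
        SortedOrdinals ords = new SortedOrdinals(termByDoc, Collator.getInstance(Locale.ENGLISH));
        for (int doc : ords.sortHits(new int[] {0, 1, 2, 3})) {
            System.out.println(doc + " " + ords.termFor(doc));
        }
    }
}
```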
