[ 
https://issues.apache.org/jira/browse/LUCENE-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905002#action_12905002
 ] 

Robert Muir commented on LUCENE-2369:
-------------------------------------

bq. I was thinking aggregation, but you are right. For aggregation one would of 
course just use the keys and have no need for the original Strings. Then we're 
left with federated search.

I don't see why federated search needs anything but sort keys?

bq. That is the memory overhead. If you have 20M terms of average length 10 
chars, that is 400MB in raw bytes and quite a bit more when you're taking 
pointers into account.

The "memory" overhead is no different from the "overhead" of regular terms; 
there is nothing special about the collation key case. This is my point (see 
below). And in practice, for most people, it's encoded as way less than 2 
bytes/char.

{quote}
I fail to see why that is a bad thing if we're looking at the rare scenario of 
having to postpone the sorting decision to search time. What is the 
alternative? Right now, search-time collator-based sorting with field cache has 
low startup time, high memory usage and horrible execution time for large 
results.
{quote}

Because "search-time" collator-sorting is the wrong approach, and should not 
exist at all.

Once we fix LUCENE-2551, indexing with collation keys has:
* same startup time as regular terms
* approximately the same memory usage as regular terms [e.g. PRIMARY key for 
"Robert Muir" is 12 bytes versus 11 bytes]
* same execution time (binary compare) as regular terms
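
To illustrate the "binary compare" point, here is a rough sketch using the JDK's java.text.Collator. This is an assumption made for the sake of a self-contained example: it is not ICU and not Lucene's collation classes, so the key sizes it produces will not match the 12-byte figure above. The class and method names (CollationKeyDemo, sortKey, compareBytes) are hypothetical. It shows that once the keys are computed, a plain unsigned byte comparison gives the same order as collator.compare.

```java
import java.text.Collator;
import java.util.Locale;

public class CollationKeyDemo {

    /** Unsigned lexicographic comparison of two byte arrays (a plain binary compare). */
    public static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }

    /** Computes the collation key bytes once; this is the part that would happen at index time. */
    public static byte[] sortKey(Collator collator, String s) {
        return collator.getCollationKey(s).toByteArray();
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(Collator.PRIMARY);

        String[] names = {"Robert Muir", "Toke Eskildsen", "apple", "Banana"};
        for (String a : names) {
            for (String b : names) {
                int viaCollator = Integer.signum(collator.compare(a, b));
                int viaBytes = Integer.signum(
                        compareBytes(sortKey(collator, a), sortKey(collator, b)));
                System.out.println(a + " vs " + b
                        + ": collator=" + viaCollator + " bytes=" + viaBytes);
            }
        }
    }
}
```

The collator is only consulted when the key is built. Every comparison after that is a byte-array compare, the same operation used for regular terms.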


> Locale-based sort by field with low memory overhead
> ---------------------------------------------------
>
>                 Key: LUCENE-2369
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2369
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Search
>            Reporter: Toke Eskildsen
>            Priority: Minor
>
> The current implementation of locale-based sort in Lucene uses the FieldCache 
> which keeps all sort terms in memory. Besides the huge memory overhead, 
> searching requires comparison of terms with collator.compare every time, 
> making searches with millions of hits fairly expensive.
> This proposed alternative implementation is to create a packed list of 
> pre-sorted ordinals for the sort terms and a map from document-IDs to entries 
> in the sorted ordinals list. This results in very low memory overhead and 
> faster sorted searches, at the cost of increased startup-time. As the 
> ordinals can be resolved to terms after the sorting has been performed, this 
> approach supports fillFields=true.
> This issue is related to https://issues.apache.org/jira/browse/LUCENE-2335 
> which contains previous discussions on the subject.
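
The ordinal scheme described in the issue summary above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patch attached to the issue: it assumes one sort term per document, uses a plain int[] where the proposal uses a packed structure, and the class and method names (OrdinalSortSketch, sortHits, termFor) are hypothetical.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

public class OrdinalSortSketch {
    private final String[] sortedTerms; // unique terms in collator order
    private final int[] docToOrd;       // docToOrd[docId] = index into sortedTerms

    /** Startup cost: a single collator-based sort over the unique terms. */
    public OrdinalSortSketch(String[] termByDoc, Collator collator) {
        sortedTerms = Arrays.stream(termByDoc)
                .distinct()
                .sorted(collator)
                .toArray(String[]::new);

        Map<String, Integer> ordOf = new HashMap<>();
        for (int i = 0; i < sortedTerms.length; i++) {
            ordOf.put(sortedTerms[i], i);
        }

        docToOrd = new int[termByDoc.length];
        for (int doc = 0; doc < termByDoc.length; doc++) {
            docToOrd[doc] = ordOf.get(termByDoc[doc]);
        }
    }

    /** Search-time sort: compares integers only, the collator is not called. */
    public int[] sortHits(int[] hits) {
        return Arrays.stream(hits)
                .boxed()
                .sorted(Comparator.comparingInt((Integer doc) -> docToOrd[doc]))
                .mapToInt(Integer::intValue)
                .toArray();
    }

    /** Resolves a document's ordinal back to its term after sorting (the fillFields case). */
    public String termFor(int docId) {
        return sortedTerms[docToOrd[docId]];
    }
}
```

Note that distinct strings which the collator treats as equal still receive different ordinals in this sketch. A real implementation would need to decide how to handle such ties.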

