[jira] Commented: (LUCENE-2369) Locale-based sort by field with low memory overhead

Toke Eskildsen (JIRA) Tue, 31 Aug 2010 13:55:18 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904765#action_12904765
 ]


Toke Eskildsen commented on LUCENE-2369:
----------------------------------------

{quote}
Toke, have you tried doing this 'build' at index time instead? I would 
recommend applying LUCENE-2551 and indexing with ICU Collation, strength=primary
{quote}

Robert, I'll sum up my understanding of the issue:
 * ICU collator keys make sorting very fast at the cost of some extra disk 
space, as one will probably want to store the original Term together with the 
key. They require a non-trivial memory overhead, in the ideal case as many bytes 
as there are characters in the terms. They work extremely well with reopening.
 * My experiment makes sorting relatively low-memory and extremely fast at the 
cost of very high pre-calculation time. Works halfway well with reopening as 
some structures are reused.
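
To make the first point concrete, here is a minimal sketch of key-based 
sorting. It rests on assumptions: it uses the JDK's java.text.Collator as a 
stand-in for ICU, and the names CollationKeySketch, sortKey and compareKeys are 
made up for this illustration. It is not the code from LUCENE-2551.

```java
import java.text.Collator;
import java.util.Locale;

final class CollationKeySketch {
    /** Computes the binary sort key once, e.g. at index time. */
    static byte[] sortKey(Collator collator, String term) {
        return collator.getCollationKey(term).toByteArray();
    }

    /** Unsigned bytewise comparison; no collator is needed at search time. */
    static int compareKeys(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(Collator.PRIMARY); // ignore case and accent differences
        byte[] apple = sortKey(collator, "apple");
        byte[] banana = sortKey(collator, "Banana");
        System.out.println(compareKeys(apple, banana) < 0);
    }
}
```

The keys are what costs memory or disk space: each term needs its key in 
addition to the original text if the text is to be shown in results.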

The two approaches are not in conflict and combining them would indeed seem to 
give many benefits. Moving the building of the structures to index-time seems 
fairly easy: If nothing else, it could just be a post-processing of the index.

ICU is clearly what's on people's minds when it comes to collator-based sorting. 
I can see that I have to do some Lucene standard vs. ICU vs. pre-calculated vs. 
ICU+pre-calculated tests to explore what the benefits of the different 
approaches are.

{quote}
Now that we can mostly do everything as bytes, I think this slow functionality 
to do collation/range query at 'runtime' might soon be on its way out of lucene 
(see patches on LUCENE-2514).
{quote}

No argument from me. I'll keep my work at the runtime level for now though, but 
that's just to avoid working on two fronts at the same time.

{quote}
Instead, I think its better to encourage users to index their content 
accordingly for the use cases they need.
{quote}

I agree that the sort-fields as well as the sort-locale are well known at index 
time in most cases.

> Locale-based sort by field with low memory overhead
> ---------------------------------------------------
>
>                 Key: LUCENE-2369
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2369
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Search
>            Reporter: Toke Eskildsen
>            Priority: Minor
>
> The current implementation of locale-based sort in Lucene uses the FieldCache 
> which keeps all sort terms in memory. Besides the huge memory overhead, 
> searching requires comparison of terms with collator.compare every time, 
> making searches with millions of hits fairly expensive.
> This proposed alternative implementation is to create a packed list of 
> pre-sorted ordinals for the sort terms and a map from document-IDs to entries 
> in the sorted ordinals list. This results in very low memory overhead and 
> faster sorted searches, at the cost of increased startup-time. As the 
> ordinals can be resolved to terms after the sorting has been performed, this 
> approach supports fillFields=true.
> This issue is related to https://issues.apache.org/jira/browse/LUCENE-2335 
> which contains previous discussions on the subject.
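
A rough sketch of the structure described in the issue text above, under 
stated assumptions: plain int[] arrays stand in for the packed lists, the JDK's 
java.text.Collator is used, and the class name OrdinalSortSketch and its 
members are made up for this illustration. It is not the code from the patch.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

final class OrdinalSortSketch {
    final String[] terms;    // unique terms in index (binary) order
    final int[] sortedOrds;  // term ordinals, ordered by the collator
    final int[] docToPos;    // docID -> position in sortedOrds

    OrdinalSortSketch(String[] terms, int[] docToTermOrd, Collator collator) {
        this.terms = terms;
        // The expensive startup step: sort the ordinals once with the collator.
        Integer[] ords = new Integer[terms.length];
        for (int i = 0; i < ords.length; i++) {
            ords[i] = i;
        }
        Arrays.sort(ords, (a, b) -> collator.compare(terms[a], terms[b]));

        sortedOrds = new int[ords.length];
        int[] ordToPos = new int[ords.length];
        for (int pos = 0; pos < ords.length; pos++) {
            sortedOrds[pos] = ords[pos];
            ordToPos[ords[pos]] = pos;
        }
        docToPos = new int[docToTermOrd.length];
        for (int doc = 0; doc < docToPos.length; doc++) {
            docToPos[doc] = ordToPos[docToTermOrd[doc]];
        }
    }

    /** Search-time comparison: two array lookups, no collator call. */
    int compare(int docA, int docB) {
        return Integer.compare(docToPos[docA], docToPos[docB]);
    }

    /** Resolves the term after sorting, as needed for fillFields=true. */
    String term(int doc) {
        return terms[sortedOrds[docToPos[doc]]];
    }

    public static void main(String[] args) {
        String[] terms = {"Banana", "apple", "cherry"}; // binary order
        int[] docToTermOrd = {2, 0, 1, 0};
        OrdinalSortSketch s = new OrdinalSortSketch(
                terms, docToTermOrd, Collator.getInstance(Locale.ENGLISH));
        System.out.println(s.term(2) + " sorts before " + s.term(1) + ": "
                + (s.compare(2, 1) < 0));
    }
}
```

The collator is only consulted while building the arrays, which is where the 
increased startup time comes from.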

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

