[jira] Commented: (LUCENE-2335) optimization: when sorting by field, if index has one segment and field values are not needed, do not load String[] into field cache

Toke Eskildsen (JIRA) Fri, 26 Mar 2010 02:44:54 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850055#action_12850055
 ]


Toke Eskildsen commented on LUCENE-2335:
----------------------------------------

Just a little status on development.

The SegmentReader has been modified to implement the {{getOrdinalTerms}} stated 
above and it works as expected (no surprise, as this is just a refactoring and 
cleanup of the code from the proof of concept). I'm moving on to 
{{DirectoryReader}}. When that is done, I'll make a preliminary patch. For easy 
experimentation, I could try and make {{new Sort(new SortField("myfield", 
new Locale("da")));}} use the new code.

Sorting 10M terms still takes about 6 minutes on my machine. I've profiled the 
code and about 40% of the time is spent on requesting terms (with a cache of 
20000 terms, locale da, using a conventional hard disk). There's room for 
optimization by smarter caching of terms and of course by using a faster 
Collator than Java's default. Using simple {{String.compareTo}}, which leads to 
nearly sequential access to terms, takes 44 seconds for the 10M terms. The key 
to real improvement seems to be a better exploitation of the observation that 
"Unicode order is fairly aligned to most Locale-based orders". This is already 
paying off somewhat, as a dumb in-memory merge-sort of the 10M Strings takes 
about 10 minutes. I'll save that for later though.
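
As a rough illustration of sorting ordinals rather than Strings, the sketch 
below starts from terms that are already in {{String.compareTo}} order and 
reorders only their ordinals with a Collator. It is not the code from the 
proof of concept: the class and method names are made up for this example, 
and it holds all terms in memory instead of requesting them through a cache.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

/** Illustrative only: hypothetical names, not the patch code. */
class LocaleOrdinalSortSketch {

    /**
     * Returns the ordinals of the given terms, ordered by the collator.
     * Assumption: terms is already in String.compareTo order, so ordinal i
     * simply means "the i'th term in Unicode order".
     */
    static Integer[] sortOrdinals(String[] terms, Collator collator) {
        Integer[] ordinals = new Integer[terms.length];
        for (int i = 0; i < ordinals.length; i++) {
            ordinals[i] = i; // initial order is the Unicode order
        }
        // Arrays.sort on object arrays is a stable, adaptive merge sort in
        // current JDKs, so input that is already roughly in collator order
        // needs fewer comparisons than randomly ordered input.
        Arrays.sort(ordinals,
                (a, b) -> collator.compare(terms[a], terms[b]));
        return ordinals;
    }

    public static void main(String[] args) {
        Collator danish = Collator.getInstance(Locale.forLanguageTag("da"));
        String[] terms = {"Zebra", "abe", "bil", "æble", "øl"};
        for (int ordinal : sortOrdinals(terms, danish)) {
            System.out.println(ordinal + " " + terms[ordinal]);
        }
    }
}
```

Only the Integer[] is reordered, so the terms themselves can stay wherever 
they are stored as long as they can be looked up by ordinal.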

Now, for merges there are some decisions to make, the primary one being how to 
handle terms that are the same for different segments. For the first take, I'll 
avoid merging such terms and thus allow the iterator to deliver the same term 
more than once for {{DirectoryReader}}s, but with different ordinals. This 
should work fine for direct building of a docID->sortedOrdinalIndex map, usable 
for sorting, which is the primary use-case.
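
A minimal sketch of that idea, under stated assumptions: each segment is 
modelled by a hypothetical {{Segment}} class holding its terms in sorted order 
and a per-document term ordinal, and plain {{String.compareTo}} stands in for 
the comparator. None of these names come from Lucene or from the patch. Equal 
terms from different segments are not merged, so they end up with adjacent 
but distinct positions.

```java
import java.util.PriorityQueue;

/** Illustrative only: hypothetical names, not Lucene API. */
class MultiSegmentOrdinalSketch {

    /** Stand-in for one segment: sorted terms plus a doc -> term ordinal map. */
    static final class Segment {
        final String[] sortedTerms;
        final int[] docToTermOrdinal;

        Segment(String[] sortedTerms, int[] docToTermOrdinal) {
            this.sortedTerms = sortedTerms;
            this.docToTermOrdinal = docToTermOrdinal;
        }
    }

    /** Builds a global docID -> position-in-sorted-order map. */
    static int[] buildDocToSortedIndex(Segment[] segments) {
        int[] termBase = new int[segments.length]; // first global ordinal per segment
        int[] docBase = new int[segments.length];  // first global docID per segment
        int totalTerms = 0;
        int totalDocs = 0;
        for (int s = 0; s < segments.length; s++) {
            termBase[s] = totalTerms;
            docBase[s] = totalDocs;
            totalTerms += segments[s].sortedTerms.length;
            totalDocs += segments[s].docToTermOrdinal.length;
        }

        // Queue entries are {segmentIndex, localOrdinal}; ties between equal
        // terms are broken by segment index so the order is deterministic.
        PriorityQueue<int[]> queue = new PriorityQueue<>((a, b) -> {
            int c = segments[a[0]].sortedTerms[a[1]]
                    .compareTo(segments[b[0]].sortedTerms[b[1]]);
            return c != 0 ? c : Integer.compare(a[0], b[0]);
        });
        for (int s = 0; s < segments.length; s++) {
            if (segments[s].sortedTerms.length > 0) {
                queue.add(new int[] {s, 0});
            }
        }

        // Merge the per-segment term streams; the same term may come out
        // several times, each time with a different global ordinal.
        int[] sortedPosition = new int[totalTerms];
        int position = 0;
        while (!queue.isEmpty()) {
            int[] head = queue.poll();
            sortedPosition[termBase[head[0]] + head[1]] = position++;
            if (head[1] + 1 < segments[head[0]].sortedTerms.length) {
                head[1]++;
                queue.add(head);
            }
        }

        int[] docToSortedIndex = new int[totalDocs];
        for (int s = 0; s < segments.length; s++) {
            int[] local = segments[s].docToTermOrdinal;
            for (int d = 0; d < local.length; d++) {
                docToSortedIndex[docBase[s] + d] =
                        sortedPosition[termBase[s] + local[d]];
            }
        }
        return docToSortedIndex;
    }
}
```

Comparing two documents then reduces to comparing two ints from the returned 
array; no Strings are needed at search time.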

If the structure is to be used for faceting, the term instances need to be 
collapsed. Luckily the logic for doing so resides in the code that uses the 
iterator, rather than internally in the readers. Using Exposed for 
memory-efficient faceting should still be feasible with the general code. If 
the order of the tags in the facet is not significant, the 
Unicode-order-comparator can be used, making it very fast (1-2 minutes / 10M 
unique terms) to build the structure for a given field. ...But I digress and 
will try and get back on track with the real code.
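
To illustrate collapsing on the consumer side, here is a small sketch that 
takes an iterator delivering term instances in sorted order (possibly the same 
term several times) and folds adjacent duplicates into one entry with a summed 
count. The {{TermCount}} class and the {{collapse}} method are hypothetical 
and exist only for this example.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Illustrative only: hypothetical names, not the patch code. */
class FacetCollapseSketch {

    /** One term instance together with the number of documents using it. */
    static final class TermCount {
        final String term;
        final int count;

        TermCount(String term, int count) {
            this.term = term;
            this.count = count;
        }
    }

    /**
     * Collapses adjacent instances of the same term. Assumption: the
     * iterator delivers instances in sorted order, so duplicates of a
     * term are always next to each other.
     */
    static List<TermCount> collapse(Iterator<TermCount> sortedInstances) {
        List<TermCount> collapsed = new ArrayList<>();
        String currentTerm = null;
        int currentCount = 0;
        while (sortedInstances.hasNext()) {
            TermCount next = sortedInstances.next();
            if (currentTerm != null && currentTerm.equals(next.term)) {
                currentCount += next.count; // same term from another segment
            } else {
                if (currentTerm != null) {
                    collapsed.add(new TermCount(currentTerm, currentCount));
                }
                currentTerm = next.term;
                currentCount = next.count;
            }
        }
        if (currentTerm != null) {
            collapsed.add(new TermCount(currentTerm, currentCount));
        }
        return collapsed;
    }
}
```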

> optimization: when sorting by field, if index has one segment and field 
> values are not needed, do not load String[] into field cache
> ------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2335
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2335
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>            Reporter: Michael McCandless
>            Priority: Minor
>             Fix For: 3.1
>
>
> Spinoff from java-dev thread "Sorting with little memory: A suggestion", 
> started by Toke Eskildsen.
> When sorting by SortField.STRING we currently ask FieldCache for a 
> StringIndex on that field.
> This can consume tons of RAM when the values are mostly unique (eg a title 
> field), as it populates both int[] ords as well as String[] values.
> But, if the index is only one segment, and the search sets fillFields=false, 
> we don't need the String[] values, just the int[] ords.  If the app needs to 
> show the fields it can pull them (for the 1 page) from stored fields.
> This can be a potent optimization -- a lot of RAM saved -- for optimized 
> indexes.
> When fixing this we must take care to share the int[] ords if some queries do 
> fillFields=true and some =false... ie, FieldCache will be called twice and it 
> should share the int[] ords across those invocations.
