[ https://issues.apache.org/jira/browse/LUCENE-2335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850055#action_12850055 ]
Toke Eskildsen commented on LUCENE-2335: ---------------------------------------- Just a little status on development. The SegmentReader has been modified to implement the {{getOrdinalTerms}} stated above and it works as expected (no surprise, as this is just a refactoring and cleanup of the code from the proof of concept). I'm moving on to {{DirectoryReader}}. When that is done, I'll make a preliminary patch. For easy experimentation, I could try and make {{the new Sort(new SortField("myfield", new Locale("da")));}} use the new code. Sorting 10M terms still takes about 6 minutes on my machine. I've profiled the code and about 40% of the time is spend on requesting terms (with a cache of 20000 terms, locale da, using a conventional harddisk). There's room for optimization by smarter caching of terms and of course using a faster Collator than Java's default. Using simple {{String.compareTo}}, which leads to nearly sequential access to terms, takes 44 seconds for the 10M terms. The key to real improvement seems to be a better exploitation of the observation that "Unicode order is fairly aligned to most Locale-based orders". It is already working somewhat as dumb in-memory merge-sort of the 10M Strings takes about 10 minutes. I'll save that for later though. Now, for merges there is some decisions to make. The primary one being how to handle terms that are the same for different segments. For the first take, I'll avoid merging such terms and thus allow the iterator to deliver the same term more than once for {{DirectoryReader}}s, but with different ordinals. This should work fine for direct building of a docID->sortedOrdinalIndex map, usable for sorting, which is the primary use-case. If the structure is to be used for faceting, the terms instances needs to be collapsed. Luckily the logic for doing so resides at the code that used the iterator, rather than internally in the readers. Using Exposed for memory-efficient faceting should still be feasible with the general code. If the order of the tags in the facet is not significant, the Unicode-order-comparator can be used, making it very fast (1-2 minutes / 10M unique terms) to build the structure for a given field. ...But I digress and will try and get back on track with the real code. > optimization: when sorting by field, if index has one segment and field > values are not needed, do not load String[] into field cache > ------------------------------------------------------------------------------------------------------------------------------------ > > Key: LUCENE-2335 > URL: https://issues.apache.org/jira/browse/LUCENE-2335 > Project: Lucene - Java > Issue Type: Improvement > Components: Search > Reporter: Michael McCandless > Priority: Minor > Fix For: 3.1 > > > Spinoff from java-dev thread "Sorting with little memory: A suggestion", > started by Toke Eskildsen. > When sorting by SortField.STRING we currently ask FieldCache for a > StringIndex on that field. > This can consumes tons of RAM, when the values are mostly unique (eg a title > field), as it populates both int[] ords as well as String[] values. > But, if the index is only one segment, and the search sets fillFields=false, > we don't need the String[] values, just the int[] ords. If the app needs to > show the fields it can pull them (for the 1 page) from stored fields. > This can be a potent optimization -- alot of RAM saved -- for optimized > indexes. > When fixing this we must take care to share the int[] ords if some queries do > fillFields=true and some =false... ie, FieldCache will be called twice and it > should share the int[] ords across those invocations. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org