[ https://issues.apache.org/jira/browse/LUCENE-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905507#action_12905507 ]

Toke Eskildsen commented on LUCENE-2369:
----------------------------------------

{quote}
They are just regular terms! You can do a TermQuery on them, sort them as byte[], etc. It's just that the bytes use 'collation encoding' instead of 'utf-8 encoding'.
{quote}

Yes, you have stated that repeatedly. As you are unwilling to answer my questions, I've tweaked my tests to see for myself. It worked very well, and I learned something new. To answer one of the questions: yes, the terms are loaded into memory when a sort is performed on a field. This is the case for the locale-based sort (yes, I have understood that you do not consider that a viable form of sorting; I only mention it for completeness), for STRING sorting (natural order with ordinal-based speedup) and for STRING_VAL sorting (natural order without the ordinal speedup). All this against Lucene trunk, no patches.
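For completeness, 'collation encoding' here simply means indexing the bytes of a collation key rather than the UTF-8 bytes of the term, so that a plain byte[] comparison yields locale-correct order. A minimal, self-contained illustration using the JDK Collator (the locale and strings are arbitrary examples; ICU keys are typically more compact than the JDK ones):

{code:java}
import java.nio.charset.StandardCharsets;
import java.text.Collator;
import java.util.Locale;

public class CollationKeyDemo {
    public static void main(String[] args) {
        // Example only: a collator for the locale we want to sort by.
        Collator collator = Collator.getInstance(new Locale("da", "DK"));
        collator.setStrength(Collator.TERTIARY);

        // The collation key bytes stand in for the UTF-8 bytes of the term.
        byte[] utf8Bytes = "æble".getBytes(StandardCharsets.UTF_8);
        byte[] collationKey = collator.getCollationKey("æble").toByteArray();

        // Sorting collation keys with a plain unsigned byte[] comparison gives
        // the same order as collator.compare() on the original strings.
        System.out.println("utf-8 bytes: " + utf8Bytes.length
                + ", collation key bytes: " + collationKey.length);
    }
}
{code}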
{quote}
This is why I want to factor out the whole 'locale' thing from the issue, since sorting is agnostic to what's in the byte[]. It's unrelated, and it would simplify the issue to just discuss that.
{quote}

For my new tests I've switched to natural order sorting (direct byte[] comparison, aka Lucene STRING sorted search without specifying a Locale). The tests should be fairly telling for the different scenarios, as the ICU keys should be about the same size as the original terms.

{quote}
* The APIs here are wrong anyway: it shouldn't take Locale but Collator. There is no way to set strength or any other options, and there's no way to supply a Collator I made myself (e.g. from RuleBasedCollator).
{quote}

I fully agree. The code I've made so far takes a Comparator<BytesRef>, with an optimization for wrapped Collators. The title of LUCENE-2369 is not technically correct, but was chosen because "Locale" is a fairly well-known concept while "Collator" is more complex. That might have been a mistake.
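To make the Comparator<BytesRef> part concrete, a rough sketch of such a wrapper could look as follows. This is only an illustration, not the code in the attached patch; the decode-to-String on every comparison is naive and is exactly the overhead the wrapped-Collator optimization is meant to avoid:

{code:java}
import java.text.Collator;
import java.util.Comparator;

import org.apache.lucene.util.BytesRef;

/**
 * Sketch only: adapts a java.text.Collator to the Comparator<BytesRef>
 * that the proposed sorter works on. Real code would avoid decoding the
 * term bytes to String per comparison, e.g. by comparing collation keys.
 */
public class CollatorBytesRefComparator implements Comparator<BytesRef> {
    private final Collator collator;

    public CollatorBytesRefComparator(Collator collator) {
        this.collator = collator;
    }

    @Override
    public int compare(BytesRef a, BytesRef b) {
        // Naive: decode the UTF-8 term bytes and let the Collator compare them.
        return collator.compare(a.utf8ToString(), b.utf8ToString());
    }
}
{code}

Natural order is then just the degenerate case where the comparator compares the raw bytes directly.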
Onwards to testing with natural order (new Sort(new SortField(myField, SortField.STRING))). No ZIPping in the background this time, so measurements will differ from the previous test. Heap sizes were measured after a call to System.gc().

2M document index, search hits 1M documents, top 10 hits extracted:
* Initial exposed search: 0:16 minutes
* Subsequent exposed searches: 40 ms
* Total heap usage for Lucene + exposed structure: 20 MB
* Initial default Lucene search: 0.8 s
* Subsequent default Lucene searches: 25 ms
* Total heap usage for Lucene + field cache: 60 MB

20M document index, search hits 10M documents, top 10 hits extracted:
* Initial exposed search: 2:53 minutes
* Subsequent exposed searches: 330 ms
* Total heap usage for Lucene + exposed structure: 154 MB
* Initial default Lucene search: 6 s
* Subsequent default Lucene searches: 220 ms
* Total heap usage for Lucene + field cache: 600 MB

200M document index, search hits 100M documents, top 10 hits extracted:
* Initial exposed search: 31:33 minutes
* Subsequent exposed searches: 3200 ms
* Total heap usage for Lucene + exposed structure: 1660 MB
* No data for default Lucene search, as it hit OOM with 7 GB of heap.

What did we learn from this?
* Natural order search in Lucene with STRING is very fast (as most of the work is ordinal comparison).
* Exposed sorting is actually slower than natural order (that's news to me). The culprit is a modified PackedInts structure. I'll look into that.
* The exposed structure build penalty for the hybrid approach (i.e. relying on natural order instead of doing an explicit sort) was indeed markedly lower than exposed with explicit sorting: a factor of 5. I would have expected it to be more, though.
* The hybrid approach uses less than a third of the RAM required by Lucene natural order sorting.

So, Robert, does this answer your challenge "if you have a way to improve memory usage for byte[] terms, lets look just at that?"?

> Locale-based sort by field with low memory overhead
> ----------------------------------------------------
>
>          Key: LUCENE-2369
>          URL: https://issues.apache.org/jira/browse/LUCENE-2369
>      Project: Lucene - Java
>   Issue Type: New Feature
>   Components: Search
>     Reporter: Toke Eskildsen
>     Priority: Minor
>  Attachments: LUCENE-2369.patch
>
> The current implementation of locale-based sort in Lucene uses the FieldCache,
> which keeps all sort terms in memory. Besides the huge memory overhead,
> searching requires comparison of terms with collator.compare every time,
> making searches with millions of hits fairly expensive.
> The proposed alternative implementation is to create a packed list of
> pre-sorted ordinals for the sort terms and a map from document IDs to entries
> in the sorted ordinals list. This results in very low memory overhead and
> faster sorted searches, at the cost of increased startup time. As the
> ordinals can be resolved to terms after the sorting has been performed, this
> approach supports fillFields=true.
> This issue is related to https://issues.apache.org/jira/browse/LUCENE-2335,
> which contains previous discussions on the subject.
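For readers who want a concrete picture of the structure described in the issue text above, the following toy sketch shows the two parts: the pre-sorted ordinals and the document-to-ordinal map. Plain arrays stand in for the packed (PackedInts-style) structures, a JDK Collator stands in for whatever Comparator<BytesRef> is supplied, one unique term per document is assumed, and all names are made up for the illustration; this is not the attached patch.

{code:java}
import java.text.Collator;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Locale;

public class ExposedSortSketch {
    public static void main(String[] args) {
        // One sort term per document; in Lucene these would come from the term dictionary.
        String[] termPerDoc = {"øl", "abe", "æble", "bil"};
        Collator collator = Collator.getInstance(new Locale("da", "DK"));

        // 1. Pre-sort the term ordinals once, using the collator (startup cost).
        Integer[] sortedOrdinals = {0, 1, 2, 3};
        Arrays.sort(sortedOrdinals,
                Comparator.comparing((Integer ord) -> collator.getCollationKey(termPerDoc[ord])));

        // 2. Build the map from document ID to position in the sorted ordinals list.
        //    In the real structure this would be packed, not a plain int[].
        int[] docToOrder = new int[termPerDoc.length];
        for (int position = 0; position < sortedOrdinals.length; position++) {
            docToOrder[sortedOrdinals[position]] = position;
        }

        // 3. Sorted search now compares plain ints instead of calling collator.compare per hit,
        //    and a position can be resolved back to its term via sortedOrdinals (fillFields=true).
        System.out.println("doc 0 ('" + termPerDoc[0] + "') sorts "
                + (docToOrder[0] < docToOrder[3] ? "before" : "after")
                + " doc 3 ('" + termPerDoc[3] + "')");
    }
}
{code}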