I've been tinkering with the SegmentReader and tried to make a sorter that keeps track of ordinals for terms instead of the terms themselves. I call it Exposed, as it exposes inner details.
A detailed description of the idea can be found at http://sbdevel.wordpress.com/2010/03/19/string-sorting-in-lucene-without-the-memory-penalty/ but the important points are below. The idea is simple enough: Make an array of ordinals for the sort-terms, sort them with a Collator that performs a lookup of the actual Strings. Using that array of ordinals, another array going from docID to index in the sorted ordinals array is created. When a sort is performed, a comparison of the entries in the docID->sorted ordinals array is enough to determine the order. The memory savings compared to the traditional approach of holding the terms in memory are substantial. When the structure has been build, it takes ~5MB to hold sorting information for 1M docs containing 1M unique terms in the sort field, ~55MB for 10M (using the PackedInts from LUCENE-1990). Temporary memory use for building is around 2 * #docs * 4 bytes for merge sort + cruft. In theory. Also in theory, this could be closer to #docs * 4 bytes. In reality I haven't had the time to switch Integers to ints in the sorter, so it's quite a bit more. As all sorting is done upfront, the startup time is non-trivial. For ~10M sort terms it takes 6 minutes on my i7-machine using a conventional harddisk. After that, sorted searches are a factor 10 faster than sorting by String-compare for ~1M hits (top 20 displayed). To kick things off, I've created a Lucene-fork at http://github.com/tokee/lucene/ with the tag lucene_3_0_exposed. It contains a proof of concept that can be used on an optimized Lucene index. It can be tested by running java-cp /home/te/projects/lucene/build/lucene-core-3.1-dev.jar org.apache.lucene.index.ExposedPOC. To me, the trade-offs seems to be new Sort(new SortField(field, locale)) * Fast startup * High memory consumption * Slow sorted searches (Collator.compare) * Instantaneous String delivery Exposed * Slow startup (this is per-segment, so re-open should work well) * Low temporary and very low permanent memory consumption * Fast sorted searches (integer compare) * Fast String delivery (Term-lookup by ordinal) As memory-problems with sorting is a recurring topic on the user-list, having an easy alternative to buying more RAM would be nice. I would like to hear if Exposed sounds like a feasible idea to the more seasoned Lucene developers. Regards, Toke Eskildsen --------------------------------------------------------------------- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org