I've been tinkering with the SegmentReader and tried to make a sorter
that keeps track of ordinals for terms instead of the terms themselves.
I call it Exposed, as it exposes inner details.

A detailed description of the idea can be found at
http://sbdevel.wordpress.com/2010/03/19/string-sorting-in-lucene-without-the-memory-penalty/
but the important points are below.


The idea is simple enough: Make an array of ordinals for the sort-terms,
sort them with a Collator that performs a lookup of the actual Strings.
Using that array of ordinals, another array going from docID to index in
the sorted ordinals array is created.

When a sort is performed, a comparison of the entries in the
docID->sorted ordinals array is enough to determine the order.

The memory savings compared to the traditional approach of holding the
terms in memory are substantial. When the structure has been build, it
takes ~5MB to hold sorting information for 1M docs containing 1M unique
terms in the sort field, ~55MB for 10M (using the PackedInts from
LUCENE-1990).

Temporary memory use for building is around 2 * #docs * 4 bytes for
merge sort + cruft. In theory. Also in theory, this could be closer to
#docs * 4 bytes. In reality I haven't had the time to switch Integers to
ints in the sorter, so it's quite a bit more.

As all sorting is done upfront, the startup time is non-trivial.
For ~10M sort terms it takes 6 minutes on my i7-machine using a
conventional harddisk. After that, sorted searches are a factor 10
faster than sorting by String-compare for ~1M hits (top 20 displayed).


To kick things off, I've created a Lucene-fork at 
http://github.com/tokee/lucene/ with the tag lucene_3_0_exposed.
It contains a proof of concept that can be used on an optimized Lucene
index. It can be tested by running
java-cp /home/te/projects/lucene/build/lucene-core-3.1-dev.jar
org.apache.lucene.index.ExposedPOC.


To me, the trade-offs seems to be

new Sort(new SortField(field, locale))
 * Fast startup
 * High memory consumption
 * Slow sorted searches (Collator.compare)
 * Instantaneous String delivery

Exposed
 * Slow startup (this is per-segment, so re-open should work well)
 * Low temporary and very low permanent memory consumption
 * Fast sorted searches (integer compare)
 * Fast String delivery (Term-lookup by ordinal)


As memory-problems with sorting is a recurring topic on the user-list,
having an easy alternative to buying more RAM would be nice. I would
like to hear if Exposed sounds like a feasible idea to the more seasoned
Lucene developers.

Regards,
Toke Eskildsen


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org

Reply via email to