Doug Cutting
Thu, 04 Aug 2005 08:47:26 -0700
Michael Nebel wrote:
The IndexOptimizer uses a different approach. If I read the code right, it takes all terms with an idf under a special threshold and reduces the entries. So the total number of documents for a search changes. With the default configuration only about 10% of the terms stay in the index. So the answer to the query "http" gets (much) smaller. What I still do not know: yes, a smaller index makes the system much faster. But at what price does it come? Which numbers make sense?
IndexOptimizer was part of a never-completed attempt to implement a technique somewhat related to what Torsten Suel describes in his "Optimized Query Execution in Large Search Engines with Global Page Ordering":
http://cis.poly.edu/suel/papers/order.pdf
A majority of search time is spent considering low-scoring documents for frequent terms, documents which rarely appear in hit lists. Suel re-sorts document lists in the index by a document score, then simply stops searching once a certain number of matches are found. In theory a higher-scoring match could still be found after this point, one with, e.g., very large TF values, but in practice this happens rarely.
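To make the early-termination idea concrete, here is a minimal sketch in Python. It is neither Suel's actual algorithm nor Lucene code. It assumes each posting list holds document numbers in ascending order and that documents were numbered so the globally best document is 0. The names search_with_early_stop, postings and max_hits are hypothetical.

```python
from typing import Dict, List


def search_with_early_stop(postings: Dict[str, List[int]],
                           terms: List[str],
                           max_hits: int) -> List[int]:
    """AND-query over posting lists sorted by ascending document number.

    Assumption: document number 0 is the globally best-scoring document,
    so the first matches found are also the best ones by that global score.
    The scan stops as soon as max_hits matches have been collected.
    """
    lists = [postings.get(term, []) for term in terms]
    if not lists or any(len(plist) == 0 for plist in lists):
        return []

    positions = [0] * len(lists)
    hits: List[int] = []
    while len(hits) < max_hits:
        # Stop if any posting list is exhausted: no further match is possible.
        if any(positions[i] >= len(lists[i]) for i in range(len(lists))):
            break
        current = [lists[i][positions[i]] for i in range(len(lists))]
        target = max(current)
        if all(doc == target for doc in current):
            # Every list is positioned on the same document: a match.
            hits.append(target)
            positions = [p + 1 for p in positions]
        else:
            # Advance every list that is still behind the largest candidate.
            for i, doc in enumerate(current):
                if doc < target:
                    positions[i] += 1
    return hits


example_postings = {
    "http": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    "lucene": [1, 4, 7, 9],
}
top_two = search_with_early_stop(example_postings, ["http", "lucene"], 2)
```

With max_hits set to 2, the scan stops at document 4 and never looks at the rest of the long "http" list. The trade-off is the one described above: a later document with a very large TF is never seen.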
At this point I don't think it's worth describing how IndexOptimizer fit into this in more detail. Rather it would be better now to simply write something that could sort a Lucene index so that document numbers increase with some document scoring function. Or, alternately, to sort documents prior to creating the Lucene index.
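The second option, sorting documents before the index is built, could be sketched roughly as follows. This is plain Python over a toy in-memory index, not the Lucene API. The function renumber_by_score and the docs and score inputs are hypothetical. The sketch also assumes the direction of the ordering: the best-scoring document receives the lowest number, so that scanning in document-number order visits high-scoring documents first.

```python
from typing import Dict, List, Tuple


def renumber_by_score(docs: Dict[str, List[str]],
                      score: Dict[str, float]
                      ) -> Tuple[Dict[str, int], Dict[str, List[int]]]:
    """Assign document numbers in order of descending score, then build
    posting lists over the new numbers.

    docs maps an original document key to its list of terms; score maps
    the same keys to a query-independent document score (hypothetical).
    """
    # Best-scoring document first; ties keep their original relative order.
    order = sorted(docs, key=lambda key: score[key], reverse=True)
    new_number = {key: number for number, key in enumerate(order)}

    postings: Dict[str, List[int]] = {}
    for key in order:
        # Documents are visited in new-number order, so each posting list
        # comes out sorted by ascending document number.
        for term in set(docs[key]):
            postings.setdefault(term, []).append(new_number[key])
    return new_number, postings


example_docs = {
    "a": ["http", "lucene"],
    "b": ["http"],
    "c": ["http", "nutch"],
}
example_score = {"a": 0.2, "b": 0.9, "c": 0.5}
numbers, sorted_postings = renumber_by_score(example_docs, example_score)
```

Here "b" has the highest score and becomes document 0, followed by "c" and then "a". Every posting list is then already in global score order, which is what a stop-after-N-matches search needs.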
Doug