Michael Nebel wrote:
The IndexOptimizer uses a different approach. If I read the code right,
it takes all terms with an IDF under a certain threshold and reduces
their entries, so the total number of documents considered for a search
changes. With the default configuration only about 10% of the terms stay
in the index, so the answer to the query "http" gets (much) smaller.

What I still do not know: yes, a smaller index makes the system much faster, but at what price does that come? Which numbers make sense?
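
In sketch form, the pruning described above looks roughly like this (a toy in-memory index, not the actual IndexOptimizer code; the threshold and keep fraction are made-up parameters):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class IdfPruneSketch {

    // Prune a toy in-memory index: terms whose IDF falls below
    // idfThreshold (i.e. very frequent terms) keep only the first
    // keepFraction of their posting entries.
    static Map<String, List<Integer>> prune(Map<String, List<Integer>> index,
                                            int numDocs,
                                            double idfThreshold,
                                            double keepFraction) {
        Map<String, List<Integer>> pruned = new HashMap<>();
        for (Map.Entry<String, List<Integer>> e : index.entrySet()) {
            List<Integer> postings = e.getValue();
            double idf = Math.log((double) numDocs / postings.size());
            if (idf < idfThreshold) {
                int keep = (int) Math.ceil(postings.size() * keepFraction);
                pruned.put(e.getKey(), new ArrayList<>(postings.subList(0, keep)));
            } else {
                pruned.put(e.getKey(), postings);
            }
        }
        return pruned;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> index = new HashMap<>();
        index.put("http", Arrays.asList(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)); // frequent
        index.put("nutch", Arrays.asList(3, 7));                        // rare
        // With 10 docs, "http" has IDF log(10/10) = 0 and gets truncated.
        System.out.println(prune(index, 10, 1.0, 0.1));
    }
}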

IndexOptimizer was part of a never-completed attempt to implement a technique somewhat related to what Torsten Suel describes in his "Optimized Query Execution in Large Search Engines with Global Page Ordering":

http://cis.poly.edu/suel/papers/order.pdf

A majority of search time is spent considering low-scoring documents for frequent terms, documents which rarely appear in hit lists. Suel re-sorts document lists in the index by a document score, then simply stops searching once a certain number of matches are found. In theory a higher-scoring match could still be found after this point, one with, e.g., very large TF values, but in practice this happens rarely.
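
In sketch form, the early-termination part looks roughly like this (the Posting class, the pre-sorted input, and the maxHits cutoff are all illustrative assumptions, not Suel's or Lucene's actual code):

import java.util.ArrayList;
import java.util.List;

public class EarlyTerminationSketch {

    // One posting: a document id plus its term frequency.
    static class Posting {
        final int doc;
        final int tf;
        Posting(int doc, int tf) { this.doc = doc; this.tf = tf; }
    }

    // Scan a posting list assumed to be pre-sorted by descending global
    // document score, stopping as soon as maxHits matches are collected.
    static List<Integer> search(List<Posting> postings, int maxHits) {
        List<Integer> hits = new ArrayList<>();
        for (Posting p : postings) {
            hits.add(p.doc);
            if (hits.size() >= maxHits) {
                // Early termination: the remaining documents have lower
                // global scores; a very high-TF match could still lurk
                // there, but in practice that is rare.
                break;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Posting> postings = new ArrayList<>();
        postings.add(new Posting(42, 3)); // highest global score first
        postings.add(new Posting(7, 1));
        postings.add(new Posting(99, 5)); // high TF but low global score
        System.out.println(search(postings, 2)); // prints [42, 7]
    }
}

Note how document 99, with the largest TF, is missed because its global score put it past the cutoff; that is exactly the rare case mentioned above.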

At this point I don't think it's worth describing in more detail how IndexOptimizer fit into this. Rather, it would now be better simply to write something that could sort a Lucene index so that document numbers increase with some document scoring function. Or, alternatively, to sort documents prior to creating the Lucene index.
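
For the second option, a hedged sketch of what pre-sorting might look like, assuming highest-scoring documents should be indexed first so that they get the lowest document numbers (the Doc class, the score values, and the commented-out indexing call are all hypothetical):

import java.util.Arrays;
import java.util.Comparator;

public class PreSortSketch {

    // Illustrative document holder; the score could come from a
    // link-analysis measure or any other global scoring function.
    static class Doc {
        final String url;
        final float score;
        Doc(String url, float score) { this.url = url; this.score = score; }
    }

    public static void main(String[] args) {
        Doc[] docs = {
            new Doc("http://a.example/", 0.2f),
            new Doc("http://b.example/", 0.9f),
            new Doc("http://c.example/", 0.5f),
        };
        // Highest-scoring documents first: added to the index in this
        // order, they receive the lowest Lucene document numbers.
        Arrays.sort(docs, Comparator.comparingDouble((Doc d) -> d.score).reversed());
        for (Doc d : docs) {
            // writer.addDocument(toLuceneDoc(d));  // hypothetical indexing call
            System.out.println(d.url + " " + d.score);
        }
    }
}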

Doug

