Michael Nebel wrote:
The IndexOptimizer uses a different approach. If I read the code right,
it takes all terms with an idf below a certain threshold and reduces
their entries, so the total number of documents returned for a search
changes. With the default configuration, only about 10% of the terms
stay in the index, so the answer to the query "http" gets (much)
smaller.
What I still do not know: yes, a smaller index makes the system much
faster, but at what price does that come? Which numbers make sense?
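For concreteness, here is one plausible reading of that pruning as a
minimal sketch (the class names and the assumption that entries are
pre-sorted by a per-document score are mine, not Nutch's actual code):

    import java.util.ArrayList;
    import java.util.List;

    class TermPostings {
        String term;
        float idf;               // inverse document frequency of the term
        List<Integer> entries;   // posting list, assumed sorted by
                                 // descending per-document score
    }

    class PruneSketch {
        // For frequent terms (idf below the threshold), keep only the
        // best-scoring fraction of entries; rare terms stay untouched.
        static void prune(List<TermPostings> index,
                          float idfThreshold, float keepFraction) {
            for (TermPostings t : index) {
                if (t.idf < idfThreshold) {
                    int keep = (int) (t.entries.size() * keepFraction);
                    t.entries = new ArrayList<Integer>(
                        t.entries.subList(0, keep));
                }
            }
        }
    }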
IndexOptimizer was part of a never-completed attempt to implement a
technique somewhat related to what Torsten Suel describes in his
"Optimized Query Execution in Large Search Engines with Global Page
Ordering":
http://cis.poly.edu/suel/papers/order.pdf
A majority of search time is spent considering low-scoring documents for
frequent terms, documents which rarely appear in hit lists. Suel
re-sorts document lists in the index by a document score, then simply
stops searching once a certain number of matches are found. In theory a
higher-scoring match could still be found after this point, one with,
e.g., very large TF values, but in practice this happens rarely.
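To make the idea concrete, here is a minimal sketch of that
early-termination scan, assuming a postings list already re-sorted by
descending document score (the Posting class is illustrative, not
Lucene's API):

    import java.util.ArrayList;
    import java.util.List;

    class Posting {
        int doc;   // document number
        int tf;    // term frequency
        Posting(int doc, int tf) { this.doc = doc; this.tf = tf; }
    }

    class EarlyTermination {
        // Collect at most maxHits matches from a postings list whose
        // documents arrive in order of decreasing global score. Because
        // the best documents come first, stopping early rarely misses a
        // high-scoring match (very large TF values are the exception).
        static List<Integer> search(List<Posting> postings, int maxHits) {
            List<Integer> hits = new ArrayList<Integer>();
            for (Posting p : postings) {
                hits.add(p.doc);
                if (hits.size() >= maxHits) {
                    break;   // stop once enough matches are found
                }
            }
            return hits;
        }
    }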
At this point I don't think it's worth describing how IndexOptimizer
fit into this in more detail. Rather, it would be better now to simply
write something that could sort a Lucene index so that document numbers
increase with some document scoring function. Or, alternatively, to
sort documents prior to creating the Lucene index.
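A rough sketch of that second option, sorting by a global page score
before the documents ever reach the index writer (ScoredPage and
addToIndex are hypothetical stand-ins, not existing Nutch or Lucene
classes):

    import java.util.Arrays;
    import java.util.Comparator;

    class ScoredPage {
        String url;
        float score;   // global page score, e.g. from link analysis
        ScoredPage(String url, float score) {
            this.url = url;
            this.score = score;
        }
    }

    class PreSortIndexer {
        static void indexAll(ScoredPage[] pages) {
            // Highest-scoring pages first, so they receive the lowest
            // Lucene document numbers.
            Arrays.sort(pages, new Comparator<ScoredPage>() {
                public int compare(ScoredPage a, ScoredPage b) {
                    return Float.compare(b.score, a.score);
                }
            });
            for (ScoredPage p : pages) {
                // Hypothetical: build a Lucene Document for p and hand
                // it to an IndexWriter so doc numbers follow sort order.
                addToIndex(p);
            }
        }

        static void addToIndex(ScoredPage p) {
            // ... create a Lucene Document for p and add it ...
        }
    }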
Doug