nutch-dev  

Re: IndexOptimizer bug?

Doug Cutting
Thu, 04 Aug 2005 08:47:26 -0700

Michael Nebel wrote:
The IndexOptimizer uses a different approach. If I read the code right,
it takes all terms with an idf under a special threshold and reduces the
entries. So the total number of documents for a search changes. With the
default configuration only about 10% of the terms stay in the index. So
the answer to the query "http" gets (much) smaller.

What I still do not know: yes, a smaller index makes the system much faster, but at what price? Which numbers make sense?

IndexOptimizer was part of a never-completed attempt to implement a technique somewhat related to what Torsten Suel describes in his paper "Optimized Query Execution in Large Search Engines with Global Page Ordering":

http://cis.poly.edu/suel/papers/order.pdf

A majority of search time is spent considering low-scoring documents for frequent terms, documents which rarely appear in hit lists. Suel re-sorts document lists in the index by a document score, then simply stops searching once a certain number of matches are found. In theory a higher-scoring match could still be found after this point, one with, e.g., very large TF values, but in practice this happens rarely.
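
To make the trade-off concrete, here is a minimal sketch of score-ordered postings with early termination. It is not Nutch or Lucene code: the names DOC_SCORE, TERM_TF, build_posting_list and search are hypothetical, the data is made up, and the ranking formula (document score times TF) is a toy stand-in for a real scoring function.

```python
from typing import Dict, List, Tuple

# Hypothetical toy data: a static, query-independent score per document.
DOC_SCORE: Dict[int, float] = {0: 0.2, 1: 0.9, 2: 0.5, 3: 0.7, 4: 0.1}

# Term frequencies of one frequent term, keyed by document id.
TERM_TF: Dict[int, int] = {0: 1, 1: 2, 2: 1, 3: 3, 4: 40}


def build_posting_list(tf: Dict[int, int],
                       doc_score: Dict[int, float]) -> List[Tuple[int, int]]:
    """Order postings by descending document score instead of by doc id."""
    return sorted(tf.items(), key=lambda p: doc_score[p[0]], reverse=True)


def search(postings: List[Tuple[int, int]],
           doc_score: Dict[int, float],
           max_matches: int) -> List[Tuple[int, float]]:
    """Stop after `max_matches` postings, then rank only what was seen."""
    hits = []
    for doc_id, tf in postings[:max_matches]:
        # Toy ranking formula (an assumption): static score times TF.
        hits.append((doc_id, doc_score[doc_id] * tf))
    return sorted(hits, key=lambda h: h[1], reverse=True)


postings = build_posting_list(TERM_TF, DOC_SCORE)
truncated = search(postings, DOC_SCORE, max_matches=3)
exhaustive = search(postings, DOC_SCORE, max_matches=len(postings))
```

With this data the truncated search only ever looks at documents 1, 3 and 2, while the exhaustive search ranks document 4 first because of its very large TF. That is the rare miss described above, traded for scanning three postings instead of five.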

At this point I don't think it's worth describing in more detail how IndexOptimizer fit into this. Rather, it would be better now to simply write something that could sort a Lucene index so that document numbers increase with some document scoring function. Or, alternatively, to sort documents prior to creating the Lucene index.
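
The second option, sorting before indexing, could look roughly like the sketch below. It works on a plain dictionary standing in for an inverted index, not on a real Lucene index, and renumber_by_score and remap_postings are hypothetical helper names. It also assumes that the best-scoring documents should receive the lowest numbers so that they are visited first; pass best_first=False for the opposite convention.

```python
from typing import Dict, List


def renumber_by_score(doc_score: Dict[int, float],
                      best_first: bool = True) -> Dict[int, int]:
    """Map old doc numbers to new ones that follow the score ordering."""
    ordered = sorted(doc_score, key=lambda d: doc_score[d], reverse=best_first)
    return {old: new for new, old in enumerate(ordered)}


def remap_postings(index: Dict[str, List[int]],
                   old_to_new: Dict[int, int]) -> Dict[str, List[int]]:
    """Rewrite every posting list with the new numbers, kept in doc order."""
    return {term: sorted(old_to_new[d] for d in docs)
            for term, docs in index.items()}


# Made-up example data.
scores = {0: 0.2, 1: 0.9, 2: 0.5}
index = {"http": [0, 1, 2], "nutch": [0, 2]}

mapping = renumber_by_score(scores)
sorted_index = remap_postings(index, mapping)
```

After renumbering, an ordinary scan in increasing document number visits documents in score order, so the search loop itself needs no change beyond a cutoff.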

Doug