Michael Nebel wrote:
The IndexOptimizer uses a different approach. If I read the code right,
it takes all terms with an idf below a certain threshold and reduces
their entries, so the total number of documents returned for a search
changes. With the default configuration, only about 10% of the terms
stay in the index, so the answer to the query "http" gets (much)
smaller.
What I still do not know: yes, a smaller index makes the system much
faster, but at what price does that come? Which numbers make sense?
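For concreteness, here is one plausible reading of that pruning as a
minimal sketch (the class names and the assumption that entries are
pre-sorted by a per-document score are mine, not Nutch's actual code):

    import java.util.ArrayList;
    import java.util.List;

    class TermPostings {
        String term;
        float idf;               // inverse document frequency of the term
        List<Integer> entries;   // posting list, assumed sorted by
                                 // descending per-document score
    }

    class PruneSketch {
        // For frequent terms (idf below the threshold), keep only the
        // best-scoring fraction of entries; rare terms stay untouched.
        static void prune(List<TermPostings> index,
                          float idfThreshold, float keepFraction) {
            for (TermPostings t : index) {
                if (t.idf < idfThreshold) {
                    int keep = (int) (t.entries.size() * keepFraction);
                    t.entries = new ArrayList<Integer>(
                        t.entries.subList(0, keep));
                }
            }
        }
    }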
IndexOptimizer was part of a never-completed attempt to implement a
technique somewhat related to what Torsten Suel describes in his
"Optimized Query Execution in Large Search Engines with Global Page
Ordering":
http://cis.poly.edu/suel/papers/order.pdf
A majority of search time is spent considering low-scoring documents for
frequent terms, documents which rarely appear in hit lists. Suel
re-sorts document lists in the index by a document score, then simply
stops searching once a certain number of matches are found. In theory a
higher-scoring match could still be found after this point, one with,
e.g., very large TF values, but in practice this happens rarely.
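To make the idea concrete, here is a minimal sketch of that
early-termination scan, assuming a postings list already re-sorted by
descending document score (the Posting class is illustrative, not
Lucene's API):

    import java.util.ArrayList;
    import java.util.List;

    class Posting {
        int doc;   // document number
        int tf;    // term frequency
        Posting(int doc, int tf) { this.doc = doc; this.tf = tf; }
    }

    class EarlyTermination {
        // Collect at most maxHits matches from a postings list whose
        // documents arrive in order of decreasing global score. Because
        // the best documents come first, stopping early rarely misses a
        // high-scoring match (very large TF values are the exception).
        static List<Integer> search(List<Posting> postings, int maxHits) {
            List<Integer> hits = new ArrayList<Integer>();
            for (Posting p : postings) {
                hits.add(p.doc);
                if (hits.size() >= maxHits) {
                    break;   // stop once enough matches are found
                }
            }
            return hits;
        }
    }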
At this point I don't think it's worth describing how IndexOptimizer
fit into this in more detail. Rather, it would be better now to simply
write something that could sort a Lucene index so that document numbers
increase with some document scoring function. Or, alternatively, to
sort documents prior to creating the Lucene index.
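A rough sketch of that second option, sorting by a global page score
before the documents ever reach the index writer (ScoredPage and
addToIndex are hypothetical stand-ins, not existing Nutch or Lucene
classes):

    import java.util.Arrays;
    import java.util.Comparator;

    class ScoredPage {
        String url;
        float score;   // global page score, e.g. from link analysis
        ScoredPage(String url, float score) {
            this.url = url;
            this.score = score;
        }
    }

    class PreSortIndexer {
        static void indexAll(ScoredPage[] pages) {
            // Highest-scoring pages first, so they receive the lowest
            // Lucene document numbers.
            Arrays.sort(pages, new Comparator<ScoredPage>() {
                public int compare(ScoredPage a, ScoredPage b) {
                    return Float.compare(b.score, a.score);
                }
            });
            for (ScoredPage p : pages) {
                // Hypothetical: build a Lucene Document for p and hand
                // it to an IndexWriter so doc numbers follow sort order.
                addToIndex(p);
            }
        }

        static void addToIndex(ScoredPage p) {
            // ... create a Lucene Document for p and add it ...
        }
    }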
Doug