Re: Lucene performance bottlenecks

Doug Cutting Thu, 08 Dec 2005 09:59:48 -0800

Doug Cutting wrote:

Implementing something like this for Lucene would not be too difficult. The index would need to be re-sorted by document boost: documents would be re-numbered so that highly-boosted documents had low document numbers.


In particular, one could:

1. Create an array of int[maxDoc], with a[i] = i.
2. Sort the array with order(i,j) = boost(i) - boost(j);

3. Implement a FilterIndexReader that re-numbers using the sorted array. So, for example, the document numbers in the TermPositions will be a[old.doc()]. Each term's positions will need to be read entirely into memory and sorted to perform this renumbering.
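
The three steps above can be sketched in plain Java, without any Lucene classes. This is a minimal illustration under stated assumptions, not the FilterIndexReader itself: the class and method names are hypothetical, boosts are assumed to be available as a float[] indexed by old document number, and only document numbers are renumbered (positions within a document are left out). The sort places higher boosts first, and the sketch keeps both directions of the mapping explicit, because the sorted array lists old numbers in new order while renumbering a posting needs the lookup from old to new.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Lucene.
final class BoostRenumbering {

    /** Returns newToOld: newToOld[n] is the old doc number placed at new number n. */
    static int[] sortByBoost(float[] boost) {
        Integer[] ids = new Integer[boost.length];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = i;                       // step 1: a[i] = i
        }
        // Step 2: stable sort, highest boost first, so highly-boosted
        // documents end up with low new document numbers.
        Arrays.sort(ids, (x, y) -> Float.compare(boost[y], boost[x]));
        int[] newToOld = new int[ids.length];
        for (int n = 0; n < ids.length; n++) {
            newToOld[n] = ids[n];
        }
        return newToOld;
    }

    /** Inverts the permutation: oldToNew[o] is the new number of old doc o. */
    static int[] invert(int[] newToOld) {
        int[] oldToNew = new int[newToOld.length];
        for (int n = 0; n < newToOld.length; n++) {
            oldToNew[newToOld[n]] = n;
        }
        return oldToNew;
    }

    /** Step 3 for one term: map every posting, then re-sort in memory. */
    static int[] renumberPostings(int[] oldDocs, int[] oldToNew) {
        int[] out = new int[oldDocs.length];
        for (int k = 0; k < oldDocs.length; k++) {
            out[k] = oldToNew[oldDocs[k]];
        }
        Arrays.sort(out);                     // postings must stay in doc order
        return out;
    }
}
```

The in-memory sort in renumberPostings is the cost mentioned in step 3: a term's postings cannot be streamed, since the new numbers arrive out of order.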

The IndexOptimizer.java class in the searcher package was an old attempt to create something like what Suel calls "fancy postings". It creates an index with the top 10% scoring postings. Since documents are not renumbered one can intermix postings from this with the full index. So for example, one can first try searching using this index for terms that occur more than, e.g., 10k times, and use the full index for rarer words. If that does not find 1000 hits then the full index must be searched. Such an approach can be combined with using a pre-sorted index.
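
The two-pass strategy described above might look roughly like the following. This is only a sketch: the Backend interface and its methods are hypothetical stand-ins, not IndexOptimizer or any Lucene API, and the thresholds (10k occurrences, 1000 hits) are passed in as parameters rather than fixed.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not part of Lucene.
final class TieredSearch {

    /** Assumed search backend that can read each term from either index. */
    interface Backend {
        /** Number of documents containing the term. */
        int docFreq(String term);

        /** Runs the query, reading the pruned postings for terms in
         *  prunedTerms and the full postings for all other terms. */
        List<Integer> search(List<String> terms, Set<String> prunedTerms);
    }

    static List<Integer> search(Backend backend, List<String> terms,
                                int frequentThreshold, int wantedHits) {
        // First pass: frequent terms use the pruned (top-scoring) postings,
        // rarer terms use the full index.
        Set<String> frequent = new HashSet<>();
        for (String term : terms) {
            if (backend.docFreq(term) > frequentThreshold) {
                frequent.add(term);
            }
        }
        List<Integer> hits = backend.search(terms, frequent);
        if (hits.size() >= wantedHits || frequent.isEmpty()) {
            return hits;   // enough hits, or the first pass was already full
        }
        // Second pass: too few hits, so fall back to the full index throughout.
        return backend.search(terms, Collections.<String>emptySet());
    }
}
```

Mixing the two sources within one query relies on the point made above: both indexes share the same document numbering.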

I think the first thing to implement would be something like what Suel calls first-1000. Then we need to evaluate this and determine, for a query log, how different the results are.
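
One simple way to quantify "how different" is the fraction of the exact top-k results that the approximate search also returns in its top k. The measure and the names below are an assumption chosen for illustration, not something prescribed here; result lists are assumed to hold distinct document numbers in rank order.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical evaluation helper, not part of Lucene.
final class ResultOverlap {

    /** Fraction of the exact top-k documents that also appear in the
     *  approximate top-k. Returns 1.0 when the exact list is empty. */
    static double overlapAtK(List<Integer> approx, List<Integer> exact, int k) {
        Set<Integer> exactTop =
            new HashSet<>(exact.subList(0, Math.min(k, exact.size())));
        if (exactTop.isEmpty()) {
            return 1.0;
        }
        int shared = 0;
        for (Integer doc : approx.subList(0, Math.min(k, approx.size()))) {
            if (exactTop.contains(doc)) {
                shared++;
            }
        }
        return (double) shared / exactTop.size();
    }
}
```

Averaging this value over every query in the log would give a single figure for comparing the truncated search against the full one.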

Then a HitCollector can simply stop searching once a given number of hits are found.
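
A rough sketch of such a collector follows. It does not extend Lucene's HitCollector, so that it stays self-contained. It only mirrors the shape of a collect(doc, score) callback, and the class names are hypothetical. It also assumes that the only way to cut the search short from inside the callback is to throw an unchecked exception that the caller catches.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a collector, not Lucene's HitCollector.
final class FirstNCollector {

    /** Control-flow signal: enough hits were gathered. No stack trace is
     *  recorded, since this is not an error. */
    static final class LimitReached extends RuntimeException {
        LimitReached() {
            super("hit limit reached", null, false, false);
        }
    }

    private final int limit;
    private final List<Integer> docs = new ArrayList<>();

    FirstNCollector(int limit) {
        if (limit < 1) {
            throw new IllegalArgumentException("limit must be positive");
        }
        this.limit = limit;
    }

    /** Called once per matching document, in document-number order. */
    void collect(int doc, float score) {
        docs.add(doc);
        if (docs.size() >= limit) {
            throw new LimitReached();   // abort the rest of the search
        }
    }

    List<Integer> docs() {
        return docs;
    }

    /** Stand-in for the search loop: feeds matches until the collector stops. */
    static List<Integer> firstN(int[] matchingDocs, int limit) {
        FirstNCollector collector = new FirstNCollector(limit);
        try {
            for (int doc : matchingDocs) {
                collector.collect(doc, 1.0f);
            }
        } catch (LimitReached stop) {
            // Expected: the collector has all the hits it asked for.
        }
        return collector.docs();
    }
}
```

With an index pre-sorted by boost, the first hits collected are the ones with the highest boosts, which is what makes stopping early reasonable.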
Doug
