Andrzej Bialecki wrote:
For single-term queries (single-term in Nutch, that is - Lucene rewrites them
into complex BooleanQueries), the hit lists are nearly identical for the
first 10 hits, then they start to differ more and more as you progress
along the original hit list. This is not so surprising - after all, this
"optimization" operation is lossy. Still, the differences are higher
than it was reported in that paper by Suel (but they used a different
algorithm to select the postings) - Suel et al. were able to achieve 98%
accuracy for the top-10 results, _including_ multi-term boolean queries.
A better way to prune the index might be to look at the sum of
query-boosted scores from the content, title, url and anchor fields for
each term. One could process four TermEnums in parallel, one for each
field, and keep a document's postings for a term only if that sum places
the document in the top 10% for the term. But this is rather complex, and I am hopeful that a simpler
method may work better.
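
Something along these lines is what I mean - just a sketch, not working code.
The boosts and the class and method names below are placeholders, and for
brevity it walks only the content field's term dictionary; a real version
would march the four TermEnums in parallel so that terms occurring only in
title, url or anchor are not missed.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;

// Sketch only: for each term, sum a boosted tf over the four Nutch
// fields and keep the documents whose sum is in the top fraction.
public class CrossFieldPruneSketch {

  // Placeholder boosts; the real query-time boosts are configurable.
  private static final String[] FIELDS = {"content", "title", "url", "anchor"};
  private static final float[] BOOSTS = {1.0f, 3.0f, 3.0f, 2.0f};

  // For one term text, return the doc ids whose summed boosted tf puts
  // them in the top 'fraction' (e.g. 0.1) of documents for that term.
  public static List<Integer> keepForTerm(IndexReader reader, String text,
                                          float fraction) throws IOException {
    final Map<Integer, Float> sum = new HashMap<Integer, Float>();
    for (int i = 0; i < FIELDS.length; i++) {
      TermDocs docs = reader.termDocs(new Term(FIELDS[i], text));
      while (docs.next()) {                       // postings for this field
        sum.merge(docs.doc(), BOOSTS[i] * docs.freq(), Float::sum);
      }
      docs.close();
    }
    List<Integer> ids = new ArrayList<Integer>(sum.keySet());
    if (ids.isEmpty()) return ids;
    Collections.sort(ids, (a, b) -> Float.compare(sum.get(b), sum.get(a)));
    int keep = Math.max(1, (int) (ids.size() * fraction));
    return ids.subList(0, keep);
  }

  // Drive it from the content field's term dictionary; a complete
  // version would merge enums over all four fields instead.
  public static void prune(IndexReader reader) throws IOException {
    TermEnum terms = reader.terms(new Term("content", ""));
    try {
      while (terms.term() != null && "content".equals(terms.term().field())) {
        List<Integer> keep = keepForTerm(reader, terms.term().text(), 0.10f);
        // ... record (term, keep) and write them to the pruned index ...
        if (!terms.next()) break;
      }
    } finally {
      terms.close();
    }
  }
}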
For multi-term Nutch queries, which are rewritten to a combination of
boolean queries and sloppy phrase queries, the effects are disastrous -
Yes, this is why I was discouraged and stopped working on this.
However, I am now hopeful that sorting the entire index by page score and
using only the top 1000 might work well with Nutch queries, since page score is
field-independent, and I think fields cause the problems. Plus, this
would be a lot simpler than the cross-field summing described above.
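
Roughly what I have in mind is below - only a sketch, and it cheats: it
assumes the page score is available as a stored field (called "boost" here as
a placeholder), that every field is stored, and it re-analyzes tokenized
fields on the way back in. A real index-sorter would have to copy and
renumber the postings directly rather than re-add stored documents.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;

// Sketch only: rebuild the index with documents ordered by descending
// page score, so the first postings seen for any term belong to the
// highest-scored pages.
public class IndexSorterSketch {

  public static void sort(String src, String dst) throws IOException {
    IndexReader reader = IndexReader.open(src);
    try {
      final int maxDoc = reader.maxDoc();
      final float[] score = new float[maxDoc];
      List<Integer> order = new ArrayList<Integer>();
      for (int i = 0; i < maxDoc; i++) {
        if (reader.isDeleted(i)) continue;
        // "boost" is a placeholder for wherever the page score lives.
        String s = reader.document(i).get("boost");
        score[i] = (s == null) ? 0.0f : Float.parseFloat(s);
        order.add(i);
      }
      // Highest page score first.
      Collections.sort(order, (a, b) -> Float.compare(score[b], score[a]));

      IndexWriter writer = new IndexWriter(dst, new WhitespaceAnalyzer(), true);
      try {
        for (int doc : order) {
          // Only valid if every field is stored; a real sorter must
          // instead copy and renumber the postings themselves.
          writer.addDocument(reader.document(doc));
        }
        writer.optimize();
      } finally {
        writer.close();
      }
    } finally {
      reader.close();
    }
  }
}

With the index in that order, a search could simply stop after the first 1000
postings for each term, which is the top-1000 idea above.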
I can start writing an index-sorter today, unless you are already
working on this. If you have an evaluation framework, that would be great.
Doug