Andrzej Bialecki wrote:
For single-term queries (in Nutch - in Lucene they are rewritten to complex BooleanQueries), the hit lists are nearly identical for the first 10 hits, then they start to differ more and more as you progress along the original hit list. This is not so surprising - after all, this "optimization" operation is lossy. Still, the differences are larger than those reported in the paper by Suel (but they used a different algorithm to select the postings) - Suel et al. were able to achieve 98% accuracy for the top-10 results, _including_ multi-term boolean queries.

A better way to prune the index might be to look at the sum of query-boosted scores from the content, title, url and anchor fields for each term. One could process four TermEnums in parallel, one for each field, and include documents in the index if the sum places them in the top 10%. But this is rather complex, and I am hopeful that a simpler method may work better.
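To make the cross-field idea concrete, here is a rough sketch in plain Python rather than against the Lucene API. Everything in it is an assumption for illustration: the in-memory `postings` layout, the `FIELD_BOOSTS` values, and the function name `prune_by_summed_score` are hypothetical, and a real implementation would walk four TermEnums in parallel instead of dictionaries.

```python
from collections import defaultdict

# Hypothetical per-field query boosts; real values would come from the Nutch configuration.
FIELD_BOOSTS = {"content": 1.0, "title": 1.5, "url": 4.0, "anchor": 2.0}


def prune_by_summed_score(postings, keep_percent=10):
    """Keep, for each term, the documents whose boosted cross-field score is in the top N%.

    postings: {field: {term: {doc_id: score}}}  (toy in-memory stand-in for the index)
    returns:  {term: set of doc_ids to keep}
    """
    # Collect every term that occurs in any field.
    terms = set()
    for field_postings in postings.values():
        terms.update(field_postings)

    kept = {}
    for term in terms:
        # Sum the boosted scores of this term across all four fields, per document.
        totals = defaultdict(float)
        for field, boost in FIELD_BOOSTS.items():
            for doc_id, score in postings.get(field, {}).get(term, {}).items():
                totals[doc_id] += boost * score

        # Rank by summed score (ties broken by doc id) and keep the top N%, at least one doc.
        ranked = sorted(totals, key=lambda d: (-totals[d], d))
        n_keep = max(1, (len(ranked) * keep_percent + 99) // 100)  # integer ceiling
        kept[term] = set(ranked[:n_keep])
    return kept
```

Even in this toy form the complexity is visible: each term needs a merge over all fields before any posting can be accepted or dropped.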

For multi-term Nutch queries, which are rewritten to a combination of boolean queries and sloppy phrase queries, the effects are disastrous -

Yes, this is why I was discouraged and stopped working on this.

However, I am now hopeful that sorting the entire index by page score and using the top-1000 might work well with Nutch queries, since page score is field-independent, and I think fields cause the problems. Plus, this would be a lot simpler than the cross-field summing described above.
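One way to picture the sorting idea, as a toy sketch and not the index-sorter itself: renumber the documents in decreasing page-score order, so that every posting list (ordered by document id) starts with its highest-scored pages and can simply be cut off after the first 1000 entries. The data layout and the names `sort_index_by_page_score` and `top_hits` are hypothetical, and treating "top-1000" as a per-term cutoff is an assumption.

```python
def sort_index_by_page_score(page_scores, postings):
    """Renumber documents so that new id 0 has the highest page score.

    page_scores: {old_doc_id: score}
    postings:    {term: iterable of old_doc_ids}
    returns:     (old_to_new mapping, postings remapped and sorted by new id)
    """
    # Highest score first; ties broken by old id so the result is deterministic.
    order = sorted(page_scores, key=lambda d: (-page_scores[d], d))
    old_to_new = {old: new for new, old in enumerate(order)}
    new_postings = {
        term: sorted(old_to_new[d] for d in docs) for term, docs in postings.items()
    }
    return old_to_new, new_postings


def top_hits(new_postings, term, limit=1000):
    """Return the first `limit` postings, which are the best pages by page score."""
    return new_postings.get(term, [])[:limit]
```

Because the ordering depends only on page score, the same cutoff applies to every field, which is what makes this simpler than summing per-field scores.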

I can start writing an index-sorter today, unless you are already working on this. If you have an evaluation framework, that would be great.

Doug
