Andrzej Bialecki wrote:
For single-term queries (single-term in Nutch, that is - Lucene rewrites them
into complex BooleanQueries), the hit lists are nearly identical for the
first 10 hits, then they start to differ more and more as you progress
along the original hit list. This is not so surprising - after all, this
"optimization" operation is lossy. Still, the differences are higher
than it was reported in that paper by Suel (but they used a different
algorithm to select the postings) - Suel et al. were able to achieve 98%
accuracy for the top-10 results, _including_ multi-term boolean queries.
A better way to prune the index might be to look at the sum of
query-boosted scores from the content, title, url and anchor fields for
each term. One could process four TermEnums in parallel, one for each
field, and keep a document's postings for a term only if that sum places
the document in the top 10% for the term. But this is rather complex, and I am hopeful that a simpler
method may work better.
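
Something along these lines is what I mean - just a sketch, not working code.
The boosts and the class and method names below are placeholders, and for
brevity it walks only the content field's term dictionary; a real version
would march the four TermEnums in parallel so that terms occurring only in
title, url or anchor are not missed.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;

// Sketch only: for each term, sum a boosted tf over the four Nutch
// fields and keep the documents whose sum is in the top fraction.
public class CrossFieldPruneSketch {

  // Placeholder boosts; the real query-time boosts are configurable.
  private static final String[] FIELDS = {"content", "title", "url", "anchor"};
  private static final float[] BOOSTS = {1.0f, 3.0f, 3.0f, 2.0f};

  // For one term text, return the doc ids whose summed boosted tf puts
  // them in the top 'fraction' (e.g. 0.1) of documents for that term.
  public static List<Integer> keepForTerm(IndexReader reader, String text,
                                          float fraction) throws IOException {
    final Map<Integer, Float> sum = new HashMap<Integer, Float>();
    for (int i = 0; i < FIELDS.length; i++) {
      TermDocs docs = reader.termDocs(new Term(FIELDS[i], text));
      while (docs.next()) {                       // postings for this field
        sum.merge(docs.doc(), BOOSTS[i] * docs.freq(), Float::sum);
      }
      docs.close();
    }
    List<Integer> ids = new ArrayList<Integer>(sum.keySet());
    if (ids.isEmpty()) return ids;
    Collections.sort(ids, (a, b) -> Float.compare(sum.get(b), sum.get(a)));
    int keep = Math.max(1, (int) (ids.size() * fraction));
    return ids.subList(0, keep);
  }

  // Drive it from the content field's term dictionary; a complete
  // version would merge enums over all four fields instead.
  public static void prune(IndexReader reader) throws IOException {
    TermEnum terms = reader.terms(new Term("content", ""));
    try {
      while (terms.term() != null && "content".equals(terms.term().field())) {
        List<Integer> keep = keepForTerm(reader, terms.term().text(), 0.10f);
        // ... record (term, keep) and write them to the pruned index ...
        if (!terms.next()) break;
      }
    } finally {
      terms.close();
    }
  }
}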
For multi-term Nutch queries, which are rewritten to a combination of
boolean queries and sloppy phrase queries, the effects are disastrous -
Yes, this is why I was discouraged and stopped working on this.
However, I am now hopeful that sorting the entire index by page score and
using only the top 1000 might work well with Nutch queries, since page score is
field-independent, and I think fields cause the problems. Plus, this
would be a lot simpler than the cross-field summing described above.
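
Roughly what I have in mind is below - only a sketch, and it cheats: it
assumes the page score is available as a stored field (called "boost" here as
a placeholder), that every field is stored, and it re-analyzes tokenized
fields on the way back in. A real index-sorter would have to copy and
renumber the postings directly rather than re-add stored documents.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;

// Sketch only: rebuild the index with documents ordered by descending
// page score, so the first postings seen for any term belong to the
// highest-scored pages.
public class IndexSorterSketch {

  public static void sort(String src, String dst) throws IOException {
    IndexReader reader = IndexReader.open(src);
    try {
      final int maxDoc = reader.maxDoc();
      final float[] score = new float[maxDoc];
      List<Integer> order = new ArrayList<Integer>();
      for (int i = 0; i < maxDoc; i++) {
        if (reader.isDeleted(i)) continue;
        // "boost" is a placeholder for wherever the page score lives.
        String s = reader.document(i).get("boost");
        score[i] = (s == null) ? 0.0f : Float.parseFloat(s);
        order.add(i);
      }
      // Highest page score first.
      Collections.sort(order, (a, b) -> Float.compare(score[b], score[a]));

      IndexWriter writer = new IndexWriter(dst, new WhitespaceAnalyzer(), true);
      try {
        for (int doc : order) {
          // Only valid if every field is stored; a real sorter must
          // instead copy and renumber the postings themselves.
          writer.addDocument(reader.document(doc));
        }
        writer.optimize();
      } finally {
        writer.close();
      }
    } finally {
      reader.close();
    }
  }
}

With the index in that order, a search could simply stop after the first 1000
postings for each term, which is the top-1000 idea above.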
I can start writing an index-sorter today, unless you are already
working on this. If you have an evaluation framework, that would be great.
Doug