Marvin Humphrey wrote:
On May 1, 2006, at 10:38 AM, Doug Cutting wrote:
Nutch implements host-deduping roughly as follows:
To fetch the first 10 hits it first asks for the top-scoring 20 or
so. Then it uses a field cache to reduce this to just two from each
host. If it runs out of raw hits, then it re-runs the query, this
time for the top scoring 40 hits. But the query is modified this
time to exclude matches from hosts that have already returned more
than two hits. (Nutch also automatically converts clauses like
"-host:foo.com" into cached filters when "foo.com" occurs in more than
a certain percentage of documents.)
Is that an optimization which only works for Nutch and hosts, or is it
something that could be generalized and implemented sanely in Lucene?
It's probably generalizable.
The stuff that optimizes queries into filters is in:
http://svn.apache.org/viewcvs.cgi/lucene/nutch/trunk/src/java/org/apache/nutch/searcher/LuceneQueryOptimizer.java?view=markup
The deduping logic is in:
http://svn.apache.org/viewcvs.cgi/lucene/nutch/trunk/src/java/org/apache/nutch/searcher/NutchBean.java?view=markup
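For readers who don't want to dig through the source, here is a rough sketch of the loop described above: over-fetch, cap the hits kept per host, and re-run with a larger limit and the over-represented hosts excluded when the raw hits run out. It is not the NutchBean code. The Hit and Searcher types, the inMemory() helper, and the doubling factor are assumptions made up for illustration, and the cached-filter optimization is left out entirely.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class HostDedupSketch {

    /** Hypothetical hit type: a document id plus the host it came from. */
    static final class Hit {
        final String doc;
        final String host;
        Hit(String doc, String host) { this.doc = doc; this.host = host; }
        @Override public String toString() { return doc + "@" + host; }
    }

    /** Hypothetical backend: top n hits in score order, skipping excluded hosts. */
    interface Searcher {
        List<Hit> search(int n, Set<String> excludedHosts);
    }

    /** Toy backend over a list that is already sorted by descending score. */
    static Searcher inMemory(List<Hit> index) {
        return (n, excludedHosts) -> {
            List<Hit> out = new ArrayList<>();
            for (Hit h : index) {
                if (out.size() == n) break;
                if (!excludedHosts.contains(h.host)) out.add(h);
            }
            return out;
        };
    }

    /** Returns up to 'wanted' hits with at most 'maxPerHost' from any one host. */
    static List<Hit> dedupedSearch(Searcher searcher, int wanted, int maxPerHost) {
        List<Hit> results = new ArrayList<>();
        if (wanted <= 0) return results;

        int rawN = wanted * 2;                      // over-fetch: 20 raw hits for 10 wanted
        Set<String> excluded = new HashSet<>();     // hosts that already hit the cap
        Set<String> seenDocs = new HashSet<>();     // docs handled in an earlier pass
        Map<String, Integer> perHost = new HashMap<>();

        while (true) {
            List<Hit> raw = searcher.search(rawN, excluded);
            for (Hit hit : raw) {
                if (!seenDocs.add(hit.doc)) continue;   // re-run returned it again
                int kept = perHost.getOrDefault(hit.host, 0);
                if (kept >= maxPerHost) {
                    excluded.add(hit.host);             // drop it and exclude the host next time
                } else {
                    perHost.put(hit.host, kept + 1);
                    results.add(hit);
                    if (results.size() == wanted) return results;
                }
            }
            if (raw.size() < rawN) return results;      // no more matches to fetch
            rawN *= 2;                                  // ran out of raw hits: ask for more
        }
    }

    public static void main(String[] args) {
        String[][] rows = {
            {"a1", "a.com"}, {"a2", "a.com"}, {"a3", "a.com"}, {"a4", "a.com"},
            {"b1", "b.com"}, {"a5", "a.com"}, {"c1", "c.com"}, {"b2", "b.com"},
        };
        List<Hit> index = new ArrayList<>();
        for (String[] r : rows) index.add(new Hit(r[0], r[1]));
        System.out.println(dedupedSearch(inMemory(index), 4, 2));
    }
}
```

Nothing in the loop is specific to hosts. The same shape works for any single-valued field you can read cheaply per hit, which is why it should generalize.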
Doug