Marvin Humphrey wrote:
On May 1, 2006, at 10:38 AM, Doug Cutting wrote:
Nutch implements host-deduping roughly as follows:

To fetch the first 10 hits, it first asks for the top-scoring 20 or so. Then it uses a field cache to reduce this to just two from each host. If it runs out of raw hits, it re-runs the query, this time for the top-scoring 40 hits. But the query is modified this time to exclude matches from hosts that have already returned more than two hits. (Nutch also automatically converts clauses like "-host:foo.com" into cached filters when "foo.com" occurs in more than a certain percentage of documents.)
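
The loop described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in NutchBean.java: the names HostDedupSketch, Hit, dedupedSearch and rawSearch are hypothetical, the index is modeled as a function that takes a hit count and a set of excluded hosts and returns hits sorted by descending score, and the field cache and the filter optimization are left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.BiFunction;

class HostDedupSketch {
    /** Hypothetical raw hit: the host it came from and its score. */
    static final class Hit {
        final String host;
        final double score;
        Hit(String host, double score) { this.host = host; this.score = score; }
    }

    static final int MAX_PER_HOST = 2;

    /**
     * rawSearch(n, excludedHosts) is assumed to return up to n hits, sorted by
     * descending score, skipping every host in excludedHosts.
     */
    static List<Hit> dedupedSearch(
            BiFunction<Integer, Set<String>, List<Hit>> rawSearch, int wanted) {
        int numRaw = wanted * 2;                  // e.g. 20 raw hits for 10 wanted
        Set<String> excludedHosts = new HashSet<>();
        List<Hit> retained = new ArrayList<>();   // kept hits from hosts excluded so far

        while (true) {
            List<Hit> raw = rawSearch.apply(numRaw, excludedHosts);
            Map<String, Integer> perHost = new HashMap<>();
            Set<String> overflowed = new HashSet<>();
            List<Hit> kept = new ArrayList<>(retained);

            for (Hit h : raw) {
                int seen = perHost.merge(h.host, 1, Integer::sum);
                if (seen <= MAX_PER_HOST) {
                    kept.add(h);                  // first two hits from this host
                } else {
                    overflowed.add(h.host);       // host returned more than two hits
                }
            }
            // Restore score order after mixing retained hits with new raw hits.
            kept.sort((a, b) -> Double.compare(b.score, a.score));

            if (kept.size() >= wanted) {
                return new ArrayList<>(kept.subList(0, wanted));
            }
            if (raw.size() < numRaw) {
                return kept;                      // no more raw hits exist
            }
            // Ran out of raw hits: remember what we kept from the overflowing
            // hosts, exclude those hosts, and ask for twice as many raw hits.
            for (Hit h : kept) {
                if (overflowed.contains(h.host)) retained.add(h);
            }
            excludedHosts.addAll(overflowed);
            numRaw *= 2;
        }
    }
}
```

In this sketch the hits already kept from an excluded host are carried over to the next round, so excluding the host from the re-run only drops its surplus matches.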

Is that an optimization which only works for Nutch and hosts, or is it something that could be generalized and implemented sanely in Lucene?

It's probably generalizable.

The stuff that optimizes queries into filters is in:

http://svn.apache.org/viewcvs.cgi/lucene/nutch/trunk/src/java/org/apache/nutch/searcher/LuceneQueryOptimizer.java?view=markup

The deduping logic is in:

http://svn.apache.org/viewcvs.cgi/lucene/nutch/trunk/src/java/org/apache/nutch/searcher/NutchBean.java?view=markup

Doug
