Re: Returning a minimum number of clusters

Doug Cutting Mon, 01 May 2006 15:51:47 -0700

Marvin Humphrey wrote:

On May 1, 2006, at 10:38 AM, Doug Cutting wrote:
Nutch implements host-deduping roughly as follows:
To fetch the first 10 hits it first asks for the top-scoring 20 or so. Then it uses a field cache to reduce this to just two from each host. If it runs out of raw hits, then it re-runs the query, this time for the top scoring 40 hits. But the query is modified this time to exclude matches from hosts that have already returned more than two hits. (Nutch also automatically converts clauses like "-host:foo.com" into cached filters when "foo.com" occurs in more than a certain percentage of documents.)
Is that an optimization which only works for Nutch and hosts, or is it something that could be generalized and implemented sanely in Lucene?


It's probably generalizable.

The stuff that optimizes queries into filters is in:

http://svn.apache.org/viewcvs.cgi/lucene/nutch/trunk/src/java/org/apache/nutch/searcher/LuceneQueryOptimizer.java?view=markup

The deduping logic is in:

http://svn.apache.org/viewcvs.cgi/lucene/nutch/trunk/src/java/org/apache/nutch/searcher/NutchBean.java?view=markup
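
A rough sketch of how the fetch, dedupe, and re-query loop described above might look when generalized from hosts to an arbitrary grouping key. This is not the Nutch code: the names HostDedupSketch, Hit, Searcher, dedupedSearch, and inMemory are made up for illustration, and the Searcher interface only stands in for a Lucene query with prohibited clauses on the excluded keys.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; none of these types exist in Nutch or Lucene.
final class HostDedupSketch {

    /** One raw hit: document id, grouping key (e.g. host), and score. */
    public static final class Hit {
        public final int doc;
        public final String key;
        public final float score;

        public Hit(int doc, String key, float score) {
            this.doc = doc;
            this.key = key;
            this.score = score;
        }
    }

    /** Assumed backend: top-n hits by score, skipping any hit whose key is excluded. */
    public interface Searcher {
        List<Hit> search(int n, Set<String> excludedKeys);
    }

    /** Returns up to `wanted` hits with at most `maxPerKey` hits per key. */
    public static List<Hit> dedupedSearch(Searcher searcher, int wanted, int maxPerKey) {
        List<Hit> kept = new ArrayList<>();
        Set<Integer> seenDocs = new HashSet<>();
        Map<String, Integer> keptPerKey = new HashMap<>();
        Set<String> excluded = new HashSet<>();
        int rawWanted = wanted * 2;               // e.g. ask for 20 to fill 10

        while (true) {
            List<Hit> raw = searcher.search(rawWanted, excluded);
            for (Hit h : raw) {
                if (!seenDocs.add(h.doc)) {
                    continue;                     // already handled in an earlier round
                }
                int n = keptPerKey.getOrDefault(h.key, 0);
                if (n >= maxPerKey) {
                    excluded.add(h.key);          // key is over its cap: prohibit it next round
                    continue;
                }
                keptPerKey.put(h.key, n + 1);
                kept.add(h);
                if (kept.size() == wanted) {
                    return kept;
                }
            }
            if (raw.size() < rawWanted) {
                return kept;                      // the index has no more matches
            }
            rawWanted *= 2;                       // ran out of raw hits: re-run for more
        }
    }

    /** Tiny in-memory Searcher over a fixed list, for trying the loop out. */
    public static Searcher inMemory(List<Hit> index) {
        List<Hit> sorted = new ArrayList<>(index);
        sorted.sort((a, b) -> Float.compare(b.score, a.score));
        return (n, excludedKeys) -> {
            List<Hit> out = new ArrayList<>();
            for (Hit h : sorted) {
                if (out.size() == n) {
                    break;
                }
                if (!excludedKeys.contains(h.key)) {
                    out.add(h);
                }
            }
            return out;
        };
    }
}
```

With wanted = 10 and maxPerKey = 2, the first round asks for 20 raw hits and the second for 40, which matches the numbers in the description. The sketch leaves out the field cache and the conversion of frequent exclusion clauses into cached filters.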

Doug

