Marvin Humphrey wrote:
> The problem I'm trying to solve is how to return a minimum number of
> clusters from a search. Say the 100 most relevant documents for a
> query are all from the same domain, but you want a maximum of two
> results per domain, a la Google. I don't see any alternative to
> re-running the query an indeterminate number of times until you've
> accumulated sufficient clusters, because the search logic doesn't know
> which cluster a document belongs to until the document vector is
> retrieved. Is there a better way?
Nutch implements host-deduping roughly as follows:
To fetch the first 10 hits, it first asks for the top-scoring 20 or so,
then uses a field cache to reduce these to at most two per host. If that
exhausts the raw hits before 10 survive, it re-runs the query, this time
for the top-scoring 40 hits, with the query modified to exclude matches
from hosts that have already supplied their two hits. (Nutch also
automatically converts clauses like "-host:foo.com" into cached filters
when "foo.com" occurs in more than a certain percentage of documents.)
Thus, in the worst case, it could take five queries to return the top
ten hits, but in practice I've never seen more than three, and the
re-query rate is usually quite low. Since raw hits are cheap to compute
and, with a field cache, the host filtering is also fast, one can reduce
the re-query rate with little performance impact simply by starting with
a larger number of raw hits.
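
Here's a minimal sketch of that loop in Java. It is not Nutch's actual
code: Query, Searcher, and hostOf() are hypothetical stand-ins, with
hostOf() representing the field-cache lookup.

  import java.util.*;

  // Hypothetical stand-ins for the real search classes; hostOf()
  // represents the field-cache lookup described above.
  interface Query {}

  interface Searcher {
    /** Top-n doc ids for query, skipping docs from excluded hosts. */
    List<Integer> search(Query query, Set<String> excludedHosts, int n);
    /** The host of a document, read from a field cache. */
    String hostOf(int docId);
  }

  class HostDedup {
    /** Returns up to 'wanted' hits, at most 'maxPerHost' per host. */
    static List<Integer> search(Searcher searcher, Query query,
                                int wanted, int maxPerHost) {
      int raw = 2 * wanted;                    // first pass: ~2x over-fetch
      Set<String> excluded = new HashSet<>();  // saturated hosts
      Set<Integer> seen = new HashSet<>();     // docs already examined
      Map<String, Integer> perHost = new HashMap<>();
      List<Integer> kept = new ArrayList<>();
      while (true) {
        List<Integer> rawHits = searcher.search(query, excluded, raw);
        for (int docId : rawHits) {
          if (!seen.add(docId)) continue;      // counted in an earlier pass
          String host = searcher.hostOf(docId);
          int count = perHost.merge(host, 1, Integer::sum);
          if (count <= maxPerHost) {
            kept.add(docId);
            if (kept.size() == wanted) return kept;
          }
          if (count >= maxPerHost) {
            // Host has filled its quota: exclude it from re-queries,
            // as Nutch does with "-host:..." clauses (cached as
            // filters when the host is common).
            excluded.add(host);
          }
        }
        if (rawHits.size() < raw) return kept; // no more matches anywhere
        raw *= 2;                              // re-query with a deeper cut
      }
    }
  }

Hits already kept from a now-excluded host stay in the result, and the
seen-set keeps earlier passes' hits from being counted twice; starting
with a larger initial raw count is the cheap way to cut the re-query
rate, as noted above.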
BTW, clustering in Information Retrieval usually implies grouping by
vector distance using statistical methods:
http://en.wikipedia.org/wiki/Data_clustering
Doug