On May 1, 2006, at 10:38 AM, Doug Cutting wrote:
Nutch implements host-deduping roughly as follows:
To fetch the first 10 hits, it first asks for the top-scoring 20 or
so. Then it uses a field cache to reduce this to just two from each
host. If it runs out of raw hits, it re-runs the query for the
top-scoring 40 hits, this time modified to exclude matches from hosts
that have already returned more than two hits. (Nutch also
automatically converts clauses like "-host:foo.com" into cached
filters when "foo.com" occurs in more than a certain percentage of
documents.)
Is that an optimization which only works for Nutch and hosts, or is
it something that could be generalized and implemented sanely in Lucene?
Thus, in the worst case, it could take five queries to return the
top ten hits, but in practice I've never seen more than three, and
the re-query rate is usually quite low. Since raw hits are cheap to
compute and, with a field cache, the host filtering is also fast,
one can reduce the re-query rate simply by starting with a larger
number of raw hits, with little performance impact.
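
A rough Java sketch of that loop, against the Hits / FieldCache /
BooleanQuery APIs, might look like the following. This is only an
illustration of the scheme described above, not Nutch's actual code:
the class, method, and constant names are made up, and the over-fetch
factor is arbitrary.

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class HostDedupeSketch {

  private static final int MAX_PER_HOST = 2;  // illustrative cap, like Nutch's two-per-host

  /** Return up to numWanted doc ids, keeping at most MAX_PER_HOST per host. */
  public static List<Integer> fetchDeduped(IndexSearcher searcher, Query query,
                                           int numWanted) throws IOException {
    // Field cache: the "host" value for every document, keyed by doc id.
    String[] hostByDoc =
        FieldCache.DEFAULT.getStrings(searcher.getIndexReader(), "host");

    List<Integer> kept = new ArrayList<Integer>();      // doc ids to return, in score order
    Set<Integer> keptIds = new HashSet<Integer>();      // so re-queries don't add duplicates
    Map<String, Integer> perHost = new HashMap<String, Integer>();
    Set<String> excludedHosts = new HashSet<String>();  // hosts that have hit the cap
    int rawToFetch = numWanted * 2;                     // over-fetch: top 20 for the first 10

    while (kept.size() < numWanted) {
      // Re-run the query, excluding hosts that have already returned enough hits.
      BooleanQuery q = new BooleanQuery();
      q.add(query, BooleanClause.Occur.MUST);
      for (String host : excludedHosts) {
        q.add(new TermQuery(new Term("host", host)), BooleanClause.Occur.MUST_NOT);
      }

      Hits hits = searcher.search(q);
      int limit = Math.min(rawToFetch, hits.length());
      for (int i = 0; i < limit && kept.size() < numWanted; i++) {
        Integer docId = new Integer(hits.id(i));
        if (keptIds.contains(docId)) continue;          // already kept in an earlier pass
        String host = hostByDoc[docId.intValue()];
        Integer count = perHost.get(host);
        int c = (count == null) ? 0 : count.intValue();
        if (c < MAX_PER_HOST) {
          kept.add(docId);
          keptIds.add(docId);
          perHost.put(host, new Integer(c + 1));
        } else {
          excludedHosts.add(host);                      // saturated; exclude on the next pass
        }
      }

      if (limit >= hits.length()) break;                // out of raw hits entirely
      rawToFetch *= 2;                                  // ask for twice as many next time
    }
    return kept;
  }
}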
Great, thanks. It's good to know that in practice re-running the
queries is not much of a concern.
BTW, clustering in Information Retrieval usually implies grouping
by vector distance using statistical methods:
http://en.wikipedia.org/wiki/Data_clustering
Exactly. I'd scanned this, but I haven't yet familiarized myself
with the different models.
It may be possible to cluster both keyword fields, e.g. "host", and
non-keyword fields, e.g. "content", using the same algorithm and an
interface like Hits.cluster(String fieldname, int docsPerCluster):
retrieve each hit's term vector for the specified field, map the docs
into a unified term space, then cluster. For "host" or any other
keyword field, the boundaries will be stark and the cost of
calculation negligible. For "content", a more sophisticated model
will be required to group the docs, and the cost will be greater.
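
For the keyword-field case, the grouping step might look something
like the sketch below. Hits.cluster() doesn't exist yet, so this is
written as a standalone helper against the current Hits and
FieldCache APIs; the name and signature are only illustrative.

import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;

public class ClusterSketch {

  /**
   * Group the top hits by the value of a keyword field such as "host".
   * Returns a map of field value -> doc ids, each cluster truncated to
   * docsPerCluster. A LinkedHashMap keeps clusters in the order their
   * best-scoring member appeared in the hit list.
   */
  public static Map<String, List<Integer>> clusterByKeywordField(
      IndexSearcher searcher, Hits hits, String fieldName, int docsPerCluster)
      throws IOException {
    String[] valueByDoc =
        FieldCache.DEFAULT.getStrings(searcher.getIndexReader(), fieldName);

    Map<String, List<Integer>> clusters = new LinkedHashMap<String, List<Integer>>();
    for (int i = 0; i < hits.length(); i++) {
      int docId = hits.id(i);
      String value = valueByDoc[docId];
      List<Integer> cluster = clusters.get(value);
      if (cluster == null) {
        cluster = new ArrayList<Integer>();
        clusters.put(value, cluster);
      }
      if (cluster.size() < docsPerCluster) {
        cluster.add(new Integer(docId));
      }
    }
    return clusters;
  }
}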
It is more expensive to calculate similarity based on the entire
document's contents rather than just a snippet chosen by the
Highlighter. However, it's presumably more accurate, and having the
term vectors pre-built at index time should help quite a bit. As the
number of terms increases, there is presumably a point at which the
cost becomes too great, but it might be a pretty large number of
terms. Dunno yet. It might make sense to have a "clusterContent"
field, a truncated version of "content" that is vectored but neither
stored nor indexed.
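
The per-pair cost being weighed here is essentially a cosine
similarity, cos(a, b) = (a . b) / (|a| |b|), over the two docs' term
vectors. A minimal sketch of that calculation from the pre-built term
vectors, using raw term frequencies (a real clustering pass would
presumably apply tf*idf weighting and cache the vectors rather than
re-read them per pair):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;

public class TermVectorSimilarity {

  /** Cosine similarity between two docs, from term vectors stored at index time. */
  public static double cosine(IndexReader reader, int docA, int docB, String field)
      throws IOException {
    Map<String, Integer> a = toMap(reader.getTermFreqVector(docA, field));
    Map<String, Integer> b = toMap(reader.getTermFreqVector(docB, field));

    double dot = 0.0, normA = 0.0, normB = 0.0;
    for (Map.Entry<String, Integer> e : a.entrySet()) {
      int fa = e.getValue().intValue();
      normA += (double) fa * fa;
      Integer fb = b.get(e.getKey());
      if (fb != null) {
        dot += (double) fa * fb.intValue();
      }
    }
    for (Integer fb : b.values()) {
      normB += (double) fb.intValue() * fb.intValue();
    }
    if (normA == 0.0 || normB == 0.0) {
      return 0.0;
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }

  /** Flatten a term vector into a term -> frequency map. */
  private static Map<String, Integer> toMap(TermFreqVector tv) {
    Map<String, Integer> map = new HashMap<String, Integer>();
    if (tv == null) {
      return map;  // field not vectored for this doc
    }
    String[] terms = tv.getTerms();
    int[] freqs = tv.getTermFrequencies();
    for (int i = 0; i < terms.length; i++) {
      map.put(terms[i], new Integer(freqs[i]));
    }
    return map;
  }
}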
After that, there's also the issue of generating cluster labels.
Lots of problems to be solved. But it seems to me that if the term
vectors are already there, that's an excellent start -- and if you're
using them for highlighting, you get the disk seeks for free.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/