On May 1, 2006, at 10:38 AM, Doug Cutting wrote:

Nutch implements host-deduping roughly as follows:

To fetch the first 10 hits, it first asks for the top-scoring 20 or so. Then it uses a field cache to reduce this to at most two hits from each host. If it runs out of raw hits, it re-runs the query, this time for the top-scoring 40 hits, with the query modified to exclude matches from hosts that have already returned more than two hits. (Nutch also automatically converts clauses like "-host:foo.com" into cached filters when "foo.com" occurs in more than a certain percentage of documents.)

Is that an optimization which only works for Nutch and hosts, or is it something that could be generalized and implemented sanely in Lucene?

Thus, in the worst case, it could take five queries to return the top ten hits, but in practice I've never seen more than three, and the re-query rate is usually quite low. Since raw hits are cheap to compute and, with a field cache, the host filtering is also fast, one can reduce the re-query rate by simply starting with a larger number of raw hits, with little performance impact.

Great, thanks -- it's good to know that in practice re-running the queries is not much of a concern.
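
To make the description concrete, here's an illustrative sketch of that dedup/re-query loop -- plain Java with made-up names, not Nutch's actual code; the saturated-host exclusion on the re-run is only noted in a comment:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Illustrative only -- none of these names come from Nutch.
    class HostDedupSketch {
      static final int MAX_PER_HOST = 2;

      List<Integer> topDeduped(String query, int wanted) {
        int rawSize = 2 * wanted;                  // first pass: ~20 raw hits for 10 wanted
        while (true) {
          int[] raw = rawSearch(query, rawSize);   // cheap, un-deduped hits, best-scoring first
          Map<String, Integer> perHost = new HashMap<String, Integer>();
          List<Integer> kept = new ArrayList<Integer>();
          for (int doc : raw) {
            String host = hostOf(doc);             // read from a field cache, so this is fast
            int seen = perHost.containsKey(host) ? perHost.get(host) : 0;
            if (seen < MAX_PER_HOST) {             // keep at most two hits per host
              perHost.put(host, seen + 1);
              kept.add(doc);
              if (kept.size() == wanted) return kept;
            }
          }
          if (raw.length < rawSize) return kept;   // index exhausted; return what we have
          rawSize *= 2;                            // re-query for 40, then 80, ... raw hits
          // A real implementation would also exclude already-saturated hosts on the
          // re-run (the "-host:foo.com" clauses / cached filters mentioned above).
        }
      }

      int[] rawSearch(String query, int n) { return new int[0]; }  // stand-in
      String hostOf(int doc) { return ""; }                        // stand-in
    }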

BTW, clustering in Information Retrieval usually implies grouping by vector distance using statistical methods:

http://en.wikipedia.org/wiki/Data_clustering

Exactly. I'd scanned this, but I haven't yet familiarized myself with the different models.
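
(For concreteness: the simplest version of that vector distance is cosine similarity over term-frequency vectors -- toy code, with made-up names:)

    import java.util.Map;

    // Toy illustration of "grouping by vector distance": each doc is a map of
    // term -> frequency, and similarity is the cosine of the angle between them.
    class CosineSketch {
      static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
          normA += (double) e.getValue() * e.getValue();
          Integer other = b.get(e.getKey());
          if (other != null) dot += (double) e.getValue() * other;
        }
        for (int v : b.values()) normB += (double) v * v;
        if (normA == 0.0 || normB == 0.0) return 0.0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
      }
    }

A clustering model then groups documents whose pairwise similarity is high (k-means, agglomerative clustering, and so on).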

It may be possible for both keyword fields (e.g. "host") and non-keyword fields (e.g. "content") to be clustered using the same algorithm and an interface like Hits.cluster(String fieldname, int docsPerCluster): retrieve each hit's vector for the specified field, map the docs into a unified term space, then cluster. For "host" or any other keyword field, the boundaries will be stark and the cost of calculation negligible. For "content", a more sophisticated model will be required to group the docs, and the cost will be greater.
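
Concretely, the shape I have in mind is something like the sketch below. Nothing here exists in Lucene today -- Hits has no cluster() method and the names are invented -- apart from IndexReader.getTermFreqVector(), which does. The clustering model itself is left out; for a keyword field the grouping is trivial:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermFreqVector;

    // Hypothetical sketch of the proposed clustering call, written as a
    // free-standing helper rather than a method on Hits.
    class ClusterSketch {

      // Returns groups of document ids, at most docsPerCluster per group.
      List<List<Integer>> cluster(IndexReader reader, int[] hitDocs,
                                  String fieldName, int docsPerCluster)
          throws IOException {
        // For a keyword field like "host", each doc's vector holds a single term,
        // so "clustering" degenerates to grouping on that term -- cheap and exact.
        Map<String, List<Integer>> groups = new HashMap<String, List<Integer>>();
        for (int doc : hitDocs) {
          TermFreqVector tfv = reader.getTermFreqVector(doc, fieldName);
          if (tfv == null) continue;               // field not vectored for this doc
          String[] terms = tfv.getTerms();
          String key = terms.length > 0 ? terms[0] : "";
          List<Integer> group = groups.get(key);
          if (group == null) {
            group = new ArrayList<Integer>();
            groups.put(key, group);
          }
          if (group.size() < docsPerCluster) group.add(doc);
        }
        // For a tokenized field like "content", one would instead build full
        // term-frequency vectors from tfv.getTerms() / tfv.getTermFrequencies()
        // and hand them to a real similarity-based model.
        return new ArrayList<List<Integer>>(groups.values());
      }
    }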

It is more expensive to calculate similarity based on the entire document's contents rather than just a snippet chosen by the Highlighter. However, it's presumably more accurate, and having the term vectors pre-built at index time should help quite a bit. As the number of terms increases, there is presumably a point at which the cost becomes too great, but it might be a pretty large number of terms. Dunno yet. It might make sense to have a "clusterContent" field, a truncated version of "content" that is vectored but not stored (it would still have to be indexed, since Lucene only builds term vectors for indexed fields).
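
At index time, such a field might look roughly like this (sketch only; the 5,000-character cap is an arbitrary placeholder):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    // Sketch: a truncated copy of "content" that exists only for its term vector.
    // Lucene won't record a term vector for an unindexed field, so the field is
    // tokenized/indexed but not stored.
    class ClusterContentSketch {
      static Document build(String content) {
        String truncated = content.length() > 5000 ? content.substring(0, 5000) : content;
        Document doc = new Document();
        doc.add(new Field("content", content, Field.Store.YES, Field.Index.TOKENIZED));
        doc.add(new Field("clusterContent", truncated,
                          Field.Store.NO, Field.Index.TOKENIZED, Field.TermVector.YES));
        return doc;
      }
    }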

After that, there's also the issue of generating cluster labels. Lots of problems to be solved. But it seems to me that if the term vectors are already there, that's an excellent start -- and if you're using them for highlighting, you get the disk seeks for free.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

