On May 1, 2006, at 10:38 AM, Doug Cutting wrote:
Nutch implements host-deduping roughly as follows:
To fetch the first 10 hits, it first asks for the top-scoring 20 or
so. Then it uses a field cache to reduce this to just two from each
host. If it runs out of raw hits, it re-runs the query for the
top-scoring 40 hits, this time modified to exclude matches from hosts
that have already returned more than two hits. (Nutch also
automatically converts clauses like "-host:foo.com" into cached
filters when "foo.com" occurs in more than a certain percentage of
documents.)
Is that an optimization which only works for Nutch and hosts, or is
it something that could be generalized and implemented sanely in Lucene?
Thus, in the worst case, it could take five queries to return the
top ten hits, but in practice I've never seen more than three, and
the re-query rate is usually quite low. Since raw hits are cheap to
compute and, with a field cache, the host filtering is also fast,
one can reduce the re-query rate simply by starting with a larger
number of raw hits, with little performance impact.
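
A rough Java sketch of that loop, against the Hits / FieldCache /
BooleanQuery APIs, might look like the following. This is only an
illustration of the scheme described above, not Nutch's actual code:
the class, method, and constant names are made up, and the over-fetch
factor is arbitrary.

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class HostDedupeSketch {

  private static final int MAX_PER_HOST = 2;  // illustrative cap, like Nutch's two-per-host

  /** Return up to numWanted doc ids, keeping at most MAX_PER_HOST per host. */
  public static List<Integer> fetchDeduped(IndexSearcher searcher, Query query,
                                           int numWanted) throws IOException {
    // Field cache: the "host" value for every document, keyed by doc id.
    String[] hostByDoc =
        FieldCache.DEFAULT.getStrings(searcher.getIndexReader(), "host");

    List<Integer> kept = new ArrayList<Integer>();      // doc ids to return, in score order
    Set<Integer> keptIds = new HashSet<Integer>();      // so re-queries don't add duplicates
    Map<String, Integer> perHost = new HashMap<String, Integer>();
    Set<String> excludedHosts = new HashSet<String>();  // hosts that have hit the cap
    int rawToFetch = numWanted * 2;                     // over-fetch: top 20 for the first 10

    while (kept.size() < numWanted) {
      // Re-run the query, excluding hosts that have already returned enough hits.
      BooleanQuery q = new BooleanQuery();
      q.add(query, BooleanClause.Occur.MUST);
      for (String host : excludedHosts) {
        q.add(new TermQuery(new Term("host", host)), BooleanClause.Occur.MUST_NOT);
      }

      Hits hits = searcher.search(q);
      int limit = Math.min(rawToFetch, hits.length());
      for (int i = 0; i < limit && kept.size() < numWanted; i++) {
        Integer docId = new Integer(hits.id(i));
        if (keptIds.contains(docId)) continue;          // already kept in an earlier pass
        String host = hostByDoc[docId.intValue()];
        Integer count = perHost.get(host);
        int c = (count == null) ? 0 : count.intValue();
        if (c < MAX_PER_HOST) {
          kept.add(docId);
          keptIds.add(docId);
          perHost.put(host, new Integer(c + 1));
        } else {
          excludedHosts.add(host);                      // saturated; exclude on the next pass
        }
      }

      if (limit >= hits.length()) break;                // out of raw hits entirely
      rawToFetch *= 2;                                  // ask for twice as many next time
    }
    return kept;
  }
}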
Great, thanks. It's good to know that in practice re-running the
queries is not much of a concern.
BTW, clustering in Information Retrieval usually implies grouping
by vector distance using statistical methods:
http://en.wikipedia.org/wiki/Data_clustering
Exactly. I'd scanned this, but I haven't yet familiarized myself
with the different models.
It may be possible to cluster both keyword fields, e.g. "host", and
non-keyword fields, e.g. "content", using the same algorithm and an
interface like Hits.cluster(String fieldname, int docsPerCluster):
retrieve each hit's term vector for the specified field, map the docs
into a unified term space, then cluster. For "host" or any other
keyword field, the boundaries will be stark and the cost of
calculation negligible. For "content", a more sophisticated model
will be required to group the docs, and the cost will be greater.
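
For the keyword-field case, the grouping step might look something
like the sketch below. Hits.cluster() doesn't exist yet, so this is
written as a standalone helper against the current Hits and
FieldCache APIs; the name and signature are only illustrative.

import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.apache.lucene.search.FieldCache;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;

public class ClusterSketch {

  /**
   * Group the top hits by the value of a keyword field such as "host".
   * Returns a map of field value -> doc ids, each cluster truncated to
   * docsPerCluster. A LinkedHashMap keeps clusters in the order their
   * best-scoring member appeared in the hit list.
   */
  public static Map<String, List<Integer>> clusterByKeywordField(
      IndexSearcher searcher, Hits hits, String fieldName, int docsPerCluster)
      throws IOException {
    String[] valueByDoc =
        FieldCache.DEFAULT.getStrings(searcher.getIndexReader(), fieldName);

    Map<String, List<Integer>> clusters = new LinkedHashMap<String, List<Integer>>();
    for (int i = 0; i < hits.length(); i++) {
      int docId = hits.id(i);
      String value = valueByDoc[docId];
      List<Integer> cluster = clusters.get(value);
      if (cluster == null) {
        cluster = new ArrayList<Integer>();
        clusters.put(value, cluster);
      }
      if (cluster.size() < docsPerCluster) {
        cluster.add(new Integer(docId));
      }
    }
    return clusters;
  }
}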
It is more expensive to calculate similarity based on the entire
document's contents rather than just a snippet chosen by the
Highlighter. However, it's presumably more accurate, and having the
term vectors pre-built at index time should help quite a bit. As the
number of terms increases, there is presumably a point at which the
cost becomes too great, but it might be a pretty large number of
terms. Dunno yet. It might make sense to have a "clusterContent"
field, a truncated version of "content" that is vectored but neither
stored nor indexed.
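
The per-pair cost being weighed here is essentially a cosine
similarity, cos(a, b) = (a . b) / (|a| |b|), over the two docs' term
vectors. A minimal sketch of that calculation from the pre-built term
vectors, using raw term frequencies (a real clustering pass would
presumably apply tf*idf weighting and cache the vectors rather than
re-read them per pair):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;

public class TermVectorSimilarity {

  /** Cosine similarity between two docs, from term vectors stored at index time. */
  public static double cosine(IndexReader reader, int docA, int docB, String field)
      throws IOException {
    Map<String, Integer> a = toMap(reader.getTermFreqVector(docA, field));
    Map<String, Integer> b = toMap(reader.getTermFreqVector(docB, field));

    double dot = 0.0, normA = 0.0, normB = 0.0;
    for (Map.Entry<String, Integer> e : a.entrySet()) {
      int fa = e.getValue().intValue();
      normA += (double) fa * fa;
      Integer fb = b.get(e.getKey());
      if (fb != null) {
        dot += (double) fa * fb.intValue();
      }
    }
    for (Integer fb : b.values()) {
      normB += (double) fb.intValue() * fb.intValue();
    }
    if (normA == 0.0 || normB == 0.0) {
      return 0.0;
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }

  /** Flatten a term vector into a term -> frequency map. */
  private static Map<String, Integer> toMap(TermFreqVector tv) {
    Map<String, Integer> map = new HashMap<String, Integer>();
    if (tv == null) {
      return map;  // field not vectored for this doc
    }
    String[] terms = tv.getTerms();
    int[] freqs = tv.getTermFrequencies();
    for (int i = 0; i < terms.length; i++) {
      map.put(terms[i], new Integer(freqs[i]));
    }
    return map;
  }
}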
After that, there's also the issue of generating cluster labels.
Lots of problems to be solved. But it seems to me that if the term
vectors are already there, that's an excellent start -- and if you're
using them for highlighting, you get the disk seeks for free.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/