Hi Andrzej,
And even with deduping, we run into problems, especially for top-level pages.
These often change slightly between crawls (so an exact content
hash won't match), and if
http://example.com is found during one pass, and a different
http://www.example.com is found at a later crawl, you wind up with
two hits for the same result. What's worse is that typically the summary
is exactly the same (from the body of the page), so to a user it's
painfully obvious that there are (near) duplicates in the index.
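(For concreteness, collapsing those two host variants onto a single key
might look roughly like the toy Java sketch below; the UrlKey name and
the bare strip-a-leading-"www." rule are illustrative assumptions, not
actual Nutch code.)

  import java.net.URL;

  // Toy example only: collapse host variants like example.com and
  // www.example.com onto a single key.
  public class UrlKey {
    public static String keyFor(String url) throws Exception {
      URL u = new URL(url);
      String host = u.getHost().toLowerCase();
      if (host.startsWith("www.")) {
        host = host.substring(4);
      }
      String path = u.getPath().isEmpty() ? "/" : u.getPath();
      return u.getProtocol() + "://" + host + path;
    }

    public static void main(String[] args) throws Exception {
      // Both lines print http://example.com/
      System.out.println(keyFor("http://example.com"));
      System.out.println(keyFor("http://www.example.com"));
    }
  }
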
To solve this, I think a near-duplicate detector would need to be
used when collapsing similar URLs. If you did this only when two
URLs appear to refer to the same page, I think it would be OK, as
that's the most common case. And since it would only run in that
narrow case, it could afford to be somewhat computationally
expensive (e.g. winnowing, a la
http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf).
Interesting paper, thanks for the pointer - I always wondered what
criteria to use to reduce the number of shingles, and this winnowing
approach is a simple enough recipe for creating page signatures. I may be
tempted to implement it ;)
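In case it's useful, here is roughly what the scheme from the paper
boils down to, as a Java sketch. The Winnow class name, the k/w
parameters, and the use of String.hashCode() for the k-gram hash are
all illustrative assumptions, not anything that exists in Nutch:

  import java.util.LinkedHashSet;
  import java.util.Set;

  // Minimal sketch of winnowing (Schleimer/Wilkerson/Aiken, SIGMOD '03):
  // hash every k-gram, slide a window of w hashes, and keep the minimum
  // hash of each window as part of the page's fingerprint.
  public class Winnow {

    public static Set<Integer> fingerprint(String text, int k, int w) {
      // Crude normalization: lower-case and drop non-alphanumerics.
      String s = text.toLowerCase().replaceAll("[^a-z0-9]", "");
      Set<Integer> selected = new LinkedHashSet<Integer>();
      if (s.length() < k) {
        return selected;
      }

      // Hash every k-gram (the paper uses a rolling Karp-Rabin hash to
      // make this cheap; hashCode() just keeps the sketch short).
      int n = s.length() - k + 1;
      int[] hashes = new int[n];
      for (int i = 0; i < n; i++) {
        hashes[i] = s.substring(i, i + k).hashCode();
      }

      if (n < w) {
        // Very short page: just keep the overall minimum hash.
        int min = hashes[0];
        for (int h : hashes) {
          min = Math.min(min, h);
        }
        selected.add(min);
        return selected;
      }

      // Slide the window; take the rightmost minimum in each window so
      // that adjacent windows tend to select the same position.
      int lastPos = -1;
      for (int start = 0; start + w <= n; start++) {
        int minPos = start;
        for (int i = start; i < start + w; i++) {
          if (hashes[i] <= hashes[minPos]) {
            minPos = i;
          }
        }
        if (minPos != lastPos) {
          selected.add(hashes[minPos]);
          lastPos = minPos;
        }
      }
      return selected;
    }
  }

Two pages could then be compared by the overlap of their fingerprint
sets (e.g. a Jaccard ratio) and collapsed when that overlap is above
some threshold.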
I took a quick scan through the public code and didn't find anything
that looked appropriate for this. One more potentially useful paper
is here:
http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf
There is a Signature implementation in Nutch that allows for small
differences in text (TextProfileSignature), but I guess it's not
sufficient in your case?
I thought we were using that, but I just double-checked and we're
not. So I'll try to switch over to that for the next crawl/index, to
see how well it works.
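If I'm remembering the Nutch config right, that switch should just be
a matter of overriding db.signature.class in nutch-site.xml, along
these lines (worth double-checking against nutch-default.xml):

  <property>
    <name>db.signature.class</name>
    <value>org.apache.nutch.crawl.TextProfileSignature</value>
  </property>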
Thanks,
-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"