Ken Krugler wrote:

[..]

And even with deduping, we run into problems, especially for top-level pages.

These often change slightly between crawls, so if http://example.com is fetched in one pass and a slightly different copy of the same page is fetched from http://www.example.com in a later crawl, you wind up with two hits for the same result. What's worse, the summary is typically exactly the same (it comes from the body of the page), so to a user it's painfully obvious that there are (near) duplicates in the index.

To solve this, I think a near-duplicate detector would need to be used when collapsing similar URLs. If it were applied only when two URLs appear to be the same, I think it would be OK, as that's the most common case, and since it would run rarely it could afford to be somewhat computationally expensive (e.g. winnowing, a la http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf).

Interesting paper, thanks for the pointer - I've always wondered what criteria to use to reduce the number of shingles, and winnowing is a simple enough recipe for creating page signatures. I may be tempted to implement it ;)
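
Roughly, the selection step as I read the paper: hash every k-gram of the text, slide a window of w consecutive hashes over them, and keep the (rightmost) minimum hash of each window as the document fingerprint. Just a sketch of that idea - plain String.hashCode() instead of the rolling hash from the paper, and the class and parameter names are mine, not anything that exists in Nutch:

import java.util.LinkedHashSet;
import java.util.Set;

/**
 * Sketch of winnowing (Schleimer/Wilkerson/Aiken, SIGMOD'03):
 * hash all k-grams, slide a window of w hashes, and keep the minimum
 * of each window (rightmost one on ties) as the fingerprint.
 */
public class Winnow {

  public static Set<Integer> fingerprint(String text, int k, int w) {
    int n = text.length() - k + 1;          // number of k-grams
    Set<Integer> fp = new LinkedHashSet<Integer>();
    if (n < w) {
      return fp;                            // text too short for a full window
    }
    int[] hashes = new int[n];
    for (int i = 0; i < n; i++) {
      hashes[i] = text.substring(i, i + k).hashCode();
    }
    int prevMinPos = -1;
    for (int start = 0; start + w <= n; start++) {
      int minPos = start;
      for (int j = start + 1; j < start + w; j++) {
        if (hashes[j] <= hashes[minPos]) {  // '<=' picks the rightmost minimum
          minPos = j;
        }
      }
      if (minPos != prevMinPos) {           // record each selected hash only once
        fp.add(hashes[minPos]);
        prevMinPos = minPos;
      }
    }
    return fp;
  }

  public static void main(String[] args) {
    Set<Integer> a = fingerprint("the quick brown fox jumps over the lazy dog", 5, 4);
    Set<Integer> b = fingerprint("the quick brown fox leaps over the lazy dog", 5, 4);
    System.out.println("a: " + a);
    System.out.println("b: " + b);
  }
}

Two near-duplicate pages should then share most of their fingerprint hashes, so a simple overlap ratio between the two sets could decide whether the URLs get collapsed.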

There is a Signature implementation in Nutch that allows for small differences in text (TextProfileSignature), but I guess it's not sufficient in your case?
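(For completeness: if I remember the config right, it is switched on via the signature class property in conf/nutch-site.xml, something like the snippet below - double-check the property and class names against nutch-default.xml in your version.)

<property>
  <name>db.signature.class</name>
  <value>org.apache.nutch.crawl.TextProfileSignature</value>
</property>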


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
