Ken Krugler wrote:
[..]
> And even with deduping, we run into problems, especially for top-level
> pages. These often change slightly between crawls, so if
> http://example.com is found during one pass and a different
> http://www.example.com is found in a later crawl, you wind up with two
> hits for the same result. What's worse, the summary is typically exactly
> the same (taken from the body of the page), so to a user it's painfully
> obvious that there are (near) duplicates in the index.
>
> To solve this, I think a near-duplicate detector would need to be used
> when collapsing similar URLs. If you did this only when two URLs appear
> to be the same, I think it would be OK, as that's the most common case.
> Thus it could be somewhat computationally expensive (e.g. winnowing, a la
> http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf).
Interesting paper, thanks for the pointer - I've always wondered what
criteria to use to reduce the number of shingles, and winnowing is a
simple enough recipe for creating page signatures. I may be tempted to
implement it ;)
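
Something along these lines, perhaps - an untested Java sketch of the
winnowing idea from the paper (the k-gram length and window size are
arbitrary picks here, and String.hashCode() just stands in for a proper
rolling hash such as Karp-Rabin):

  import java.util.LinkedHashSet;
  import java.util.Set;

  /**
   * Toy winnowing fingerprinter: hash every k-gram of the text, slide a
   * window of W consecutive hashes, and keep the minimum hash of each
   * window (rightmost minimum on ties), as described in the SIGMOD'03
   * paper.
   */
  public class Winnow {

    private static final int K = 5;   // k-gram length (characters)
    private static final int W = 4;   // window size (number of hashes)

    public static Set<Integer> fingerprints(String text) {
      Set<Integer> selected = new LinkedHashSet<>();
      int n = text.length() - K + 1;
      if (n <= 0) {
        return selected;
      }
      // Hash all k-grams of the text.
      int[] hashes = new int[n];
      for (int i = 0; i < n; i++) {
        hashes[i] = text.substring(i, i + K).hashCode();
      }
      // Select the minimum hash in each window of W consecutive hashes,
      // preferring the rightmost occurrence; the set of selected hashes
      // is the page's fingerprint.
      for (int start = 0; start + W <= n; start++) {
        int minPos = start;
        for (int j = start + 1; j < start + W; j++) {
          if (hashes[j] <= hashes[minPos]) {
            minPos = j;
          }
        }
        selected.add(hashes[minPos]);
      }
      return selected;
    }
  }

Two pages whose fingerprint sets overlap above some threshold could then
be treated as near-duplicates when collapsing similar URLs.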
There is a Signature implementation in Nutch (TextProfileSignature) that
tolerates small differences in the text, but I guess it's not sufficient
in your case?
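
For context, the rough idea behind a profile-based signature is
something like this - a simplified, standalone illustration only, not
the actual Nutch code, with an arbitrary quantization step:

  import java.security.MessageDigest;
  import java.security.NoSuchAlgorithmException;
  import java.util.HashMap;
  import java.util.Map;
  import java.util.TreeMap;

  /**
   * Simplified profile-style signature: small changes in the text tend
   * to leave the quantized term-frequency profile, and hence the
   * digest, unchanged.
   */
  public class ProfileSignature {

    private static final int QUANT = 2; // quantization step (arbitrary)

    public static byte[] calculate(String text)
        throws NoSuchAlgorithmException {
      // Count token frequencies.
      Map<String, Integer> counts = new HashMap<>();
      for (String token : text.toLowerCase().split("\\W+")) {
        if (!token.isEmpty()) {
          counts.merge(token, 1, Integer::sum);
        }
      }
      // Quantize the counts and drop tokens below one quantum, so minor
      // wording differences disappear from the profile.
      TreeMap<String, Integer> profile = new TreeMap<>();
      for (Map.Entry<String, Integer> e : counts.entrySet()) {
        int q = (e.getValue() / QUANT) * QUANT;
        if (q > 0) {
          profile.put(e.getKey(), q);
        }
      }
      // Serialize the sorted profile and hash it.
      StringBuilder sb = new StringBuilder();
      for (Map.Entry<String, Integer> e : profile.entrySet()) {
        sb.append(e.getKey()).append(':').append(e.getValue()).append(' ');
      }
      return MessageDigest.getInstance("MD5")
          .digest(sb.toString().getBytes());
    }
  }

The quantization is what lets two slightly different versions of the
same top-level page hash to the same signature, which may or may not be
enough for the www/non-www case you describe.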
--
Best regards,
Andrzej Bialecki <><
http://www.sigram.com Contact: info at sigram dot com