Ken Krugler wrote:
[..]
> And even with deduping, we run into problems, especially for top-level
> pages. These often change slightly between crawls, so if
> http://example.com is found during one pass and a different
> http://www.example.com is found in a later crawl, you wind up with two
> hits for the same result. What's worse, the summary is typically exactly
> the same (taken from the body of the page), so to a user it's painfully
> obvious that there are (near) duplicates in the index.
>
> To solve this, I think a near-duplicate detector would need to be used
> when collapsing similar URLs. If you did this only when two URLs appear
> to be the same, I think it would be OK, as that's the most common case.
> Thus it could be somewhat computationally expensive (e.g. winnowing, a la
> http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf).
Interesting paper, thanks for the pointer - I've always wondered what
criteria to use to reduce the number of shingles, and winnowing is a
simple enough recipe for creating page signatures. I may be tempted to
implement it ;)
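
Something along these lines, perhaps - an untested Java sketch of the
winnowing idea from the paper (the k-gram length and window size are
arbitrary picks here, and String.hashCode() just stands in for a proper
rolling hash such as Karp-Rabin):

  import java.util.LinkedHashSet;
  import java.util.Set;

  /**
   * Toy winnowing fingerprinter: hash every k-gram of the text, slide a
   * window of W consecutive hashes, and keep the minimum hash of each
   * window (rightmost minimum on ties), as described in the SIGMOD'03
   * paper.
   */
  public class Winnow {

    private static final int K = 5;   // k-gram length (characters)
    private static final int W = 4;   // window size (number of hashes)

    public static Set<Integer> fingerprints(String text) {
      Set<Integer> selected = new LinkedHashSet<>();
      int n = text.length() - K + 1;
      if (n <= 0) {
        return selected;
      }
      // Hash all k-grams of the text.
      int[] hashes = new int[n];
      for (int i = 0; i < n; i++) {
        hashes[i] = text.substring(i, i + K).hashCode();
      }
      // Select the minimum hash in each window of W consecutive hashes,
      // preferring the rightmost occurrence; the set of selected hashes
      // is the page's fingerprint.
      for (int start = 0; start + W <= n; start++) {
        int minPos = start;
        for (int j = start + 1; j < start + W; j++) {
          if (hashes[j] <= hashes[minPos]) {
            minPos = j;
          }
        }
        selected.add(hashes[minPos]);
      }
      return selected;
    }
  }

Two pages whose fingerprint sets overlap above some threshold could then
be treated as near-duplicates when collapsing similar URLs.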
There is a Signature implementation in Nutch (TextProfileSignature) that
tolerates small differences in the text, but I guess it's not sufficient
in your case?
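
For context, the rough idea behind a profile-based signature is
something like this - a simplified, standalone illustration only, not
the actual Nutch code, with an arbitrary quantization step:

  import java.security.MessageDigest;
  import java.security.NoSuchAlgorithmException;
  import java.util.HashMap;
  import java.util.Map;
  import java.util.TreeMap;

  /**
   * Simplified profile-style signature: small changes in the text tend
   * to leave the quantized term-frequency profile, and hence the
   * digest, unchanged.
   */
  public class ProfileSignature {

    private static final int QUANT = 2; // quantization step (arbitrary)

    public static byte[] calculate(String text)
        throws NoSuchAlgorithmException {
      // Count token frequencies.
      Map<String, Integer> counts = new HashMap<>();
      for (String token : text.toLowerCase().split("\\W+")) {
        if (!token.isEmpty()) {
          counts.merge(token, 1, Integer::sum);
        }
      }
      // Quantize the counts and drop tokens below one quantum, so minor
      // wording differences disappear from the profile.
      TreeMap<String, Integer> profile = new TreeMap<>();
      for (Map.Entry<String, Integer> e : counts.entrySet()) {
        int q = (e.getValue() / QUANT) * QUANT;
        if (q > 0) {
          profile.put(e.getKey(), q);
        }
      }
      // Serialize the sorted profile and hash it.
      StringBuilder sb = new StringBuilder();
      for (Map.Entry<String, Integer> e : profile.entrySet()) {
        sb.append(e.getKey()).append(':').append(e.getValue()).append(' ');
      }
      return MessageDigest.getInstance("MD5")
          .digest(sb.toString().getBytes());
    }
  }

The quantization is what lets two slightly different versions of the
same top-level page hash to the same signature, which may or may not be
enough for the www/non-www case you describe.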
--
Best regards,
Andrzej Bialecki <><
http://www.sigram.com Contact: info at sigram dot com