Hi Andrzej,

Thanks for writing this up!

One small comment below...

I'm going to create a JIRA issue out of this discussion, but I think it's more convenient to first exchange our initial ideas here ...

[snip]

1. "Aliases" problem
---------------------------------------
This is a case where the same content is available from the same site under several equivalent URLs. Example:

   http://example.com/
   http://example.org/
   http://example.net/
   http://example.com/index.html
   http://www.example.com/
   http://www.example.com/index.html

These URLs yield the same page (there are no redirects involved here). For a human user it's obvious that they should be treated as one page. Another example would be sites that use farms of servers with round-robin DNS (e.g. IBM), so that there may be dozens or hundreds of different URLs like www-120.ibm.com/software/..., www-306.ibm.com/software/..., etc., to which users are redirected from http://www.ibm.com/, and which contain exactly the same content.

Currently Nutch addresses this issue only at the deduplication stage, selecting the shortest URL (which may or may not be the right choice), i.e. in the end we get http://example.com/ as the only remaining URL in the searchable index. IMHO users would expect that http://www.example.com/ would be the remaining one ... ? Also, we get 4 different URLs with 4 different statuses (e.g. fetch times) in CrawlDb, which is not good.
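
To make the shortest-URL behaviour described above concrete, here is a minimal Python sketch. It is an illustration only, not Nutch's actual code: the function name dedup_by_shortest_url, the (url, content) input shape, and the use of an MD5 digest of the raw content as the duplicate key are all assumptions.

```python
import hashlib
from collections import defaultdict


def dedup_by_shortest_url(pages):
    """Group pages by content digest and keep the shortest URL per group.

    `pages` is an iterable of (url, content) pairs (a hypothetical input
    shape chosen for this sketch).
    """
    groups = defaultdict(list)
    for url, content in pages:
        # Pages with byte-identical content share one digest.
        digest = hashlib.md5(content.encode("utf-8")).hexdigest()
        groups[digest].append(url)
    # Shortest URL wins; ties are broken lexicographically for determinism.
    return sorted(min(urls, key=lambda u: (len(u), u)) for urls in groups.values())


pages = [
    ("http://example.com/", "same body"),
    ("http://example.com/index.html", "same body"),
    ("http://www.example.com/", "same body"),
    ("http://www.example.com/index.html", "same body"),
]
survivors = dedup_by_shortest_url(pages)
```

With these four aliases the only survivor is http://example.com/, which is the outcome questioned above. Note that an exact digest only groups pages whose content is identical, so aliases that differ even slightly stay in separate groups.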

And even with deduping, we run into problems, especially for top-level pages.

These often change slightly between crawls, so if http://example.com is fetched during one pass, and a slightly different version of the page is fetched from http://www.example.com in a later crawl, you wind up with two hits for what is really one result. What's worse is that typically the summary is exactly the same (from the body of the page), so to a user it's painfully obvious that there are (near) duplicates in the index.

To solve this, I think a near-duplicate detector would need to be used when collapsing similar URLs. If you ran it only when two URLs appear to refer to the same page, I think it would be OK, as that's the most common case. Since it would only run on those candidate pairs, it could afford to be somewhat computationally expensive (e.g. winnowing, à la http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf).
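
A rough Python sketch of that idea, under several assumptions: alias_key is a hypothetical heuristic for "URLs that appear to be the same" (it only strips a leading "www." and a trailing "index.html"), the fingerprinting is a simplified form of winnowing that keeps just the minimum hash of each window and ignores positions, and the k-gram size, window size and 0.8 threshold are arbitrary illustrative values.

```python
import hashlib
import re
from urllib.parse import urlsplit


def alias_key(url):
    """Hypothetical heuristic: URLs with equal keys are alias candidates."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[len("www."):]
    path = parts.path or "/"
    if path.endswith("/index.html"):
        path = path[:-len("index.html")]
    return host + path


def winnow(text, k=5, window=4):
    """Simplified winnowing: the set of minimum k-gram hashes per window."""
    norm = re.sub(r"\W+", "", text.lower())  # drop whitespace and punctuation
    hashes = [
        int(hashlib.md5(norm[i:i + k].encode("utf-8")).hexdigest()[:8], 16)
        for i in range(len(norm) - k + 1)
    ]
    if not hashes:
        return set()
    if len(hashes) <= window:
        return {min(hashes)}
    return {min(hashes[i:i + window]) for i in range(len(hashes) - window + 1)}


def similarity(text_a, text_b):
    """Jaccard similarity of the two fingerprint sets, in [0.0, 1.0]."""
    fa, fb = winnow(text_a), winnow(text_b)
    if not fa and not fb:
        return 1.0
    return len(fa & fb) / len(fa | fb)


def should_collapse(url_a, text_a, url_b, text_b, threshold=0.8):
    """Run the expensive comparison only for alias-candidate URL pairs."""
    if alias_key(url_a) != alias_key(url_b):
        return False
    return similarity(text_a, text_b) >= threshold
```

The cheap URL check gates the content comparison, so the fingerprinting cost is paid only for pairs such as http://example.com/ and http://www.example.com/index.html. The threshold would need tuning against real crawl data.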

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"
