Hi Andrzej,
Thanks for writing this up!
One small comment below...
I'm going to create a JIRA issue out of this discussion, but I think
it's more convenient to first exchange our initial ideas here ...
[snip]
1. "Aliases" problem
---------------------------------------
This is a case where the same content is available from the same
site under several equivalent URLs. Example:
http://example.com/
http://example.org/
http://example.net/
http://example.com/index.html
http://www.example.com/
http://www.example.com/index.html
These URLs yield the same page (there are no redirects involved
here). For a human user it's obvious that they should be treated as
one page. Another example would be sites that use farms of servers
with round-robin DNS (e.g. IBM), so that there may be dozens or
hundreds of different URLs like www-120.ibm.com/software/...,
www-306.ibm.com/software/..., etc, to which users are redirected
from http://www.ibm.com/, and which contain exactly the same content.
Currently Nutch addresses this issue only at the deduplication
stage, selecting the shortest URL (which may or may not be the right
choice), i.e. in the end we get http://example.com/ as the only
remaining URL in the searchable index. IMHO users would expect that
http://www.example.com/ would be the remaining one ... ? Also, each
of these URLs gets its own entry, with its own status (e.g. fetch
time), in CrawlDb, which is not good.
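
To make the aliasing concrete, here is a minimal Python sketch of a heuristic that maps the same-host alias forms above onto one key. The name alias_key and its two rules (drop a leading "www.", drop a trailing index.html) are assumptions made for illustration, not something Nutch does today. It ignores scheme and port, and it can only nominate candidates, since nothing guarantees that such URLs really serve the same content.

```python
from urllib.parse import urlsplit


def alias_key(url):
    """Hypothetical heuristic: collapse common alias forms into one key."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Assumption: "www.example.com" and "example.com" are the same site.
    if host.startswith("www."):
        host = host[len("www."):]
    path = parts.path or "/"
    # Assumption: ".../index.html" is the same page as ".../".
    if path.endswith("/index.html"):
        path = path[: -len("index.html")]
    return (host, path, parts.query)


urls = [
    "http://example.com/",
    "http://example.com/index.html",
    "http://www.example.com/",
    "http://www.example.com/index.html",
]
# All four URLs fall into a single candidate group.
groups = {}
for u in urls:
    groups.setdefault(alias_key(u), []).append(u)
```

URLs that share a key are only "probably the same page"; a content comparison would still be needed before collapsing them.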
And even with deduping we run into problems, especially for top-level
pages. These often change slightly between crawls, so if
http://example.com is fetched during one pass and a slightly different
version of http://www.example.com is fetched in a later crawl, the two
no longer match exactly and you wind up with two hits for the same page.
What's worse is that typically the summary is exactly the same (from
the body of the page), so to a user it's painfully obvious that there
are (near) duplicates in the index.
To solve this, I think a near-duplicate detector would need to be
used when collapsing similar URLs. If you ran it only on pairs of URLs
that appear to be the same page, I think it would be OK, as that's the
most common case. Since that limits the number of comparisons, the
detector could be somewhat computationally expensive
(e.g. winnowing, a la
http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf).
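
As a rough illustration of what such a check might look like, here is a small Python sketch of winnowing-style fingerprinting: hash every k-gram of the normalized text, slide a window of w consecutive hashes, and keep the minimum from each window. The names winnow and similarity, the parameters k=5 and w=4, the use of CRC32 as the hash, and the Jaccard comparison of the fingerprint sets are all assumptions picked for the example, not tuned values or anything taken from Nutch.

```python
import zlib


def winnow(text, k=5, w=4):
    """Return the set of winnowed k-gram hashes for text."""
    # Normalize so case and whitespace changes don't affect fingerprints.
    norm = " ".join(text.lower().split())
    # CRC32 is used only because it is deterministic across runs.
    hashes = [zlib.crc32(norm[i:i + k].encode("utf-8"))
              for i in range(len(norm) - k + 1)]
    if not hashes:
        return set()
    if len(hashes) <= w:
        return {min(hashes)}
    # Keep the minimum hash of every window of w consecutive k-grams.
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}


def similarity(a, b, k=5, w=4):
    """Jaccard similarity of the two fingerprint sets, from 0.0 to 1.0."""
    fa, fb = winnow(a, k, w), winnow(b, k, w)
    union = fa | fb
    if not union:
        return 1.0  # two empty documents are treated as identical
    return len(fa & fb) / len(union)
```

Two fetches of a top-level page that differ only in a date line should score close to 1.0, while unrelated pages should score near 0.0. The cutoff for "same page" is something that would have to be chosen by experiment.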
-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"