Re: Redirects and alias handling (LONG)

Ken Krugler Tue, 14 Aug 2007 18:10:31 -0700

Hi Andrzej,

Thanks for writing this up!


One small comment below...

I'm going to create a JIRA issue out of this discussion, but I think it's more convenient to first exchange our initial ideas here ...


[snip]

1. "Aliases" problem
---------------------------------------
This is a case where the same content is available from the same site under several equivalent URLs. Example:
   http://example.com/
   http://example.org/
   http://example.net/
   http://example.com/index.html
   http://www.example.com/
   http://www.example.com/index.html
These URLs yield the same page (there are no redirects involved here). For a human user it's obvious that they should be treated as one page. Another example would be sites that use farms of servers with round-robin DNS (e.g. IBM), so that there may be dozens or hundreds of different URLs like www-120.ibm.com/software/..., www-306.ibm.com/software/..., etc., to which users are redirected from http://www.ibm.com/, and which contain exactly the same content.
Currently Nutch addresses this issue only at the deduplication stage, selecting the shortest URL (which may or may not be the right choice), i.e. in the end we get http://example.com/ as the only remaining URL in the searchable index. IMHO users would expect that http://www.example.com/ would be the remaining one ... ? Also, we get 4 different URLs with 4 different statuses (e.g. fetch times) in CrawlDb, which is not good.


And even with deduping, we run into problems, especially for top-level pages.

These often change slightly between crawls, so if http://example.com is found during one pass, and a different http://www.example.com is found at a later crawl, you wind up with two hits for a result. What's worse is that typically the summary is exactly the same (from the body of the page), so to a user it's painfully obvious that there are (near) duplicates in the index.

To solve this, I think a near-duplicate detector would need to be used when collapsing similar URLs. If you did this only when two URLs appear to be the same, I think it would be OK, as that's the most common case. Because the detector would only run on those candidate pairs, it could afford to be somewhat computationally expensive (e.g. winnowing, a la http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf).
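
To make the idea above concrete, here is a rough Python sketch of it; this is not Nutch code. It assumes that "appear to be the same" means the URLs differ only by a leading "www." or a trailing "index.html", and that a Jaccard overlap of 0.8 between winnowing fingerprints counts as a near duplicate. The names alias_key, winnow, similarity and should_collapse, the k-gram and window sizes, and the threshold are all made up for illustration.

```python
import hashlib
from urllib.parse import urlsplit


def alias_key(url):
    """Hypothetical heuristic: URLs that differ only by a leading 'www.'
    or a trailing 'index.html' map to the same key (alias candidates)."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path or "/"
    if path.endswith("/index.html"):
        path = path[: -len("index.html")]
    return host + path


def kgram_hashes(text, k):
    """Hash every k-character substring of the case/whitespace-normalized text."""
    norm = " ".join(text.lower().split())
    return [
        int.from_bytes(hashlib.md5(norm[i:i + k].encode("utf-8")).digest()[:8], "big")
        for i in range(len(norm) - k + 1)
    ]


def winnow(text, k=5, w=4):
    """Winnowing: in each window of w consecutive k-gram hashes keep the
    minimum (rightmost on ties). Returns the set of selected hash values."""
    hashes = kgram_hashes(text, k)
    selected = {}  # position -> hash, so a position is recorded only once
    for start in range(max(1, len(hashes) - w + 1)):
        window = hashes[start:start + w]
        if not window:
            break
        smallest = min(window)
        offset = max(i for i, h in enumerate(window) if h == smallest)
        selected[start + offset] = smallest
    return set(selected.values())


def similarity(text_a, text_b):
    """Jaccard overlap of the two fingerprint sets (1.0 if both are empty)."""
    fa, fb = winnow(text_a), winnow(text_b)
    if not fa and not fb:
        return 1.0
    return len(fa & fb) / len(fa | fb)


def should_collapse(url_a, text_a, url_b, text_b, threshold=0.8):
    """Run the expensive content check only for alias candidates."""
    if alias_key(url_a) != alias_key(url_b):
        return False
    return similarity(text_a, text_b) >= threshold
```

The cheap URL check runs first, so the fingerprinting cost is paid only for pairs that already look like aliases. The threshold would need tuning against real top-level pages that change slightly between crawls.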


-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"
