Hi all,
I'm going to create a JIRA issue out of this discussion, but I think
it's more convenient to first exchange our initial ideas here ...
Redirect handling is a difficult subject for all search engines, but the
way it's currently done in Nutch could use some improvement. The same
goes for handling aliases, i.e. the same site being accessible via
slightly different non-canonical URLs (not mirrors, but the same site),
which cannot easily be handled by URL normalizers.
A. Problem description
======================
1. "Aliases" problem
--------------------
This is a case where the same content is available from the same site
under several equivalent URLs. Example:
http://example.com/
http://example.org/
http://example.net/
http://example.com/index.html
http://www.example.com/
http://www.example.com/index.html
These URLs yield the same page (there are no redirects involved here).
For a human user it's obvious that they should be treated as one page.
Another example would be sites that use farms of servers with
round-robin DNS (e.g. IBM), so that there may be dozens or hundreds of
different URLs like www-120.ibm.com/software/...,
www-306.ibm.com/software/..., etc., to which users are redirected from
http://www.ibm.com/, and which contain exactly the same content.
Currently Nutch addresses this issue only at the deduplication stage,
selecting the shortest URL (which may or may not be the right choice),
i.e. in the end we get http://example.com/ as the only remaining URL in
the searchable index. IMHO users would expect that
http://www.example.com/ would be the remaining one ... ? Also, we get 4
different URLs with 4 different statuses (e.g. fetch times) in CrawlDb,
which is not good.
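
For illustration, here's a rough sketch of a selection policy along
these lines - all names and heuristics below are made up, this is not
the existing deduplicator code:

import java.net.URL;
import java.util.Comparator;

// Lower score = more "canonical"; the winner among duplicates sorts first.
public class CanonicalUrlComparator implements Comparator<String> {

  private int score(String spec) {
    try {
      URL u = new URL(spec);
      int s = 0;
      // Prefer www.example.com over example.com (an assumption, not always true).
      if (!u.getHost().startsWith("www.")) s += 2;
      // Prefer "/" over /index.html and friends.
      String path = u.getPath();
      if (path.length() > 0 && !path.equals("/")) s += 1;
      return s;
    } catch (Exception e) {
      return Integer.MAX_VALUE;
    }
  }

  public int compare(String a, String b) {
    int diff = score(a) - score(b);
    // Fall back to the current shortest-URL heuristic as a tie-breaker.
    return diff != 0 ? diff : a.length() - b.length();
  }
}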
Unfortunately, we cannot blindly assume that www.example.com and
example.com are equivalent aliases - a landmark example of this is
http://internic.net/ versus http://www.internic.net/, which give two
different pages.
This dilemma could probably be resolved by analyzing the close webgraph
neighbourhood of the duplicate pages. Google improves its
results with manual intervention of site owners:
http://www.google.com/support/webmasters/bin/answer.py?answer=44232
This addresses only the www.example.com versus example.com issue; the
issue of / vs. /index.html vs. /index.htm vs. /default.asp is
apparently handled through some other means.
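
Just to illustrate the idea, a rough sketch of such an alias check -
content signatures plus a simple inlink-overlap test on the webgraph
neighbourhood; the helper maps and the similarity threshold are
assumptions, not existing Nutch API:

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class AliasDetector {

  /** True if two URLs look like aliases of the same page: identical content
   *  signature AND sufficiently overlapping inlink neighbourhoods.
   *  internic.net vs. www.internic.net would already fail the signature test. */
  public boolean probableAliases(String a, String b,
                                 Map<String, String> signatureOf,
                                 Map<String, Set<String>> inlinksOf) {
    String sa = signatureOf.get(a), sb = signatureOf.get(b);
    if (sa == null || !sa.equals(sb)) return false;
    Set<String> ia = inlinksOf.get(a), ib = inlinksOf.get(b);
    if (ia == null || ib == null || ia.isEmpty() || ib.isEmpty()) return false;
    Set<String> common = new HashSet<String>(ia);
    common.retainAll(ib);
    Set<String> union = new HashSet<String>(ia);
    union.addAll(ib);
    // Jaccard similarity of the inlink sets; the 0.5 threshold is arbitrary.
    return (double) common.size() / union.size() > 0.5;
  }
}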
Finally, a few interesting queries to run that show how Google treats
such aliases:
http://www.google.com/search?q=site:example.com&hl=en&filter=0
http://www.google.com/search?q=site:www.example.com&hl=en&filter=0
http://www.google.com/search?hl=en&lr=&as_qdr=all&q=link%3Ahttp%3A%2F%2Fwww.example.org
(especially interesting is the result under this URL:
http://lists.w3.org/Archives/Public/w3c-dist-auth/1999OctDec/0180.html)
http://www.google.com/search?hl=en&lr=&as_qdr=all&q=link%3Ahttp%3A%2F%2Fwww.example.org%2Findex.html
http://www.google.com/search?hl=en&lr=&as_qdr=all&q=link%3Ahttp%3A%2F%2Fexample.org%2Findex.html
http://www.google.com/search?hl=en&lr=&as_qdr=all&q=link%3Ahttp%3A%2F%2Fexample.org
2. Redirected pages
-------------------
First, the standard defines pretty clearly the meaning of 301 versus 302
redirects at the HTTP protocol level:
http://www.ietf.org/rfc/rfc2616.txt
This is reflected in the common practice of major search engines:
http://www.google.com/support/webmasters/bin/answer.py?answer=40151
http://help.yahoo.com/l/nz/yahooxtra/search/webcrawler/slurp-11.html
JavaScript redirects are a separate case - e.g. Yahoo treats them
differently depending on the delay between redirects. One scenario not
described there is how to treat cross-protocol redirects to the same
domain or the same URL path. Example: http://www.example.com/secure ->
https://www.example.com/
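
For completeness, a trivial sketch of detecting this case (a
hypothetical helper, not existing Nutch code):

import java.net.MalformedURLException;
import java.net.URL;

public class RedirectClassifier {

  /** True for redirects that only switch protocol on the same host,
   *  like http://www.example.com/secure -> https://www.example.com/. */
  public static boolean isCrossProtocolSameHost(String from, String to)
      throws MalformedURLException {
    URL f = new URL(from), t = new URL(to);
    return !f.getProtocol().equals(t.getProtocol())
        && f.getHost().equalsIgnoreCase(t.getHost());
  }
}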
Recent versions of Nutch introduced specific status codes for pages
redirected permanently or temporarily, and target URLs are stored in
CrawlDatum.metadata. However, this information is not used anywhere at
the moment.
I propose to make the necessary modifications to follow the algorithm
described on Yahoo's pages referenced above. Note that when that page
says "Yahoo indexes the 'source' URL" it really means that it processes
the content of the target page, but puts it under the source URL.
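
In code, the rule boils down to something like this (hypothetical
types, just to pin down the interpretation above):

public class IndexUrlSelector {

  public enum RedirectType { PERMANENT, TEMPORARY }

  /** Permanent (301) redirect: index the content under the target URL.
   *  Temporary (302) redirect: index the target's content under the
   *  source URL - which is what "Yahoo indexes the 'source' URL" means. */
  public static String urlToIndex(String source, String target, RedirectType type) {
    return type == RedirectType.PERMANENT ? target : source;
  }
}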
And this brings another interesting topic ...
3. Link and anchor information for aliases and redirects.
---------------------------------------------------------
This issue has been briefly discussed in NUTCH-353. Inlink information
from all "aliases" should be merged, i.e. aggregated and assigned to a
selected canonical target URL. See also the sample Google queries above.
B. Design and implementation
============================
In order to select the correct "canonical" URL at each stage in
redirection handling we should keep the accumulated "redirection path",
which includes source URLs and redirection methods (temporary/permanent,
protocol or content-level redirect, redirect delay). This way, when we
arrive at the final page in the redirection path, we should be able to
select the canonical URL.
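
A rough sketch of what such an accumulated redirection path could look
like - all names are hypothetical, and the selection policy shown is
just one possibility:

import java.util.ArrayList;
import java.util.List;

public class RedirectPath {

  /** One hop in the path: the source URL plus the redirection method. */
  public static class Hop {
    final String sourceUrl;
    final boolean permanent;
    final boolean contentLevel;   // meta-refresh / JS vs. HTTP-level
    final int delaySeconds;       // relevant for meta-refresh redirects
    Hop(String sourceUrl, boolean permanent, boolean contentLevel, int delaySeconds) {
      this.sourceUrl = sourceUrl;
      this.permanent = permanent;
      this.contentLevel = contentLevel;
      this.delaySeconds = delaySeconds;
    }
  }

  private final List<Hop> hops = new ArrayList<Hop>();

  public void addHop(Hop hop) { hops.add(hop); }

  /** One possible policy: the canonical URL is the source of the last
   *  temporary hop (the content stays under it), or the final target
   *  when every hop was permanent. finalTarget is where the path ended. */
  public String selectCanonical(String finalTarget) {
    for (int i = hops.size() - 1; i >= 0; i--) {
      if (!hops.get(i).permanent) return hops.get(i).sourceUrl;
    }
    return finalTarget;
  }
}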
We should also specify which intermediate URL we accept as the current
"canonical" URL in case we haven't yet reached the end of redirections
(e.g. when we don't follow redirects immediately, but only record them
to be used in the next cycle).
We should introduce an "alias" status in CrawlDb and LinkDb, which
indicates that a given URL is a non-canonical alias of another URL. In
CrawlDb, we should copy all accumulated metadata and put it into the
target canonical CrawlDatum. In LinkDb, we should merge all inlinks
pointing to non-canonical URLs so that they are assigned to the
canonical URL. In both cases we should still keep the non-canonical URLs
in CrawlDb and LinkDb - however we could decide not to keep any of the
metadata / inlinks there, just an "alias" flag and a pointer to the
canonical URL where all aggregated data is stored. CrawlDbReader and
LinkDbReader may or may not hide this fact from their users - I think it
would be more efficient if users of this API got the final
aggregated data right away, perhaps with an indicator that it was
obtained using a non-canonical URL ...
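
A simplified sketch of the LinkDb-side merge, with plain collections
standing in for the real CrawlDb/LinkDb records:

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class AliasMerger {

  /** Re-assigns inlinks from non-canonical aliases to their canonical URL.
   *  inlinksByUrl: url -> set of inlink URLs; canonicalOf: alias -> canonical. */
  public static Map<String, Set<String>> mergeInlinks(
      Map<String, Set<String>> inlinksByUrl,
      Map<String, String> canonicalOf) {
    Map<String, Set<String>> merged = new HashMap<String, Set<String>>();
    for (Map.Entry<String, Set<String>> e : inlinksByUrl.entrySet()) {
      String url = e.getKey();
      String canonical = canonicalOf.containsKey(url) ? canonicalOf.get(url) : url;
      Set<String> all = merged.get(canonical);
      if (all == null) merged.put(canonical, all = new HashSet<String>());
      all.addAll(e.getValue());
    }
    return merged;
  }
}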
Regarding Lucene indexes - we could either duplicate all data for each
non-canonical URL, i.e. create as many full-blown Lucene documents as
there are aliases, or we could create special "redirect" documents
that would point to a URL which contains the full data ...
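
The second option could look roughly like this, using the Lucene
2.x-era Field API (the field names are made up, not existing Nutch
schema):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class RedirectDocumentFactory {

  /** A lightweight "redirect" document: it stores the alias URL and a
   *  pointer to the canonical URL that carries the full-blown document. */
  public static Document redirectDoc(String aliasUrl, String canonicalUrl) {
    Document doc = new Document();
    doc.add(new Field("url", aliasUrl, Field.Store.YES, Field.Index.UN_TOKENIZED));
    // Stored only, not searchable - resolved at search/summary time.
    doc.add(new Field("redirect", canonicalUrl, Field.Store.YES, Field.Index.NO));
    return doc;
  }
}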
That's it for now ... Any comments or suggestions to the above are welcome!
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com