Hi all,
I'm going to create a JIRA issue out of this discussion, but I think
it's more convenient to first exchange our initial ideas here ...
Redirect handling is a difficult subject for all search engines, but the
way it's currently done in Nutch could use some improvement. The same
goes for handling aliases, i.e. the same site being accessible via
slightly different non-canonical URLs (not mirrors, but the same site),
which cannot easily be handled by URL normalizers.
A. Problem description
======================
1. "Aliases" problem
--------------------
This is a case where the same content is available from the same site
under several equivalent URLs. Example:
http://example.com/
http://example.org/
http://example.net/
http://example.com/index.html
http://www.example.com/
http://www.example.com/index.html
These URLs yield the same page (there are no redirects involved here).
For a human user it's obvious that they should be treated as one page.
Another example would be sites that use farms of servers with
round-robin DNS (e.g. IBM), so that there may be dozens or hundreds of
different URLs like www-120.ibm.com/software/...,
www-306.ibm.com/software/..., etc., to which users are redirected from
http://www.ibm.com/, and which contain exactly the same content.
Currently Nutch addresses this issue only at the deduplication stage,
selecting the shortest URL (which may or may not be the right choice),
i.e. in the end we get http://example.com/ as the only remaining URL in
the searchable index. IMHO users would expect that
http://www.example.com/ would be the remaining one ... ? Also, we get 4
different URLs with 4 different statuses (e.g. fetch times) in CrawlDb,
which is not good.
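
For illustration, here's a rough sketch of a selection policy along
these lines - all names and heuristics below are made up, this is not
the existing deduplicator code:

import java.net.URL;
import java.util.Comparator;

// Lower score = more "canonical"; the winner among duplicates sorts first.
public class CanonicalUrlComparator implements Comparator<String> {

  private int score(String spec) {
    try {
      URL u = new URL(spec);
      int s = 0;
      // Prefer www.example.com over example.com (an assumption, not always true).
      if (!u.getHost().startsWith("www.")) s += 2;
      // Prefer "/" over /index.html and friends.
      String path = u.getPath();
      if (path.length() > 0 && !path.equals("/")) s += 1;
      return s;
    } catch (Exception e) {
      return Integer.MAX_VALUE;
    }
  }

  public int compare(String a, String b) {
    int diff = score(a) - score(b);
    // Fall back to the current shortest-URL heuristic as a tie-breaker.
    return diff != 0 ? diff : a.length() - b.length();
  }
}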
Unfortunately, we cannot blindly assume that www.example.com and
example.com are equivalent aliases - a landmark example of this is
http://internic.net/ versus http://www.internic.net/, which give two
different pages.
This dilemma could probably be resolved by analyzing the close webgraph
neighbourhood of the duplicate pages. Google improves its
results with manual intervention of site owners:
http://www.google.com/support/webmasters/bin/answer.py?answer=44232
This addresses only the www.example.com versus example.com issue; the
issue of / vs. /index.html vs. /index.htm vs. /default.asp is
apparently handled through some other means.
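
Just to illustrate the idea, a rough sketch of such an alias check -
content signatures plus a simple inlink-overlap test on the webgraph
neighbourhood; the helper maps and the similarity threshold are
assumptions, not existing Nutch API:

import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class AliasDetector {

  /** True if two URLs look like aliases of the same page: identical content
   *  signature AND sufficiently overlapping inlink neighbourhoods.
   *  internic.net vs. www.internic.net would already fail the signature test. */
  public boolean probableAliases(String a, String b,
                                 Map<String, String> signatureOf,
                                 Map<String, Set<String>> inlinksOf) {
    String sa = signatureOf.get(a), sb = signatureOf.get(b);
    if (sa == null || !sa.equals(sb)) return false;
    Set<String> ia = inlinksOf.get(a), ib = inlinksOf.get(b);
    if (ia == null || ib == null || ia.isEmpty() || ib.isEmpty()) return false;
    Set<String> common = new HashSet<String>(ia);
    common.retainAll(ib);
    Set<String> union = new HashSet<String>(ia);
    union.addAll(ib);
    // Jaccard similarity of the inlink sets; the 0.5 threshold is arbitrary.
    return (double) common.size() / union.size() > 0.5;
  }
}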
Finally, a few interesting queries to run that show how Google treats
such aliases:
http://www.google.com/search?q=site:example.com&hl=en&filter=0
http://www.google.com/search?q=site:www.example.com&hl=en&filter=0
http://www.google.com/search?hl=en&lr=&as_qdr=all&q=link%3Ahttp%3A%2F%2Fwww.example.org
(especially interesting is the result under this URL:
http://lists.w3.org/Archives/Public/w3c-dist-auth/1999OctDec/0180.html)
http://www.google.com/search?hl=en&lr=&as_qdr=all&q=link%3Ahttp%3A%2F%2Fwww.example.org%2Findex.html
http://www.google.com/search?hl=en&lr=&as_qdr=all&q=link%3Ahttp%3A%2F%2Fexample.org%2Findex.html
http://www.google.com/search?hl=en&lr=&as_qdr=all&q=link%3Ahttp%3A%2F%2Fexample.org
2. Redirected pages
-------------------
First, the standard defines pretty clearly the meaning of 301 versus 302
redirects at the HTTP protocol level:
http://www.ietf.org/rfc/rfc2616.txt
This is reflected in the common practice of major search engines:
http://www.google.com/support/webmasters/bin/answer.py?answer=40151
http://help.yahoo.com/l/nz/yahooxtra/search/webcrawler/slurp-11.html
JavaScript redirects are a separate case - e.g. Yahoo treats them
differently depending on the delay between redirects. One scenario not
described there is how to treat cross-protocol redirects to the same
domain or the same URL path. Example: http://www.example.com/secure ->
https://www.example.com/
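
For completeness, a trivial sketch of detecting this case (a
hypothetical helper, not existing Nutch code):

import java.net.MalformedURLException;
import java.net.URL;

public class RedirectClassifier {

  /** True for redirects that only switch protocol on the same host,
   *  like http://www.example.com/secure -> https://www.example.com/. */
  public static boolean isCrossProtocolSameHost(String from, String to)
      throws MalformedURLException {
    URL f = new URL(from), t = new URL(to);
    return !f.getProtocol().equals(t.getProtocol())
        && f.getHost().equalsIgnoreCase(t.getHost());
  }
}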
Recent versions of Nutch introduced specific status codes for pages
redirected permanently or temporarily, and target URLs are stored in
CrawlDatum.metadata. However, this information is not used anywhere at
the moment.
I propose to make the necessary modifications to follow the algorithm
described on Yahoo's pages referenced above. Note that when that page
says "Yahoo indexes the 'source' URL" it really means that it processes
the content of the target page, but puts it under the source URL.
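
In code, the rule boils down to something like this (hypothetical
types, just to pin down the interpretation above):

public class IndexUrlSelector {

  public enum RedirectType { PERMANENT, TEMPORARY }

  /** Permanent (301) redirect: index the content under the target URL.
   *  Temporary (302) redirect: index the target's content under the
   *  source URL - which is what "Yahoo indexes the 'source' URL" means. */
  public static String urlToIndex(String source, String target, RedirectType type) {
    return type == RedirectType.PERMANENT ? target : source;
  }
}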
And this brings another interesting topic ...
3. Link and anchor information for aliases and redirects.
---------------------------------------------------------
This issue has been briefly discussed in NUTCH-353. Inlink information
from all "aliases" should be merged, i.e. aggregated and assigned to a
selected canonical target URL. See also the sample Google queries above.
B. Design and implementation
============================
In order to select the correct "canonical" URL at each stage in
redirection handling we should keep the accumulated "redirection path",
which includes source URLs and redirection methods (temporary/permanent,
protocol or content-level redirect, redirect delay). This way, when we
arrive at the final page in the redirection path, we should be able to
select the canonical URL.
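
A rough sketch of what such an accumulated redirection path could look
like - all names are hypothetical, and the selection policy shown is
just one possibility:

import java.util.ArrayList;
import java.util.List;

public class RedirectPath {

  /** One hop in the path: the source URL plus the redirection method. */
  public static class Hop {
    final String sourceUrl;
    final boolean permanent;
    final boolean contentLevel;   // meta-refresh / JS vs. HTTP-level
    final int delaySeconds;       // relevant for meta-refresh redirects
    Hop(String sourceUrl, boolean permanent, boolean contentLevel, int delaySeconds) {
      this.sourceUrl = sourceUrl;
      this.permanent = permanent;
      this.contentLevel = contentLevel;
      this.delaySeconds = delaySeconds;
    }
  }

  private final List<Hop> hops = new ArrayList<Hop>();

  public void addHop(Hop hop) { hops.add(hop); }

  /** One possible policy: the canonical URL is the source of the last
   *  temporary hop (the content stays under it), or the final target
   *  when every hop was permanent. finalTarget is where the path ended. */
  public String selectCanonical(String finalTarget) {
    for (int i = hops.size() - 1; i >= 0; i--) {
      if (!hops.get(i).permanent) return hops.get(i).sourceUrl;
    }
    return finalTarget;
  }
}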
We should also specify which intermediate URL we accept as the current
"canonical" URL in case we haven't yet reached the end of redirections
(e.g. when we don't follow redirects immediately, but only record them
to be used in the next cycle).
We should introduce an "alias" status in CrawlDb and LinkDb, which
indicates that a given URL is a non-canonical alias of another URL. In
CrawlDb, we should copy all accumulated metadata and put it into the
target canonical CrawlDatum. In LinkDb, we should merge all inlinks
pointing to non-canonical URLs so that they are assigned to the
canonical URL. In both cases we should still keep the non-canonical URLs
in CrawlDb and LinkDb - however we could decide not to keep any of the
metadata / inlinks there, just an "alias" flag and a pointer to the
canonical URL where all aggregated data is stored. CrawlDbReader and
LinkDbReader may or may not hide this fact from their users - I think it
would be more efficient if users of this API got the final
aggregated data right away, perhaps with an indicator that it was
obtained using a non-canonical URL ...
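
A simplified sketch of the LinkDb-side merge, with plain collections
standing in for the real CrawlDb/LinkDb records:

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class AliasMerger {

  /** Re-assigns inlinks from non-canonical aliases to their canonical URL.
   *  inlinksByUrl: url -> set of inlink URLs; canonicalOf: alias -> canonical. */
  public static Map<String, Set<String>> mergeInlinks(
      Map<String, Set<String>> inlinksByUrl,
      Map<String, String> canonicalOf) {
    Map<String, Set<String>> merged = new HashMap<String, Set<String>>();
    for (Map.Entry<String, Set<String>> e : inlinksByUrl.entrySet()) {
      String url = e.getKey();
      String canonical = canonicalOf.containsKey(url) ? canonicalOf.get(url) : url;
      Set<String> all = merged.get(canonical);
      if (all == null) merged.put(canonical, all = new HashSet<String>());
      all.addAll(e.getValue());
    }
    return merged;
  }
}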
Regarding Lucene indexes - we could either duplicate all data for each
non-canonical URL, i.e. create as many full-blown Lucene documents as
there are aliases, or we could create special "redirect" documents
that would point to a URL which contains the full data ...
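
The second option could look roughly like this, using the Lucene
2.x-era Field API (the field names are made up, not existing Nutch
schema):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class RedirectDocumentFactory {

  /** A lightweight "redirect" document: it stores the alias URL and a
   *  pointer to the canonical URL that carries the full-blown document. */
  public static Document redirectDoc(String aliasUrl, String canonicalUrl) {
    Document doc = new Document();
    doc.add(new Field("url", aliasUrl, Field.Store.YES, Field.Index.UN_TOKENIZED));
    // Stored only, not searchable - resolved at search/summary time.
    doc.add(new Field("redirect", canonicalUrl, Field.Store.YES, Field.Index.NO));
    return doc;
  }
}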
That's it for now ... Any comments or suggestions to the above are welcome!
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com