Hi,

Thanks for bringing this up, Andrzej. There are some excellent pointers/remarks here.
On 8/15/07, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> Hi all,
>
> I'm going to create a JIRA issue out of this discussion, but I think
> it's more convenient to first exchange our initial ideas here ...
>
> Redirect handling is a difficult subject for all search engines, but the
> way it's currently done in Nutch could use some improvement. The same
> goes for handling aliases, i.e. the same sites that are accessible via
> slightly different non-canonical URLs (i.e. they are not mirrors but the
> same sites), which cannot be easily handled by URL normalizers.
>
> A. Problem description
> ======================
>
> 1. "Aliases" problem
> --------------------
> This is a case where the same content is available from the same site
> under several equivalent URLs. Example:
>
> http://example.com/
> http://example.org/
> http://example.net/
> http://example.com/index.html
> http://www.example.com/
> http://www.example.com/index.html
>
> These URLs yield the same page (there are no redirects involved here).
> For a human user it's obvious that they should be treated as one page.
> Another example would be sites that use farms of servers with
> round-robin DNS (e.g. IBM), so that there may be dozens or hundreds of
> different URLs like www-120.ibm.com/software/...,
> www-306.ibm.com/software/..., etc., to which users are redirected from
> http://www.ibm.com/, and which contain exactly the same content.
>
> Currently Nutch addresses this issue only at the deduplication stage,
> selecting the shortest URL (which may or may not be the right choice),

A small addition: Nutch can also select the page with the highest score.

> i.e. in the end we get http://example.com/ as the only remaining URL in
> the searchable index. IMHO users would expect that
> http://www.example.com/ would be the remaining one ... ? Also, we get 4
> different URLs with 4 different statuses (e.g. fetch times) in CrawlDb,
> which is not good.
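To make the selection policies concrete, here is a minimal sketch (not actual Nutch code; the class, method names, and scores are made up) contrasting the current shortest-URL heuristic with a highest-score rule that breaks ties by URL length:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of picking a "canonical" URL from a set of aliases. */
public class CanonicalSelector {

  /** Current dedup heuristic: the shortest URL wins. */
  static String byShortestUrl(List<String> aliases) {
    return aliases.stream()
        .min(Comparator.comparingInt(String::length))
        .orElseThrow();
  }

  /** Alternative: the highest-scoring URL wins; a shorter URL breaks ties. */
  static String byHighestScore(Map<String, Float> scoredAliases) {
    return scoredAliases.entrySet().stream()
        .min(Comparator.<Map.Entry<String, Float>>comparingDouble(e -> -e.getValue())
            .thenComparingInt(e -> e.getKey().length()))
        .orElseThrow()
        .getKey();
  }

  public static void main(String[] args) {
    List<String> aliases = List.of(
        "http://example.com/",
        "http://www.example.com/",
        "http://www.example.com/index.html");

    // The shortest-URL rule keeps http://example.com/ ...
    System.out.println(byShortestUrl(aliases));

    // ... while a score-based rule can keep http://www.example.com/
    // if that alias has accumulated the most link score (made-up numbers).
    System.out.println(byHighestScore(Map.of(
        "http://example.com/", 4.0f,
        "http://www.example.com/", 8.0f,
        "http://www.example.com/index.html", 1.0f)));
  }
}
```

Under the score-based rule the "expected" www.example.com/ survives whenever it is the better-linked alias, instead of always losing to the shorter bare-domain URL.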
> Unfortunately, we cannot blindly assume that www.example.com and
> example.com are equivalent aliases - a landmark example of this is
> http://internic.net/ versus http://www.internic.net/, which give two
> different pages.
>
> Probably this dilemma can be resolved by doing a graph analysis of a
> close webgraph neighbourhood of duplicate pages. Google improves its
> results with manual intervention of site owners:
>
> http://www.google.com/support/webmasters/bin/answer.py?answer=44232
>
> This addresses only the www.example.com versus example.com issue;
> apparently the issue of / vs. /index.html vs. /index.htm vs.
> /default.asp is handled through some other means.
>
> Finally, a few interesting queries to run that show how Google treats
> such aliases:
>
> http://www.google.com/search?q=site:example.com&hl=en&filter=0
> http://www.google.com/search?q=site:www.example.com&hl=en&filter=0
> http://www.google.com/search?hl=en&lr=&as_qdr=all&q=link%3Ahttp%3A%2F%2Fwww.example.org
> (especially interesting is the result under this URL:
> http://lists.w3.org/Archives/Public/w3c-dist-auth/1999OctDec/0180.html)
> http://www.google.com/search?hl=en&lr=&as_qdr=all&q=link%3Ahttp%3A%2F%2Fwww.example.org%2Findex.html
> http://www.google.com/search?hl=en&lr=&as_qdr=all&q=link%3Ahttp%3A%2F%2Fexample.org%2Findex.html
> http://www.google.com/search?hl=en&lr=&as_qdr=all&q=link%3Ahttp%3A%2F%2Fexample.org

If the same content is available under multiple URLs, I think it makes
sense to assume that the URL with the highest score should be 'the
representative' URL.

> 2. Redirected pages
> -------------------
>
> First, the standard defines pretty clearly the meaning of 301 versus 302
> redirects, on the HTTP protocol level:
>
> http://www.ietf.org/rfc/rfc2616.txt
>
> Which is reflected in the common practice of major search engines:
>
> http://www.google.com/support/webmasters/bin/answer.py?answer=40151
> http://help.yahoo.com/l/nz/yahooxtra/search/webcrawler/slurp-11.html
>
> Javascript redirects are treated differently, e.g. Yahoo treats them
> differently depending on the time between redirects. One scenario not
> described there is how to treat cross-protocol redirects to the same
> domain or the same URL path. Example: http://www.example.com/secure ->
> https://www.example.com/
>
> Recent versions of Nutch introduced specific status codes for pages
> redirected permanently or temporarily, and target URLs are stored in
> CrawlDatum.metadata. However, this information is not used anywhere at
> the moment.
>
> I propose to make the necessary modifications to follow the algorithm
> described on Yahoo's pages referenced above. Note that when that page
> says "Yahoo indexes the 'source' URL" it really means that it processes
> the content of the target page, but puts it under the source URL.

+1. Yahoo's algorithm looks very solid.

> And this brings another interesting topic ...
>
> 3. Link and anchor information for aliases and redirects
> --------------------------------------------------------
> This issue has been briefly discussed in NUTCH-353. Inlink information
> should be "merged" so that all link information from all "aliases" is
> aggregated, so that it points to a selected canonical target URL.

We should also merge their scores. If example.com (with score 4.0) is an
alias for www.example.com (with score 8.0), the selected URL (which I
think, as I said before, should be www.example.com) should end up with
the score 12.0.
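As a small illustration of that merging, here is a hypothetical sketch (not an existing Nutch API; names and numbers are made up) that folds each alias's score into its canonical URL:

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: fold alias scores into their canonical URL's score. */
public class ScoreMerger {

  /**
   * Given per-URL scores and an alias -> canonical mapping, sum every
   * alias's score into its canonical URL. URLs without an alias entry
   * are treated as their own canonical form.
   */
  static Map<String, Float> mergeScores(Map<String, Float> scores,
                                        Map<String, String> aliasOf) {
    Map<String, Float> merged = new HashMap<>();
    for (Map.Entry<String, Float> e : scores.entrySet()) {
      // Resolve the URL to its canonical form (identity if not an alias).
      String canonical = aliasOf.getOrDefault(e.getKey(), e.getKey());
      merged.merge(canonical, e.getValue(), Float::sum);
    }
    return merged;
  }

  public static void main(String[] args) {
    Map<String, Float> scores = Map.of(
        "http://example.com/", 4.0f,
        "http://www.example.com/", 8.0f);
    Map<String, String> aliasOf = Map.of(
        "http://example.com/", "http://www.example.com/");

    // The canonical URL ends up with 4.0 + 8.0 = 12.0.
    System.out.println(mergeScores(scores, aliasOf));
  }
}
```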
We may not want to do this for aliases in different domains, but I think
we should definitely do it if two URLs with the same content are under
the same domain (like example.com).

> See also above sample queries from Google.
>
> B. Design and implementation
> ============================
>
> In order to select the correct "canonical" URL at each stage in
> redirection handling we should keep the accumulated "redirection path",
> which includes source URLs and redirection methods (temporary/permanent,
> protocol or content-level redirect, redirect delay). This way, when we
> arrive at the final page in the redirection path, we should be able to
> select the canonical URL.
>
> We should also specify which intermediate URL we accept as the current
> "canonical" URL in case we haven't yet reached the end of redirections
> (e.g. when we don't follow redirects immediately, but only record them
> to be used in the next cycle).
>
> We should introduce an "alias" status in CrawlDb and LinkDb, which
> indicates that a given URL is a non-canonical alias of another URL. In
> CrawlDb, we should copy all accumulated metadata and put it into the
> target canonical CrawlDatum. In LinkDb, we should merge all inlinks
> pointing to non-canonical URLs so that they are assigned to the
> canonical URL. In both cases we should still keep the non-canonical URLs
> in CrawlDb and LinkDb - however, we could decide not to keep any of the
> metadata / inlinks there, just an "alias" flag and a pointer to the
> canonical URL where all aggregated data is stored. CrawlDb and
> LinkDbReader may or may not hide this fact from their users - I think it
> would be more efficient if users of this API got the final aggregated
> data right away, perhaps with an indicator that it was obtained using a
> non-canonical URL ...
>
> Regarding Lucene indexes - we could either duplicate all data for each
> non-canonical URL, i.e. create as many full-blown Lucene documents as
> there are aliases, or we could create special "redirect" documents that
> would point to a URL which contains the full data ...

We can avoid doing both. Let's assume A redirects to B, C also redirects
to B, and B redirects to D. After the fetch/parse/updatedb cycle that
processes D, we would probably have enough data to choose the 'canonical
URL' (let's assume that the canonical is B). Then, during the Indexer's
reduce, we can just index the parse text and parse data (and whatever
else) of D under URL B, since we won't index B (or A or C) as itself (it
doesn't have any useful content, after all).

> That's it for now ... Any comments or suggestions to the above are
> welcome!

Andrzej, have you written any code? I would suggest that we open a JIRA
issue and get some code in (no matter how half-baked it is) as soon as
we can.

> --
> Best regards,
> Andrzej Bialecki <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com

--
Doğacan Güney
