Hello all,
I just completed a crawl with Nutch and wanted to test the searching. When I first created the merged index, I had forgotten to run dedup. However, after I ran dedup and re-merged the index, I discovered that a page with a high rank had effectively been deleted. I think the problem may lie in how Nutch handles redirects.
When Nutch attempts to fetch a URL that replies with a redirect, Nutch will follow the redirect and download the page. However, that content is then credited to the original URL and not the URL that the content was actually downloaded from. Consider the example where we have the true URL (www.example.com) as one of our seed URLs. Later we crawl a URL that redirects to www.example.com (www.somewackysite.com/?redir=18903). The content gets associated with www.somewackysite.com/?redir=18903. When we run dedup, it finds the duplicate content hashes and deletes the entry for www.example.com, because that page was fetched prior to www.somewackysite.com/?redir=18903. The content is still available for searching (under the redirecting URL), but the valuable anchor text for links to www.example.com is lost.
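To make the dedup effect concrete, here is a minimal standalone sketch (not Nutch's actual DeleteDuplicates code, just an illustration of dedup-by-content-hash that keeps the most recently fetched entry, which is the behavior described above):

    import java.util.HashMap;
    import java.util.Map;

    // Illustration only: when two index entries share a content hash,
    // the entry fetched earlier is dropped.
    public class DedupSketch {

        static class Entry {
            final String url;
            final String contentHash;
            final long fetchTime;

            Entry(String url, String contentHash, long fetchTime) {
                this.url = url;
                this.contentHash = contentHash;
                this.fetchTime = fetchTime;
            }
        }

        public static void main(String[] args) {
            Entry[] index = {
                new Entry("http://www.example.com", "abc123", 1000L),
                new Entry("http://www.somewackysite.com/?redir=18903", "abc123", 2000L)
            };

            // Keep only the latest-fetched entry per content hash.
            Map<String, Entry> survivors = new HashMap<>();
            for (Entry e : index) {
                Entry kept = survivors.get(e.contentHash);
                if (kept == null || e.fetchTime > kept.fetchTime) {
                    survivors.put(e.contentHash, e); // the earlier fetch (www.example.com) is discarded
                }
            }

            survivors.values().forEach(e -> System.out.println("kept: " + e.url));
            // prints: kept: http://www.somewackysite.com/?redir=18903
        }
    }

The anchor text pointing at www.example.com is attached to the deleted entry, which is exactly what gets lost.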
The obvious hack is to just ignore any URLs that redirect (set http.redirect.max to 0). However, this may be an undesirable restriction.
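For completeness, that hack would just be an override in nutch-site.xml (sketched here using the standard property format; http.redirect.max is an existing Nutch property):

    <!-- Don't follow any redirects during fetching. -->
    <property>
      <name>http.redirect.max</name>
      <value>0</value>
    </property>

The obvious downside is that we then never see the content behind legitimately redirected URLs.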
Another possible solution would be to follow redirects until they end (or until we reach http.redirect.max), but instead of crediting the redirecting URL with the content of the page it redirects to, have Nutch generate synthetic content for the redirecting URL. For our example, the content stored for www.somewackysite.com/?redir=18903 would be something like this:
<html><body><a href="http://www.example.com"></a></body></html>
Therefore, after the parsing step, the parse_data of the redirecting URL would have a link to www.example.com, but parse_text would be empty.
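A rough sketch of generating that synthetic content (hypothetical helper, not an existing Nutch API) could look like this:

    // Hypothetical helper: instead of storing the downloaded bytes under the
    // redirecting URL, store a tiny synthetic page whose only content is an
    // outlink to the redirect target. The parser would then record the link in
    // parse_data while parse_text stays empty.
    public class RedirectContent {

        /** Build the synthetic HTML for a URL that redirected to targetUrl. */
        public static byte[] forRedirect(String targetUrl) {
            String html = "<html><body><a href=\"" + targetUrl + "\"></a></body></html>";
            return html.getBytes(java.nio.charset.StandardCharsets.UTF_8);
        }

        public static void main(String[] args) {
            byte[] content = forRedirect("http://www.example.com");
            System.out.println(new String(content, java.nio.charset.StandardCharsets.UTF_8));
            // prints: <html><body><a href="http://www.example.com"></a></body></html>
        }
    }

That way the real content is only ever credited to the URL it was actually fetched from, and the redirecting URL simply contributes a link (and its anchor text) to the target.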
Does that make sense? Does anybody have any brighter ideas on how to rid Nutch of this problem?
Thanks,
Luke Baker
