Sriram Nookala created NUTCH-2365:
-------------------------------------
Summary: HTTP Redirects to SubDomains don't get crawled
Key: NUTCH-2365
URL: https://issues.apache.org/jira/browse/NUTCH-2365
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 1.12
Environment: Fedora 25
Reporter: Sriram Nookala
Crawling the domain http://www.mercenarytrader.com, which redirects to
https://members.mercenarytrader.com, fails: Nutch does not follow the
redirect even though 'db.ignore.external.links' is set to 'true' and
'db.ignore.external.links.mode' is set to 'byDomain'.
The bug is in FetcherThread, where the redirect check compares by host and
not by domain:
> String origHost = new URL(urlString).getHost().toLowerCase();
> String newHost = new URL(newUrl).getHost().toLowerCase();
> if (ignoreExternalLinks) {
>   if (!origHost.equals(newHost)) {
>     if (LOG.isDebugEnabled()) {
>       LOG.debug(" - ignoring redirect " + redirType + " from "
>           + urlString + " to " + newUrl
>           + " because external links are ignored");
>     }
>     return null;
>   }
> }
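
For illustration only, here is a minimal, self-contained sketch of what a
mode-aware check could look like. It is not the actual Nutch patch. The class
RedirectScopeSketch and the helpers naiveDomain and isExternal are
hypothetical names, and the "last two labels" domain guess is a deliberate
simplification that is wrong for suffixes such as co.uk. A real fix would
presumably rely on Nutch's own URL utilities instead. The sketch also parses
with java.net.URI rather than java.net.URL, purely to avoid checked
exceptions.

```java
import java.net.URI;

// Hypothetical sketch, not Nutch code: decide whether a redirect target
// counts as "external" under a host-based or domain-based comparison.
final class RedirectScopeSketch {

  // Assumption: approximate the registered domain by the last two labels
  // of the host. This is naive and ignores the public suffix list.
  static String naiveDomain(String host) {
    String lower = host.toLowerCase();
    String[] labels = lower.split("\\.");
    int n = labels.length;
    if (n <= 2) {
      return lower;
    }
    return labels[n - 2] + "." + labels[n - 1];
  }

  // Extracts the lower-cased host, rejecting URLs that have none.
  static String hostOf(String url) {
    String host = URI.create(url).getHost();
    if (host == null) {
      throw new IllegalArgumentException("no host in URL: " + url);
    }
    return host.toLowerCase();
  }

  // Returns true when the redirect from fromUrl to toUrl should be ignored.
  // With mode "byDomain" the naive domains are compared; any other mode
  // falls back to the exact host comparison quoted above.
  static boolean isExternal(String fromUrl, String toUrl, String mode) {
    String origHost = hostOf(fromUrl);
    String newHost = hostOf(toUrl);
    if ("byDomain".equals(mode)) {
      return !naiveDomain(origHost).equals(naiveDomain(newHost));
    }
    return !origHost.equals(newHost);
  }
}
```

Under these assumptions, the redirect from www.mercenarytrader.com to
members.mercenarytrader.com is treated as internal in 'byDomain' mode, while
the plain host comparison still rejects it.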
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)