[ https://issues.apache.org/jira/browse/NUTCH-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel updated NUTCH-2365:
-----------------------------------
    Summary: HTTP Redirects to SubDomains don't get crawled if  (was: HTTP Redirects to SubDomains don't get crawled)

> HTTP Redirects to SubDomains don't get crawled if
> --------------------------------------------------
>
>                 Key: NUTCH-2365
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2365
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 1.12
>         Environment: Fedora 25
>            Reporter: Sriram Nookala
>            Assignee: Sebastian Nagel
>             Fix For: 1.14
>
>
> When crawling http://www.mercenarytrader.com, which redirects to
> https://members.mercenarytrader.com, Nutch does not follow the redirect even
> though 'db.ignore.external.links' is set to 'true' and
> 'db.ignore.external.links.mode' is set to 'byDomain'.
> The bug is in FetcherThread, where the comparison is by host and not by
> domain:
>
>     String origHost = new URL(urlString).getHost().toLowerCase();
>     String newHost = new URL(newUrl).getHost().toLowerCase();
>     if (ignoreExternalLinks) {
>       if (!origHost.equals(newHost)) {
>         if (LOG.isDebugEnabled()) {
>           LOG.debug(" - ignoring redirect " + redirType + " from "
>               + urlString + " to " + newUrl
>               + " because external links are ignored");
>         }
>         return null;
>       }
>     }
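
For illustration only, here is a minimal standalone sketch of the difference between the two comparison modes described in the report. It is not the patch attached to this issue. The class RedirectFilterSketch and the helpers naiveDomain and isExternalRedirect are hypothetical names, and the "last two host labels" rule is a deliberate simplification: it is wrong for suffixes such as co.uk and for IP addresses. Nutch's own URLUtil.getDomainName is, as far as I know, the suffix-aware helper a real fix would use instead.

```java
import java.net.URI;

// Hypothetical sketch, not Nutch code: contrasts byHost and byDomain checks.
final class RedirectFilterSketch {

  // Naive stand-in for a registered-domain lookup: keeps the last two
  // host labels. Assumption: good enough for hosts like *.example.com only.
  static String naiveDomain(String host) {
    String h = host.toLowerCase();
    int last = h.lastIndexOf('.');
    if (last <= 0) {
      return h; // single-label host such as "localhost"
    }
    int prev = h.lastIndexOf('.', last - 1);
    return prev < 0 ? h : h.substring(prev + 1);
  }

  // Returns true if the redirect target should be treated as external.
  // byDomain = true compares domains; false compares full host names,
  // which is the behaviour the report describes in FetcherThread.
  static boolean isExternalRedirect(String fromUrl, String toUrl,
      boolean byDomain) {
    String origHost = URI.create(fromUrl).getHost().toLowerCase();
    String newHost = URI.create(toUrl).getHost().toLowerCase();
    if (byDomain) {
      return !naiveDomain(origHost).equals(naiveDomain(newHost));
    }
    return !origHost.equals(newHost);
  }

  public static void main(String[] args) {
    String from = "http://www.mercenarytrader.com";
    String to = "https://members.mercenarytrader.com";
    System.out.println("byHost external:   "
        + isExternalRedirect(from, to, false));
    System.out.println("byDomain external: "
        + isExternalRedirect(from, to, true));
  }
}
```

With the URLs from the report, the host comparison treats the redirect as external (so it is dropped), while the domain comparison treats it as internal, which is what 'byDomain' mode is expected to do.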