[
https://issues.apache.org/jira/browse/NUTCH-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel updated NUTCH-2365:
-----------------------------------
Summary: HTTP Redirects to SubDomains don't get crawled if
db.ignore.external.links.mode == byDomain (was: HTTP Redirects to SubDomains
don't get crawled if )
> HTTP Redirects to SubDomains don't get crawled if
> db.ignore.external.links.mode == byDomain
> -------------------------------------------------------------------------------------------
>
> Key: NUTCH-2365
> URL: https://issues.apache.org/jira/browse/NUTCH-2365
> Project: Nutch
> Issue Type: Bug
> Components: fetcher
> Affects Versions: 1.12
> Environment: Fedora 25
> Reporter: Sriram Nookala
> Assignee: Sebastian Nagel
> Fix For: 1.14
>
>
> Crawling http://www.mercenarytrader.com, which redirects to
> https://members.mercenarytrader.com, fails: Nutch does not follow the
> redirect when 'db.ignore.external.links' is set to 'true', even though
> 'db.ignore.external.links.mode' is set to 'byDomain'.
> The bug is in FetcherThread, where the redirect source and target are
> compared by host and not by domain:
>   String origHost = new URL(urlString).getHost().toLowerCase();
>   String newHost = new URL(newUrl).getHost().toLowerCase();
>   if (ignoreExternalLinks) {
>     if (!origHost.equals(newHost)) {
>       if (LOG.isDebugEnabled()) {
>         LOG.debug(" - ignoring redirect " + redirType + " from "
>             + urlString + " to " + newUrl
>             + " because external links are ignored");
>       }
>       return null;
>     }
>   }
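
The comparison the report asks for can be sketched as follows. This is only an illustration under stated assumptions, not the patch applied to FetcherThread: the class RedirectScopeSketch and the helpers naiveDomain and isExternalRedirect are hypothetical names, and naiveDomain simply keeps the last two host labels, so it gives wrong answers for suffixes such as co.uk and for IP addresses. A real fix would more likely rely on Nutch's own domain utilities (for example URLUtil.getDomainName). Any mode other than 'byDomain' is treated here as a plain host comparison.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Locale;

final class RedirectScopeSketch {

  /**
   * Hypothetical stand-in for a real domain lookup: keeps only the last two
   * labels of the host name. Not public-suffix aware.
   */
  static String naiveDomain(String host) {
    String h = host.toLowerCase(Locale.ROOT);
    String[] labels = h.split("\\.");
    if (labels.length <= 2) {
      return h;
    }
    return labels[labels.length - 2] + "." + labels[labels.length - 1];
  }

  /**
   * Returns true if a redirect from fromUrl to toUrl leaves the allowed scope.
   * With mode "byDomain" the (naive) domains are compared; otherwise the
   * full host names are compared, as in the snippet quoted in the report.
   */
  static boolean isExternalRedirect(String fromUrl, String toUrl, String mode)
      throws MalformedURLException {
    String origHost = new URL(fromUrl).getHost().toLowerCase(Locale.ROOT);
    String newHost = new URL(toUrl).getHost().toLowerCase(Locale.ROOT);
    if ("byDomain".equals(mode)) {
      return !naiveDomain(origHost).equals(naiveDomain(newHost));
    }
    return !origHost.equals(newHost);
  }

  public static void main(String[] args) throws MalformedURLException {
    String from = "http://www.mercenarytrader.com";
    String to = "https://members.mercenarytrader.com";
    System.out.println("byHost external:   " + isExternalRedirect(from, to, "byHost"));
    System.out.println("byDomain external: " + isExternalRedirect(from, to, "byDomain"));
  }
}
```

With the URLs from the report, the host comparison flags the redirect as external, while the domain comparison keeps it in scope, which is the behaviour expected for 'byDomain'.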
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)