Sriram Nookala created NUTCH-2365:
-------------------------------------

             Summary: HTTP Redirects to SubDomains don't get crawled
                 Key: NUTCH-2365
                 URL: https://issues.apache.org/jira/browse/NUTCH-2365
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 1.12
         Environment: Fedora 25
            Reporter: Sriram Nookala


Crawling http://www.mercenarytrader.com, which redirects to 
https://members.mercenarytrader.com, fails: Nutch does not follow the 
redirect even though 'db.ignore.external.links' is set to 'true' and 
'db.ignore.external.links.mode' is set to 'byDomain'. 
The bug is in FetcherThread, where the redirect check compares the two URLs 
by host and not by domain, so a redirect to a subdomain of the same domain 
is treated as an external link:

    String origHost = new URL(urlString).getHost().toLowerCase();
    String newHost = new URL(newUrl).getHost().toLowerCase();
    if (ignoreExternalLinks) {
      if (!origHost.equals(newHost)) {
        if (LOG.isDebugEnabled()) {
          LOG.debug(" - ignoring redirect " + redirType + " from "
              + urlString + " to " + newUrl
              + " because external links are ignored");
        }
        return null;
      }
    }
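
For illustration only, here is a minimal, self-contained sketch of what a 
mode-aware redirect check could look like. It is not the actual Nutch fix. 
The class RedirectDomainCheck and the methods naiveDomain and 
isExternalRedirect are hypothetical names, and naiveDomain simply keeps the 
last two labels of the host, which is wrong for suffixes such as co.uk. A 
real patch would presumably rely on Nutch's own domain utilities instead.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper, not part of Nutch.
final class RedirectDomainCheck {

  // Assumption: approximate the "domain" as the last two host labels.
  // This ignores multi-label public suffixes (for example co.uk).
  static String naiveDomain(String host) {
    String lower = host.toLowerCase();
    String[] labels = lower.split("\\.");
    int n = labels.length;
    if (n <= 2) {
      return lower;
    }
    return labels[n - 2] + "." + labels[n - 1];
  }

  // Returns true when the redirect target should be treated as external.
  // mode "byDomain" compares domains; any other value compares hosts,
  // which is what the snippet quoted in this report always does.
  static boolean isExternalRedirect(String fromUrl, String toUrl, String mode) {
    try {
      String origHost = new URL(fromUrl).getHost().toLowerCase();
      String newHost = new URL(toUrl).getHost().toLowerCase();
      if ("byDomain".equals(mode)) {
        return !naiveDomain(origHost).equals(naiveDomain(newHost));
      }
      return !origHost.equals(newHost);
    } catch (MalformedURLException e) {
      throw new IllegalArgumentException("Bad URL in redirect check", e);
    }
  }

  public static void main(String[] args) {
    String from = "http://www.mercenarytrader.com";
    String to = "https://members.mercenarytrader.com";
    System.out.println("byHost external:   " + isExternalRedirect(from, to, "byHost"));
    System.out.println("byDomain external: " + isExternalRedirect(from, to, "byDomain"));
  }
}
```

With the URLs from this report, the host comparison flags the redirect as 
external, while the domain comparison does not, which is the behaviour 
expected when 'db.ignore.external.links.mode' is 'byDomain'.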



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
