[jira] [Created] (NUTCH-2365) HTTP Redirects to SubDomains don't get crawled

Sriram Nookala (JIRA) Thu, 09 Mar 2017 07:02:09 -0800

Sriram Nookala created NUTCH-2365:
-------------------------------------

             Summary: HTTP Redirects to SubDomains don't get crawled
                 Key: NUTCH-2365
                 URL: https://issues.apache.org/jira/browse/NUTCH-2365
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 1.12
         Environment: Fedora 25
            Reporter: Sriram Nookala



Crawling http://www.mercenarytrader.com, which redirects to 
https://members.mercenarytrader.com, does not follow the redirect, even 
though 'db.ignore.external.links' is set to 'true' and 
'db.ignore.external.links.mode' is set to 'byDomain'. 
The bug is in FetcherThread, where the redirect check compares the two URLs 
by host rather than by domain, regardless of the configured mode:

      String origHost = new URL(urlString).getHost().toLowerCase();
      String newHost = new URL(newUrl).getHost().toLowerCase();
      if (ignoreExternalLinks) {
        if (!origHost.equals(newHost)) {
          if (LOG.isDebugEnabled()) {
            LOG.debug(" - ignoring redirect " + redirType + " from "
                + urlString + " to " + newUrl
                + " because external links are ignored");
          }
          return null;
        }
      }
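
For illustration, here is a minimal, self-contained sketch of what a mode-aware check could look like. It is not the actual Nutch patch. The class RedirectDomainCheck and the helpers naiveDomain and isInternalRedirect are hypothetical names, and the domain is approximated as the last two labels of the host, which is wrong for suffixes such as "co.uk" and for IP addresses. Nutch ships its own domain helper in org.apache.nutch.util.URLUtil, which a real fix would more likely rely on; it is left out here so the example compiles on its own.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Locale;

class RedirectDomainCheck {

  // Hypothetical helper: approximates the registered domain as the last
  // two labels of the host. This is an assumption for the sketch only;
  // it does not consult a public-suffix list.
  static String naiveDomain(String host) {
    String h = host.toLowerCase(Locale.ROOT);
    String[] labels = h.split("\\.");
    if (labels.length <= 2) {
      return h;
    }
    return labels[labels.length - 2] + "." + labels[labels.length - 1];
  }

  // Returns true if a redirect from fromUrl to toUrl should be treated as
  // internal. With mode "byDomain" the approximated domains are compared;
  // with any other mode the full hosts are compared, as in the snippet
  // quoted above. Unparsable URLs are treated as external.
  static boolean isInternalRedirect(String fromUrl, String toUrl, String mode) {
    try {
      String origHost = new URL(fromUrl).getHost().toLowerCase(Locale.ROOT);
      String newHost = new URL(toUrl).getHost().toLowerCase(Locale.ROOT);
      if ("byDomain".equals(mode)) {
        return naiveDomain(origHost).equals(naiveDomain(newHost));
      }
      return origHost.equals(newHost);
    } catch (MalformedURLException e) {
      return false;
    }
  }
}
```

Under these assumptions, the redirect from http://www.mercenarytrader.com to https://members.mercenarytrader.com counts as internal in 'byDomain' mode and as external when hosts are compared, which matches the behaviour reported here.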



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
