Sebastian Nagel created NUTCH-2527:
--------------------------------------

             Summary: URL filter: provide rules to exclude localhost and 
private address spaces
                 Key: NUTCH-2527
                 URL: https://issues.apache.org/jira/browse/NUTCH-2527
             Project: Nutch
          Issue Type: Improvement
    Affects Versions: 1.14, 2.3.1
            Reporter: Sebastian Nagel
             Fix For: 2.4, 1.15


While checking the log files of a large web crawl, I found hundreds of 
(luckily failed) requests for local or private content:
{noformat}
2018-02-18 04:48:34,022 INFO [FetcherThread] org.apache.nutch.fetcher.Fetcher: 
fetching http://127.0.0.42/ ...
2018-02-18 04:48:34,022 INFO [FetcherThread] org.apache.nutch.fetcher.Fetcher: 
fetch of http://127.0.0.42/ failed with: java.net.ConnectException: Connection 
refused (Connection refused)
{noformat}

URLs pointing to localhost, loop-back addresses, or private address spaces 
should be blocked in a wider web crawl where links are not controlled. 
Otherwise, information could be leaked by links or redirects pointing to web 
interfaces of services running on the crawling machines (e.g., HDFS, Hadoop YARN).
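
As a rough illustration of the kind of check meant here, the sketch below tests whether a URL's host is localhost, an IPv4 loop-back address (127.0.0.0/8), or an address in one of the RFC 1918 private ranges. It is not Nutch code: the class and method names (LocalAddressFilterSketch, isLocalOrPrivate) are hypothetical, and it assumes that only literal host names and IP literals are inspected, with no DNS resolution. Host names that resolve to private addresses, and IPv6 ranges other than ::1, would need extra handling.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

/** Hypothetical sketch, not part of Nutch: detects local or private hosts in URLs. */
final class LocalAddressFilterSketch {

  /** Returns true if the URL's host is localhost, a loop-back address, or an RFC 1918 address. */
  static boolean isLocalOrPrivate(String url) {
    String host;
    try {
      host = new URI(url).getHost();
    } catch (URISyntaxException e) {
      return false; // malformed URLs are left to other filters
    }
    if (host == null) {
      return false;
    }
    host = host.toLowerCase(Locale.ROOT);

    // java.net.URI returns IPv6 literals with their square brackets; strip them.
    if (host.startsWith("[") && host.endsWith("]")) {
      host = host.substring(1, host.length() - 1);
    }
    if (host.equals("localhost") || host.endsWith(".localhost") || host.equals("::1")) {
      return true;
    }

    // Parse a dotted-quad IPv4 literal; anything else is treated as a public host name.
    String[] parts = host.split("\\.");
    if (parts.length != 4) {
      return false;
    }
    int[] octet = new int[4];
    for (int i = 0; i < 4; i++) {
      try {
        octet[i] = Integer.parseInt(parts[i]);
      } catch (NumberFormatException e) {
        return false;
      }
      if (octet[i] < 0 || octet[i] > 255) {
        return false;
      }
    }

    return octet[0] == 127                                          // 127.0.0.0/8 loop-back
        || octet[0] == 10                                           // 10.0.0.0/8
        || (octet[0] == 172 && octet[1] >= 16 && octet[1] <= 31)    // 172.16.0.0/12
        || (octet[0] == 192 && octet[1] == 168);                    // 192.168.0.0/16
  }
}
```

With this sketch, the URL from the log above (http://127.0.0.42/) would be rejected before fetching, while a public host such as https://nutch.apache.org/ would pass.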

Of course, this must be optional: for testing, it's quite common to crawl your 
local machine.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
