[ https://issues.apache.org/jira/browse/NUTCH-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel resolved NUTCH-2527.
------------------------------------
Resolution: Implemented
Committed to 1.x
([1475fa3|https://github.com/apache/nutch/commit/1475fa3320897493124ab4339ee4728ac9a876ea])
and 2.x
([d62ece0|https://github.com/apache/nutch/commit/d62ece00469fd6b2012418062602246f090e10c5]).
> URL filter: provide rules to exclude localhost and private address spaces
> -------------------------------------------------------------------------
>
> Key: NUTCH-2527
> URL: https://issues.apache.org/jira/browse/NUTCH-2527
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 2.3.1, 1.14
> Reporter: Sebastian Nagel
> Assignee: Sebastian Nagel
> Priority: Minor
> Fix For: 2.4, 1.15
>
>
> While checking the log files of a large web crawl, I found hundreds of
> (luckily failed) requests for local or private content:
> {noformat}
> 2018-02-18 04:48:34,022 INFO [FetcherThread]
> org.apache.nutch.fetcher.Fetcher: fetching http://127.0.0.42/ ...
> 2018-02-18 04:48:34,022 INFO [FetcherThread] org.apache.nutch.fetcher.Fetcher:
> fetch of http://127.0.0.42/ failed with: java.net.ConnectException:
> Connection refused (Connection refused)
> {noformat}
> URLs pointing to localhost, loopback addresses, or private address spaces
> should be blocked in a wider web crawl where links are not controlled.
> Otherwise, links or redirects pointing to web interfaces of services running
> on the crawling machines (e.g., HDFS, Hadoop YARN) could leak information.
> Of course, this must be optional. For testing it's quite common to crawl your
> local machine.
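
For illustration only: the sketch below is not the rule set committed in the
changes referenced above. It is a minimal Python example of the kind of check
the description asks for, assuming that only literal IP addresses and the
"localhost" name are inspected and that host names are never resolved via DNS.
The function name is_local_or_private is hypothetical.

```python
import ipaddress
from urllib.parse import urlsplit


def is_local_or_private(url: str) -> bool:
    """Return True if the URL's host is localhost or a literal loopback,
    private, or link-local IP address. Host names are NOT resolved."""
    host = urlsplit(url).hostname  # lower-cased, IPv6 brackets removed
    if not host:
        return False
    host = host.rstrip(".")  # tolerate a trailing dot, e.g. "localhost."
    if host == "localhost" or host.endswith(".localhost"):
        return True
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal: an ordinary host name, left to other filters.
        return False
    return addr.is_loopback or addr.is_private or addr.is_link_local


# Example URLs (illustrative, apart from the address seen in the log above).
for candidate in ("http://127.0.0.42/", "http://localhost:8088/",
                  "http://[::1]/", "https://nutch.apache.org/"):
    verdict = "exclude" if is_local_or_private(candidate) else "keep"
    print(verdict, candidate)
```

Since the issue states that the exclusion must be optional, a caller would
apply such a check only when a corresponding setting is enabled, so that test
crawls of a local machine keep working.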