[ 
https://issues.apache.org/jira/browse/NUTCH-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2527.
------------------------------------
    Resolution: Implemented

Committed to 1.x 
([1475fa3|https://github.com/apache/nutch/commit/1475fa3320897493124ab4339ee4728ac9a876ea])
 and 2.x 
([d62ece0|https://github.com/apache/nutch/commit/d62ece00469fd6b2012418062602246f090e10c5]).

> URL filter: provide rules to exclude localhost and private address spaces
> -------------------------------------------------------------------------
>
>                 Key: NUTCH-2527
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2527
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 2.3.1, 1.14
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Minor
>             Fix For: 2.4, 1.15
>
>
> While checking the log files of a large web crawl, I've found hundreds of 
> (luckily failed) requests for local or private content:
> {noformat}
> 2018-02-18 04:48:34,022 INFO [FetcherThread] 
> org.apache.nutch.fetcher.Fetcher: fetching http://127.0.0.42/ ...
> 2018-02-18 04:48:34,022 INFO [FetcherThread] org.apache.nutch.fetcher.Fetcher: 
> fetch of http://127.0.0.42/ failed with: java.net.ConnectException: 
> Connection refused (Connection refused)
> {noformat}
> URLs pointing to localhost, loopback addresses, or private address spaces 
> should be blocked in a wider web crawl where links are not controlled. 
> Otherwise, links or redirects pointing to the web interfaces of services 
> running on the crawling machines (e.g., HDFS, Hadoop YARN) could leak 
> information.
> Of course, this must be optional. For testing it's quite common to crawl your 
> local machine.
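
For illustration only, the kind of check described above could be sketched in Java roughly as follows. This is not the code or the rule set from the commits linked above: the class name LocalAddressCheck, the method isLocalOrPrivate, and the exact regular expression are assumptions made for this sketch. It matches the host part of a URL against localhost, the IPv4 loopback, private (RFC 1918) and link-local ranges, and the IPv6 loopback literal. It does not resolve DNS, so a host name that merely resolves to a private address is not caught, and URLs that cannot be parsed are treated as not local.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of Apache Nutch. */
final class LocalAddressCheck {

  // Host names and IP literals that should not be fetched in a wide crawl.
  // The ranges below are an assumption of this sketch, not the committed rules.
  private static final Pattern LOCAL_OR_PRIVATE = Pattern.compile(
      "^(?:localhost"
      + "|127(?:\\.\\d{1,3}){3}"                          // loopback 127.0.0.0/8
      + "|10(?:\\.\\d{1,3}){3}"                           // private 10.0.0.0/8
      + "|172\\.(?:1[6-9]|2\\d|3[01])(?:\\.\\d{1,3}){2}"  // private 172.16.0.0/12
      + "|192\\.168(?:\\.\\d{1,3}){2}"                    // private 192.168.0.0/16
      + "|169\\.254(?:\\.\\d{1,3}){2}"                    // link-local 169.254.0.0/16
      + "|\\[::1\\]"                                      // IPv6 loopback literal
      + ")$",
      Pattern.CASE_INSENSITIVE);

  private LocalAddressCheck() {
  }

  /**
   * Returns true if the URL's host is localhost or a literal address in a
   * loopback, private, or link-local range. No DNS lookup is performed.
   */
  static boolean isLocalOrPrivate(String url) {
    try {
      String host = new URI(url).getHost();
      return host != null && LOCAL_OR_PRIVATE.matcher(host).matches();
    } catch (URISyntaxException e) {
      // Unparsable URLs are left to other filters in this sketch.
      return false;
    }
  }
}
```

Since the issue asks for the behavior to be optional, a caller would presumably apply such a check only when a configuration switch enables it, so that test crawls of a local machine keep working.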



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
