ASF GitHub Bot commented on NUTCH-2527:

sebastian-nagel opened a new pull request #292: NUTCH-2527 URL filter: provide 
rules to exclude localhost and private address spaces
URL: https://github.com/apache/nutch/pull/292
   - provide rules for urlfilter-regex to exclude localhost, loop-back and 
private IP addresses
   - additional rules are not active by default to allow test crawls of content 
hosted locally

> URL filter: provide rules to exclude localhost and private address spaces
> -------------------------------------------------------------------------
>                 Key: NUTCH-2527
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2527
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 2.3.1, 1.14
>            Reporter: Sebastian Nagel
>            Priority: Minor
>             Fix For: 2.4, 1.15
> While checking the log files of a large web crawl, I've found hundreds of 
> (luckily failed) requests for local or private content:
> {noformat}
> 2018-02-18 04:48:34,022 INFO [FetcherThread] 
> org.apache.nutch.fetcher.Fetcher: fetching ...
> 2018-02-18 04:48:34,022 INFO [FetcherThread] org.apache.nutch.fetcher.Fetcher: 
> fetch of failed with: java.net.ConnectException: 
> Connection refused (Connection refused)
> {noformat}
> URLs pointing to localhost, loop-back addresses, or private address spaces 
> should be blocked in a wider web crawl where links are not controlled. 
> Otherwise, links or redirects pointing to web interfaces of services running 
> on the crawling machines (e.g., HDFS, Hadoop YARN) could leak information.
> Of course, this must be optional. For testing it's quite common to crawl your 
> local machine.
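
The check described above can be sketched roughly as follows. This is not the rule set from the pull request, which adds regular-expression rules for urlfilter-regex. It is only an illustration of the same idea using Python's standard `ipaddress` module, and the function name `is_local_or_private` is hypothetical. It assumes the host is given literally in the URL and does no DNS lookup, so a public host name that resolves to a private address would not be caught.

```python
import ipaddress
from urllib.parse import urlsplit


def is_local_or_private(url: str) -> bool:
    """Return True if the URL's host is localhost, loop-back, or private.

    Hypothetical helper for illustration only; not part of Nutch.
    """
    host = urlsplit(url).hostname  # lower-cased, IPv6 brackets removed
    if host is None:
        return False
    host = host.rstrip(".")
    # Host names that conventionally refer to the local machine.
    if host == "localhost" or host.endswith(".localhost"):
        return True
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # An ordinary host name: no DNS lookup is attempted here.
        return False
    # Loop-back (127.0.0.0/8, ::1), private (e.g. 10.0.0.0/8,
    # 172.16.0.0/12, 192.168.0.0/16) and link-local addresses.
    return addr.is_loopback or addr.is_private or addr.is_link_local
```

A crawler could apply such a check both to extracted links and to redirect targets, and keep it behind a configuration switch so that test crawls of locally hosted content remain possible.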
