[ 
https://issues.apache.org/jira/browse/NUTCH-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16453811#comment-16453811
 ] 

ASF GitHub Bot commented on NUTCH-2527:
---------------------------------------

sebastian-nagel closed pull request #292: NUTCH-2527 URL filter: provide rules 
to exclude localhost and private address spaces
URL: https://github.com/apache/nutch/pull/292
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:


diff --git a/conf/regex-urlfilter.txt.template b/conf/regex-urlfilter.txt.template
index bcf9c87d7..b060cbb7b 100644
--- a/conf/regex-urlfilter.txt.template
+++ b/conf/regex-urlfilter.txt.template
@@ -16,6 +16,7 @@
 
 # The default url filter.
 # Better for whole-internet crawling.
+# Please comment/uncomment rules to your needs.
 
 # Each non-comment, non-blank line contains a regular expression
 # prefixed by '+' or '-'.  The first matching pattern in the file
@@ -35,5 +36,26 @@
 # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
 -.*(/[^/]+)/[^/]+\1/[^/]+\1/
 
+# For safe web crawling if crawled content is exposed in a public search interface:
+# - exclude private network addresses to avoid that information
+#   can be leaked by placing links pointing to web interfaces of services
+#   running on the crawling machines (e.g., HDFS, Hadoop YARN)
+# - in addition, file:// URLs should be either excluded by a URL filter rule
+#   or ignored by not enabling protocol-file
+#
+# - exclude localhost and loop-back addresses
+#     http://localhost:8080
+#     http://127.0.0.1/ .. http://127.255.255.255/
+#     http://[::1]/
+#-^https?://(?:localhost|127(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){3}|\[::1\])(?::\d+)?(?:/|$)
+#
+# - exclude private IP address spaces
+#     10.0.0.0/8
+#-^https?://(?:10(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){3})(?::\d+)?(?:/|$)
+#     192.168.0.0/16
+#-^https?://(?:192\.168(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){2})(?::\d+)?(?:/|$)
+#     172.16.0.0/12
+#-^https?://(?:172\.(?:1[6789]|2[0-9]|3[01])(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){2})(?::\d+)?(?:/|$)
+
 # accept anything else
 +.
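
The commented-out rules in the diff above can be tried out before enabling them. The following is only a rough sketch in Python, not Nutch's own code (Nutch evaluates these patterns with java.util.regex). The patterns are copied from the diff with the leading '#' removed; the helper name `accepts` is made up for illustration. It assumes first-match-wins evaluation as described in the template, and that a URL matching no rule is ignored.

```python
import re

# Rules copied from the diff, uncommented. '-' rejects, '+' accepts.
RULES = [
    # localhost and loop-back addresses
    ("-", r"^https?://(?:localhost|127(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){3}|\[::1\])(?::\d+)?(?:/|$)"),
    # 10.0.0.0/8
    ("-", r"^https?://(?:10(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){3})(?::\d+)?(?:/|$)"),
    # 192.168.0.0/16
    ("-", r"^https?://(?:192\.168(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){2})(?::\d+)?(?:/|$)"),
    # 172.16.0.0/12
    ("-", r"^https?://(?:172\.(?:1[6789]|2[0-9]|3[01])(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){2})(?::\d+)?(?:/|$)"),
    # accept anything else
    ("+", r"."),
]

COMPILED = [(sign, re.compile(pattern)) for sign, pattern in RULES]


def accepts(url):
    """Return True if the first matching rule is a '+' rule (hypothetical helper)."""
    for sign, pattern in COMPILED:
        if pattern.search(url):
            return sign == "+"
    # Assumption: a URL that matches no rule at all is ignored.
    return False


if __name__ == "__main__":
    for url in [
        "http://localhost:8080",
        "http://127.0.0.42/",
        "http://[::1]/",
        "http://10.1.2.3/path",
        "http://192.168.1.1:8088/cluster",
        "http://172.16.0.1/",
        "http://172.32.0.1/",
        "http://localhost.example.com/",
        "https://nutch.apache.org/",
    ]:
        print("accept" if accepts(url) else "reject", url)
```

Note that the rules match on the literal host part of the URL only. A public host name that resolves to a private address is not caught by them.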


 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> URL filter: provide rules to exclude localhost and private address spaces
> -------------------------------------------------------------------------
>
>                 Key: NUTCH-2527
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2527
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 2.3.1, 1.14
>            Reporter: Sebastian Nagel
>            Priority: Minor
>             Fix For: 2.4, 1.15
>
>
> While checking the log files of a large web crawl, I've found hundreds of 
> (luckily failed) requests for local or private content:
> {noformat}
> 2018-02-18 04:48:34,022 INFO [FetcherThread] org.apache.nutch.fetcher.Fetcher: fetching http://127.0.0.42/ ...
> 2018-02-18 04:48:34,022 INFO [FetcherThread] org.apache.nutch.fetcher.Fetcher: fetch of http://127.0.0.42/ failed with: java.net.ConnectException: Connection refused (Connection refused)
> {noformat}
> URLs pointing to localhost, loop-back addresses, or private address spaces 
> should be blocked in a wider web crawl where links are not controlled. 
> Otherwise, information could leak through links or redirects pointing to web 
> interfaces of services running on the crawling machines (e.g., HDFS, Hadoop 
> YARN).
> Of course, this must be optional. For testing it's quite common to crawl your 
> local machine.
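
To find out whether an existing crawl was affected, fetcher log lines like the ones quoted above can be scanned for local or private targets. The following is a minimal sketch only, assuming the "fetching <URL>" log format shown in the issue; the function names are hypothetical. It uses Python's ipaddress module rather than the regex rules, so it flags a somewhat broader set of addresses (anything ipaddress considers loop-back or private, not only the ranges listed in the filter rules).

```python
import ipaddress
import re
from urllib.parse import urlsplit

# Assumption: fetch attempts are logged as "... fetching <URL> ..."
FETCH_LINE = re.compile(r"fetching (\S+)")


def is_local_or_private(url):
    """True if the URL's host is localhost or a loop-back/private IP literal."""
    host = urlsplit(url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # A regular host name; it is not resolved here.
        return False
    return addr.is_loopback or addr.is_private


def suspicious_fetches(lines):
    """Collect URLs from fetcher log lines that point to local/private hosts."""
    found = []
    for line in lines:
        match = FETCH_LINE.search(line)
        if match and is_local_or_private(match.group(1)):
            found.append(match.group(1))
    return found


if __name__ == "__main__":
    sample = [
        "2018-02-18 04:48:34,022 INFO [FetcherThread] "
        "org.apache.nutch.fetcher.Fetcher: fetching http://127.0.0.42/ ...",
        "2018-02-18 04:48:35,100 INFO [FetcherThread] "
        "org.apache.nutch.fetcher.Fetcher: fetching https://nutch.apache.org/ ...",
    ]
    for url in suspicious_fetches(sample):
        print(url)
```

The second sample line is invented for the example; only the first one is taken from the log excerpt in the issue.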



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
