This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch 2.x
in repository https://gitbox.apache.org/repos/asf/nutch.git


The following commit(s) were added to refs/heads/2.x by this push:
     new d62ece0  NUTCH-2527 URL filter: provide rules to exclude localhost and private address spaces
d62ece0 is described below

commit d62ece00469fd6b2012418062602246f090e10c5
Author: Sebastian Nagel <sna...@apache.org>
AuthorDate: Thu Apr 26 12:40:57 2018 +0200

    NUTCH-2527 URL filter: provide rules to exclude localhost and private address spaces
---
 conf/regex-urlfilter.txt.template | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/conf/regex-urlfilter.txt.template b/conf/regex-urlfilter.txt.template
index bcf9c87..b060cbb 100644
--- a/conf/regex-urlfilter.txt.template
+++ b/conf/regex-urlfilter.txt.template
@@ -16,6 +16,7 @@
 
 # The default url filter.
 # Better for whole-internet crawling.
+# Please comment/uncomment rules to your needs.
 
 # Each non-comment, non-blank line contains a regular expression
 # prefixed by '+' or '-'.  The first matching pattern in the file
@@ -35,5 +36,26 @@
 # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
 -.*(/[^/]+)/[^/]+\1/[^/]+\1/
 
+# For safe web crawling if crawled content is exposed in a public search interface:
+# - exclude private network addresses to avoid that information
+#   can be leaked by placing links pointing to web interfaces of services
+#   running on the crawling machines (e.g., HDFS, Hadoop YARN)
+# - in addition, file:// URLs should be either excluded by a URL filter rule
+#   or ignored by not enabling protocol-file
+#
+# - exclude localhost and loop-back addresses
+#     http://localhost:8080
+#     http://127.0.0.1/ .. http://127.255.255.255/
+#     http://[::1]/
+#-^https?://(?:localhost|127(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){3}|\[::1\])(?::\d+)?(?:/|$)
+#
+# - exclude private IP address spaces
+#     10.0.0.0/8
+#-^https?://(?:10(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){3})(?::\d+)?(?:/|$)
+#     192.168.0.0/16
+#-^https?://(?:192\.168(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){2})(?::\d+)?(?:/|$)
+#     172.16.0.0/12
+#-^https?://(?:172\.(?:1[6789]|2[0-9]|3[01])(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){2})(?::\d+)?(?:/|$)
+
 # accept anything else
 +.
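
The commented-out exclusion rules above can be checked outside of Nutch before they are uncommented. The following is a minimal sketch, not part of the commit: the class name PrivateAddressRules and the helper isExcluded are hypothetical, and the sketch assumes a rule applies when its pattern is found anywhere in the URL (java.util.regex Matcher.find()). The four patterns are copied from the diff with the leading '-' removed and backslashes doubled for Java string literals.

```java
import java.util.regex.Pattern;

// Hypothetical helper for trying out the exclusion rules from the diff.
public class PrivateAddressRules {

    // Patterns copied from the template; the leading '-' (exclude marker) is dropped.
    static final Pattern[] EXCLUDE_RULES = {
        // localhost and loop-back addresses
        Pattern.compile("^https?://(?:localhost|127(?:\\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){3}|\\[::1\\])(?::\\d+)?(?:/|$)"),
        // 10.0.0.0/8
        Pattern.compile("^https?://(?:10(?:\\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){3})(?::\\d+)?(?:/|$)"),
        // 192.168.0.0/16
        Pattern.compile("^https?://(?:192\\.168(?:\\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){2})(?::\\d+)?(?:/|$)"),
        // 172.16.0.0/12
        Pattern.compile("^https?://(?:172\\.(?:1[6789]|2[0-9]|3[01])(?:\\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){2})(?::\\d+)?(?:/|$)")
    };

    // Returns true if any exclusion rule is found in the URL
    // (assumption: find() semantics, anchored by the '^' in each pattern).
    static boolean isExcluded(String url) {
        for (Pattern rule : EXCLUDE_RULES) {
            if (rule.matcher(url).find()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://localhost:8080",
            "http://127.0.0.1/",
            "http://[::1]/",
            "http://10.1.2.3:50070/",
            "http://192.168.1.1/",
            "http://172.16.0.1/",
            "http://172.32.0.1/",
            "http://localhost.example.org/",
            "http://example.org/"
        };
        for (String url : urls) {
            System.out.println(url + " -> " + (isExcluded(url) ? "excluded" : "accepted"));
        }
    }
}
```

Note that the rules only match literal host names and dotted IPv4 notation as written in the URL. A public host name that resolves to a private address is not caught by these patterns.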

-- 
To stop receiving notification emails like this one, please contact
sna...@apache.org.
