This is an automated email from the ASF dual-hosted git repository.
snagel pushed a commit to branch 2.x
in repository https://gitbox.apache.org/repos/asf/nutch.git
The following commit(s) were added to refs/heads/2.x by this push:
new d62ece0 NUTCH-2527 URL filter: provide rules to exclude localhost and
private address spaces
d62ece0 is described below
commit d62ece00469fd6b2012418062602246f090e10c5
Author: Sebastian Nagel <[email protected]>
AuthorDate: Thu Apr 26 12:40:57 2018 +0200
NUTCH-2527 URL filter: provide rules to exclude localhost and private
address spaces
---
conf/regex-urlfilter.txt.template | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)
diff --git a/conf/regex-urlfilter.txt.template b/conf/regex-urlfilter.txt.template
index bcf9c87..b060cbb 100644
--- a/conf/regex-urlfilter.txt.template
+++ b/conf/regex-urlfilter.txt.template
@@ -16,6 +16,7 @@
# The default url filter.
# Better for whole-internet crawling.
+# Please comment/uncomment rules to your needs.
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
@@ -35,5 +36,26 @@
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
+# For safe web crawling if crawled content is exposed in a public search interface:
+# - exclude private network addresses so that information cannot be
+#   leaked through links pointing to web interfaces of services
+#   running on the crawling machines (e.g., HDFS, Hadoop YARN)
+# - in addition, file:// URLs should be either excluded by a URL filter rule
+# or ignored by not enabling protocol-file
+#
+# - exclude localhost and loop-back addresses
+# http://localhost:8080
+# http://127.0.0.1/ .. http://127.255.255.255/
+# http://[::1]/
+#-^https?://(?:localhost|127(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){3}|\[::1\])(?::\d+)?(?:/|$)
+#
+# - exclude private IP address spaces
+# 10.0.0.0/8
+#-^https?://(?:10(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){3})(?::\d+)?(?:/|$)
+# 192.168.0.0/16
+#-^https?://(?:192\.168(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){2})(?::\d+)?(?:/|$)
+# 172.16.0.0/12
+#-^https?://(?:172\.(?:1[6789]|2[0-9]|3[01])(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){2})(?::\d+)?(?:/|$)
+
# accept anything else
+.
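
The rules added above can be checked outside of Nutch with a short script.
The following is a minimal sketch, not part of the commit: it copies the four
exclusion rules from the diff with the leading '#' removed (that is, enabled),
keeps the final "+." rule, and applies them in file order so that the first
matching pattern decides. The function name accepts() and the sample URLs are
hypothetical. It assumes that the filter treats a rule as matching when the
pattern is found anywhere in the URL, which is mirrored here with re.search,
and that these particular expressions behave the same in Python as in Java.

```python
import re

# Rules copied from the diff, uncommented. '-' rejects, '+' accepts.
RULES = [
    # localhost and loop-back addresses
    r"-^https?://(?:localhost|127(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){3}|\[::1\])(?::\d+)?(?:/|$)",
    # 10.0.0.0/8
    r"-^https?://(?:10(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){3})(?::\d+)?(?:/|$)",
    # 192.168.0.0/16
    r"-^https?://(?:192\.168(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){2})(?::\d+)?(?:/|$)",
    # 172.16.0.0/12
    r"-^https?://(?:172\.(?:1[6789]|2[0-9]|3[01])(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){2})(?::\d+)?(?:/|$)",
    # accept anything else
    r"+.",
]

# Split each rule into its sign and its compiled pattern.
COMPILED = [(rule[0] == "+", re.compile(rule[1:])) for rule in RULES]


def accepts(url):
    """Return True if the first matching rule is a '+' rule (hypothetical helper)."""
    for accept, pattern in COMPILED:
        if pattern.search(url):
            return accept
    # No rule matched at all: treat the URL as rejected in this sketch.
    return False


if __name__ == "__main__":
    for url in (
        "http://localhost:8080",
        "http://10.1.2.3:50070/dfshealth.html",
        "http://172.32.0.1/",
        "https://example.org/",
    ):
        print(url, "accepted" if accepts(url) else "rejected")
```

Because each rule ends with (?::\d+)?(?:/|$), a host name that merely starts
with a private address or with "localhost", such as localhost.example.com, is
not excluded by these rules.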