This is an automated email from the ASF dual-hosted git repository.

snagel pushed a commit to branch 2.x
in repository https://gitbox.apache.org/repos/asf/nutch.git

The following commit(s) were added to refs/heads/2.x by this push:
     new d62ece0  NUTCH-2527 URL filter: provide rules to exclude localhost and private address spaces
d62ece0 is described below

commit d62ece00469fd6b2012418062602246f090e10c5
Author: Sebastian Nagel <sna...@apache.org>
AuthorDate: Thu Apr 26 12:40:57 2018 +0200

    NUTCH-2527 URL filter: provide rules to exclude localhost and private address spaces
---
 conf/regex-urlfilter.txt.template | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/conf/regex-urlfilter.txt.template b/conf/regex-urlfilter.txt.template
index bcf9c87..b060cbb 100644
--- a/conf/regex-urlfilter.txt.template
+++ b/conf/regex-urlfilter.txt.template
@@ -16,6 +16,7 @@
 
 # The default url filter.
 # Better for whole-internet crawling.
+# Please comment/uncomment rules to your needs.
 
 # Each non-comment, non-blank line contains a regular expression
 # prefixed by '+' or '-'. The first matching pattern in the file
@@ -35,5 +36,26 @@
 # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
 -.*(/[^/]+)/[^/]+\1/[^/]+\1/
 
+# For safe web crawling if crawled content is exposed in a public search interface:
+# - exclude private network addresses to avoid that information
+# can be leaked by placing links pointing to web interfaces of services
+# running on the crawling machines (e.g., HDFS, Hadoop YARN)
+# - in addition, file:// URLs should be either excluded by a URL filter rule
+# or ignored by not enabling protocol-file
+#
+# - exclude localhost and loop-back addresses
+# http://localhost:8080
+# http://127.0.0.1/ .. http://127.255.255.255/
+# http://[::1]/
+#-^https?://(?:localhost|127(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){3}|\[::1\])(?::\d+)?(?:/|$)
+#
+# - exclude private IP address spaces
+# 10.0.0.0/8
+#-^https?://(?:10(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){3})(?::\d+)?(?:/|$)
+# 192.168.0.0/16
+#-^https?://(?:192\.168(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){2})(?::\d+)?(?:/|$)
+# 172.16.0.0/12
+#-^https?://(?:172\.(?:1[6789]|2[0-9]|3[01])(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){2})(?::\d+)?(?:/|$)
+
 # accept anything else
 +.

-- 
To stop receiving notification emails like this one, please contact
sna...@apache.org.
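
The commented-out exclusion rules in this diff can be sanity-checked outside of Nutch. The following is only a rough sketch under stated assumptions: it uses Python's `re` module in place of the Java regex engine that Nutch runs on, it assumes the four new rules have been uncommented, and it assumes a rule matches when the pattern is found anywhere in the URL (which is consistent with the final `+.` rule accepting everything). The names `RULES` and `accepts` are hypothetical helpers for illustration and are not part of Nutch.

```python
import re

# Rules copied from the diff, in file order, assuming the leading '#'
# has been removed from the four new '-' lines. The sign is kept
# separately from the pattern: '-' excludes, '+' includes.
RULES = [
    # localhost and loop-back addresses
    ("-", r"^https?://(?:localhost|127(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){3}|\[::1\])(?::\d+)?(?:/|$)"),
    # 10.0.0.0/8
    ("-", r"^https?://(?:10(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){3})(?::\d+)?(?:/|$)"),
    # 192.168.0.0/16
    ("-", r"^https?://(?:192\.168(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){2})(?::\d+)?(?:/|$)"),
    # 172.16.0.0/12
    ("-", r"^https?://(?:172\.(?:1[6789]|2[0-9]|3[01])(?:\.(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))){2})(?::\d+)?(?:/|$)"),
    # accept anything else
    ("+", r"."),
]

# Compile once; each entry becomes (sign, compiled pattern).
COMPILED = [(sign, re.compile(pattern)) for sign, pattern in RULES]


def accepts(url: str) -> bool:
    """Return True if the first matching rule is a '+' rule.

    Assumption: first match wins and a URL matching no rule is rejected.
    """
    for sign, pattern in COMPILED:
        if pattern.search(url):
            return sign == "+"
    return False


if __name__ == "__main__":
    samples = [
        "http://localhost:8080",
        "http://127.0.0.1/",
        "http://[::1]/",
        "http://10.1.2.3:50070/dfshealth.html",
        "https://192.168.0.1",
        "http://172.31.255.255/",
        "http://172.32.0.1/",
        "http://localhost.example.com/",
        "http://example.com/",
    ]
    for url in samples:
        print("accept" if accepts(url) else "reject", url)
```

Because every rule ends in `(?::\d+)?(?:/|$)`, the host part must end right after the address or port, so public hostnames that merely begin with `localhost` or `10.` are not caught by these rules. Behaviour under the real Java engine should still be confirmed with Nutch itself before relying on the rules.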