Author: jnioche
Date: Mon Jul 18 09:26:39 2011
New Revision: 1147798
URL: http://svn.apache.org/viewvc?rev=1147798&view=rev
Log:
NUTCH-1043 Add pattern for filtering .js in default url filters
Modified:
nutch/trunk/CHANGES.txt
nutch/trunk/conf/automaton-urlfilter.txt.template
nutch/trunk/conf/regex-urlfilter.txt.template
Modified: nutch/trunk/CHANGES.txt
URL:
http://svn.apache.org/viewvc/nutch/trunk/CHANGES.txt?rev=1147798&r1=1147797&r2=1147798&view=diff
==============================================================================
--- nutch/trunk/CHANGES.txt (original)
+++ nutch/trunk/CHANGES.txt Mon Jul 18 09:26:39 2011
@@ -2,6 +2,8 @@ Nutch Change Log
 Release 2.0 - Current Development
+* NUTCH-1043 Add pattern for filtering .js in default url filters (jnioche)
+
 * NUTCH-1027 Degrade log level of `can't find rules for scope` (markus)
 * NUTCH-1011 Normalize duplicate slashes in URL's (markus)
Modified: nutch/trunk/conf/automaton-urlfilter.txt.template
URL:
http://svn.apache.org/viewvc/nutch/trunk/conf/automaton-urlfilter.txt.template?rev=1147798&r1=1147797&r2=1147798&view=diff
==============================================================================
--- nutch/trunk/conf/automaton-urlfilter.txt.template (original)
+++ nutch/trunk/conf/automaton-urlfilter.txt.template Mon Jul 18 09:26:39 2011
@@ -25,7 +25,8 @@
 -(file|ftp|mailto):.*
 # skip image and other suffixes we can't yet parse
--.*\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)
+# for a more extensive coverage use the urlfilter-suffix plugin
+-.*\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)
 # skip URLs containing certain characters as probable queries, etc.
 -.*[?*!@=].*
Modified: nutch/trunk/conf/regex-urlfilter.txt.template
URL:
http://svn.apache.org/viewvc/nutch/trunk/conf/regex-urlfilter.txt.template?rev=1147798&r1=1147797&r2=1147798&view=diff
==============================================================================
--- nutch/trunk/conf/regex-urlfilter.txt.template (original)
+++ nutch/trunk/conf/regex-urlfilter.txt.template Mon Jul 18 09:26:39 2011
@@ -26,7 +26,8 @@
 -^(file|ftp|mailto):
 # skip image and other suffixes we can't yet parse
--\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
+# for a more extensive coverage use the urlfilter-suffix plugin
+-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
 # skip URLs containing certain characters as probable queries, etc.
 -[?*!@=]
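
Note on the effect of the updated suffix rule: the sketch below is only an illustration, not part of the commit. It assumes the regex filter applies each pattern as an unanchored search against the URL, and it uses Python's re module as a stand-in for the Java regex engine. The helper name is_skipped and the example.com URLs are hypothetical.

```python
import re

# Suffix alternation copied from the new rule in regex-urlfilter.txt.template.
# The leading '-' of the rule (meaning "reject on match") is not part of the regex.
SUFFIX_RULE = re.compile(
    r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|"
    r"zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|"
    r"exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$"
)


def is_skipped(url: str) -> bool:
    """Hypothetical helper: True if the suffix rule alone would reject the URL."""
    # Assumption: the filter uses an unanchored search, so '$' pins the suffix
    # to the end of the URL while the start of the URL is unconstrained.
    return SUFFIX_RULE.search(url) is not None


if __name__ == "__main__":
    for url in (
        "http://example.com/static/app.js",
        "http://example.com/static/APP.JS",
        "http://example.com/app.Js",
        "http://example.com/data.json",
        "http://example.com/index.html",
    ):
        print(url, "skipped" if is_skipped(url) else "kept")
```

Only the all-lowercase and all-uppercase spellings are listed in the alternation, so a mixed-case suffix such as .Js is not caught by this rule; the in-file comment points to the urlfilter-suffix plugin for broader coverage.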