Author: jnioche
Date: Mon Jul 18 09:26:39 2011
New Revision: 1147798

URL: http://svn.apache.org/viewvc?rev=1147798&view=rev
Log:
NUTCH-1043 Add pattern for filtering .js in default url filters

Modified:
    nutch/trunk/CHANGES.txt
    nutch/trunk/conf/automaton-urlfilter.txt.template
    nutch/trunk/conf/regex-urlfilter.txt.template

Modified: nutch/trunk/CHANGES.txt
URL: http://svn.apache.org/viewvc/nutch/trunk/CHANGES.txt?rev=1147798&r1=1147797&r2=1147798&view=diff
==============================================================================
--- nutch/trunk/CHANGES.txt (original)
+++ nutch/trunk/CHANGES.txt Mon Jul 18 09:26:39 2011
@@ -2,6 +2,8 @@ Nutch Change Log
 
 Release 2.0 - Current Development
 
+* NUTCH-1043 Add pattern for filtering .js in default url filters (jnioche)
+
 * NUTCH-1027 Degrade log level of `can't find rules for scope` (markus)
 
 * NUTCH-1011 Normalize duplicate slashes in URL's (markus)

Modified: nutch/trunk/conf/automaton-urlfilter.txt.template
URL: http://svn.apache.org/viewvc/nutch/trunk/conf/automaton-urlfilter.txt.template?rev=1147798&r1=1147797&r2=1147798&view=diff
==============================================================================
--- nutch/trunk/conf/automaton-urlfilter.txt.template (original)
+++ nutch/trunk/conf/automaton-urlfilter.txt.template Mon Jul 18 09:26:39 2011
@@ -25,7 +25,8 @@
 -(file|ftp|mailto):.*
 
 # skip image and other suffixes we can't yet parse
--.*\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)
+# for a more extensive coverage use the urlfilter-suffix plugin
+-.*\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)
 
 # skip URLs containing certain characters as probable queries, etc.
 -.*[?*!@=].*

Modified: nutch/trunk/conf/regex-urlfilter.txt.template
URL: http://svn.apache.org/viewvc/nutch/trunk/conf/regex-urlfilter.txt.template?rev=1147798&r1=1147797&r2=1147798&view=diff
==============================================================================
--- nutch/trunk/conf/regex-urlfilter.txt.template (original)
+++ nutch/trunk/conf/regex-urlfilter.txt.template Mon Jul 18 09:26:39 2011
@@ -26,7 +26,8 @@
 -^(file|ftp|mailto):
 
 # skip image and other suffixes we can't yet parse
--\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
+# for a more extensive coverage use the urlfilter-suffix plugin
+-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
 
 # skip URLs containing certain characters as probable queries, etc.
 -[?*!@=]
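
Note on the new suffix rule: the sketch below is an illustration only and is not part of this commit. It copies the new exclusion pattern from regex-urlfilter.txt.template (without the leading "-") and checks a few made-up URLs against it with java.util.regex. The class name SuffixFilterSketch and the helper passes() are hypothetical, and the sketch assumes the rule is applied as an unanchored search, so only the trailing "$" ties it to the end of the URL.

```java
import java.util.regex.Pattern;

public class SuffixFilterSketch {
    // Pattern copied from the added line of regex-urlfilter.txt.template,
    // minus the leading "-" that marks the rule as an exclusion.
    static final Pattern SKIP_SUFFIX = Pattern.compile(
        "\\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF"
        + "|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV"
        + "|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$");

    // Hypothetical helper: true when this one rule would NOT exclude the URL.
    // Assumes unanchored search semantics (Matcher.find()).
    static boolean passes(String url) {
        return !SKIP_SUFFIX.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://example.com/app.js",     // excluded: new js suffix
            "http://example.com/APP.JS",     // excluded: upper-case variant
            "http://example.com/style.CSS",  // excluded: CSS added in this revision
            "http://example.com/app.Js",     // passes: mixed case is not listed
            "http://example.com/data.json",  // passes: suffix is json, not js
            "http://example.com/index.html"  // passes
        };
        for (String url : urls) {
            System.out.println((passes(url) ? "pass  " : "skip  ") + url);
        }
    }
}
```

Because the pattern lists only all-lower-case and all-upper-case spellings, mixed-case suffixes such as ".Js" still pass this rule, which is one reason the template comment points to the urlfilter-suffix plugin for more extensive coverage.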

