Urlfilter Patch

Rod Taylor Fri, 11 Nov 2005 10:48:39 -0800

Add a few more extensions that I commonly see and that, as far as I am
aware, cannot be parsed: ZIP, mso, jar, bz2, XLS, pps, PPS, dot, etc.


Also add lines (commented out by default) for quickly rejecting URLs of
parsable non-HTML content (doc, png, pdf, rtf, etc.), for people who
want nothing but HTML or URLs that can lead us to HTML.
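
To illustrate how the two suffix rules behave, here is a minimal sketch in
Python. It is not Nutch code (Nutch is Java); it only mimics a case-sensitive
"find" style match against the end of the URL, which is an assumption about
how the filter applies these patterns. The names SKIP_UNPARSABLE,
SKIP_PARSABLE, accept and html_only are hypothetical; the suffixes are the
ones listed in the patch.

```python
import re

# Suffixes we cannot parse (always rejected). Names here are hypothetical.
SKIP_UNPARSABLE = re.compile(
    r"\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|ps|wmf|zip|ZIP|ppt|mpg|xls|XLS"
    r"|bz2|gz|mso|jar|rpm|tgz|Z|mov|MOV|exe|dot|pps|PPS)$"
)

# Parsable documents, rejected only when the optional rule is uncommented.
SKIP_PARSABLE = re.compile(r"\.(doc|DOC|png|PNG|pdf|PDF|rtf|xml)$")


def accept(url, html_only=False):
    """Return True if the URL survives the suffix rules."""
    if SKIP_UNPARSABLE.search(url):
        return False
    if html_only and SKIP_PARSABLE.search(url):
        return False
    return True


if __name__ == "__main__":
    for u in ("http://example.com/index.html",
              "http://example.com/archive.ZIP",
              "http://example.com/report.pdf"):
        print(u, accept(u), accept(u, html_only=True))
```

Because the match is case-sensitive, each case variant has to be listed
explicitly, which is why both zip and ZIP appear.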

-- 
Rod Taylor <[EMAIL PROTECTED]>

Index: crawl-urlfilter.txt.template
===================================================================
--- crawl-urlfilter.txt.template	(revision 332425)
+++ crawl-urlfilter.txt.template	(working copy)
@@ -12,8 +12,11 @@
 -^(file|ftp|mailto):
 
 # skip image and other suffixes we can't yet parse
--\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$
+-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|ps|wmf|zip|ZIP|ppt|mpg|xls|XLS|bz2|gz|mso|jar|rpm|tgz|Z|mov|MOV|exe|dot|pps|PPS)$
 
+# Parsable documents which you may wish to reject
+# -\.(doc|DOC|png|PNG|pdf|PDF|rtf|xml)$
+
 # skip URLs containing certain characters as probable queries, etc.
 -[?*!@=]
 
Index: regex-urlfilter.txt.template
===================================================================
--- regex-urlfilter.txt.template	(revision 332425)
+++ regex-urlfilter.txt.template	(working copy)
@@ -10,8 +10,11 @@
 -^(file|ftp|mailto):
 
 # skip image and other suffixes we can't yet parse
--\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$
+-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|ps|wmf|zip|ZIP|ppt|mpg|xls|XLS|bz2|gz|mso|jar|rpm|tgz|Z|mov|MOV|exe|dot|pps|PPS)$
 
+# Extended and Parsable documents which you may wish to reject
+# -\.(doc|DOC|png|PNG|pdf|PDF|rtf|xml)$
+
 # skip URLs containing certain characters as probable queries, etc.
 -[?*!@=]
