Add a few more extensions that I commonly see and that, as far as I am aware, cannot be parsed: ZIP, mso, jar, bz2, XLS, pps, PPS, dot, etc.
Add extra lines (commented out by default) for quickly rejecting URLs that point to extended content types (doc, png, pdf, rtf, etc.), for people who want nothing but HTML or URLs that can lead us to HTML.
-- Rod Taylor <[EMAIL PROTECTED]>
Index: crawl-urlfilter.txt.template
===================================================================
--- crawl-urlfilter.txt.template (revision 332425)
+++ crawl-urlfilter.txt.template (working copy)
@@ -12,8 +12,11 @@
 -^(file|ftp|mailto):
 # skip image and other suffixes we can't yet parse
--\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$
+-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|ps|wmf|zip|ZIP|ppt|mpg|xls|XLS|bz2|gz|mso|jar|rpm|tgz|Z|mov|MOV|exe|dot|pps|PPS)$
+# Parsable documents which you may wish to reject
+# -\.(doc|DOC|png|PNG|pdf|PDF|rtf|xml)$
+
 # skip URLs containing certain characters as probable queries, etc.
 [EMAIL PROTECTED]
Index: regex-urlfilter.txt.template
===================================================================
--- regex-urlfilter.txt.template (revision 332425)
+++ regex-urlfilter.txt.template (working copy)
@@ -10,8 +10,11 @@
 -^(file|ftp|mailto):
 # skip image and other suffixes we can't yet parse
--\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$
+-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|ps|wmf|zip|ZIP|ppt|mpg|xls|XLS|bz2|gz|mso|jar|rpm|tgz|Z|mov|MOV|exe|dot|pps|PPS)$
+# Extended and Parsable documents which you may wish to reject
+# -\.(doc|DOC|png|PNG|pdf|PDF|rtf|xml)$
+
 # skip URLs containing certain characters as probable queries, etc.
 [EMAIL PROTECTED]
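To illustrate how the new suffix rules behave, here is a minimal Python sketch, not the actual filter implementation. It assumes the usual reading of these files: rules are tried in order, the first matching pattern decides, a leading "-" rejects, a leading "+" accepts, and a URL that matches nothing is ignored. The suffix pattern is written with a single "|" between XLS and bz2, since an empty alternative would also match URLs that end in a bare dot. The optional rule is uncommented here for demonstration only, and the final catch-all accept rule, the RULES list, and the accepts() helper are hypothetical names that are not part of the patch.

```python
import re

# Rules in file order as (sign, pattern): "-" rejects, "+" accepts.
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    # Suffixes that cannot be parsed (new list from the patch).
    ("-", r"\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|ps|wmf|zip|ZIP|ppt|mpg"
          r"|xls|XLS|bz2|gz|mso|jar|rpm|tgz|Z|mov|MOV|exe|dot|pps|PPS)$"),
    # Optional rule from the patch, uncommented here for illustration.
    ("-", r"\.(doc|DOC|png|PNG|pdf|PDF|rtf|xml)$"),
    # Hypothetical catch-all accept rule (assumption, not in the patch).
    ("+", r"."),
]

COMPILED = [(sign, re.compile(pattern)) for sign, pattern in RULES]


def accepts(url):
    """Return True if the first matching rule is an accept rule."""
    for sign, pattern in COMPILED:
        if pattern.search(url):
            return sign == "+"
    return False  # no rule matched: the URL is ignored


if __name__ == "__main__":
    for url in (
        "http://example.com/index.html",
        "http://example.com/archive.ZIP",
        "http://example.com/report.pdf",
        "ftp://example.com/file.txt",
    ):
        print(url, "accepted" if accepts(url) else "rejected")
```

The patterns are case-sensitive, which is why the patch lists both zip and ZIP; a mixed-case suffix such as .Zip would still pass the suffix rule in this sketch.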
