Re: Urlfilter Patch

Doug Cutting Thu, 01 Dec 2005 10:43:33 -0800

Ken Krugler wrote:

For what it's worth, below is the filter list we're using for doing an html-centric crawl (no word docs, for example). Using the (?i) means we don't need to have upper & lower-case versions of the suffixes.
-(?i)\.(ai|asf|au|avi|bz2|bin|bmp|c|cgi|class|css|dmg|doc|dot|dvi|eps|exe|gif|gz|h|hqx|ico|iso|jar|java|jnlp|jpeg|jpg|js|jsp|lha|md5|mov|mp3|mp4|mpg|msi|ogg|pdf|php|pl|png|pps|ppt|ps|psd|py|ram|rdf|rm|rpm|rss|rtf|sit|swf|tar|tbz|tbz2|tgz|tif|wav|wmf|wmv|xhtml|xls|xml|z|zip)\)?$
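
A minimal Python sketch of how this rule behaves, for illustration only. It assumes the leading "-" is the filter file's exclusion marker rather than part of the regex, so it is dropped here. The name is_excluded is hypothetical, and Python's re module stands in for the Java regex engine that would actually evaluate the rule.

```python
import re

# Suffix alternation copied from the filter line above. The leading "-"
# (assumed to be the exclusion marker) is not part of the regex.
EXCLUDE = re.compile(
    r"(?i)\.(ai|asf|au|avi|bz2|bin|bmp|c|cgi|class|css|dmg|doc|dot|dvi|eps"
    r"|exe|gif|gz|h|hqx|ico|iso|jar|java|jnlp|jpeg|jpg|js|jsp|lha|md5|mov"
    r"|mp3|mp4|mpg|msi|ogg|pdf|php|pl|png|pps|ppt|ps|psd|py|ram|rdf|rm|rpm"
    r"|rss|rtf|sit|swf|tar|tbz|tbz2|tgz|tif|wav|wmf|wmv|xhtml|xls|xml|z|zip)\)?$"
)


def is_excluded(url: str) -> bool:
    """Return True if the URL ends in one of the listed suffixes.

    (?i) makes the match case-insensitive, and the trailing \\)? tolerates
    a stray closing parenthesis after the suffix.
    """
    return EXCLUDE.search(url) is not None


if __name__ == "__main__":
    for url in (
        "http://example.com/report.PDF",   # upper-case suffix
        "http://example.com/logo.gif)",    # stray trailing parenthesis
        "http://example.com/index.html",   # html is not in the list
    ):
        print(url, is_excluded(url))
```

Because of the $ anchor, a URL such as http://example.com/page.php?id=3 is not excluded: the suffix has to be the last thing in the URL.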


This looks like a more complete suffix list.

Should we use this as the default? By default only html and text parsers are enabled, so perhaps that's all we should accept.

Why do you exclude .php urls? These are simply dynamic pages, no? Similarly, .jsp and .py are frequently suffixes that return html. Are there other suffixes we should remove from this list before we make it the default exclusion list?
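
To make the question concrete, here is a rough Python sketch of trimming dynamic-page suffixes out of an exclusion rule. The suffix list is a shortened, illustrative subset of the one quoted above, and the names build_exclusion and DYNAMIC are hypothetical. Whether php, jsp and py should really be dropped from the default is exactly what is being asked, so this is not a proposed default.

```python
import re

# Shortened, illustrative subset of the quoted suffix list.
SUFFIXES = ["gif", "jpg", "pdf", "php", "jsp", "py", "zip"]

# Suffixes that often serve dynamically generated html (assumption
# taken from the question above, not a settled list).
DYNAMIC = {"php", "jsp", "py"}


def build_exclusion(suffixes, allow=frozenset()):
    """Compile a case-insensitive exclusion regex, skipping allowed suffixes."""
    kept = [re.escape(s) for s in suffixes if s not in allow]
    return re.compile(r"(?i)\.(" + "|".join(kept) + r")\)?$")


full = build_exclusion(SUFFIXES)
trimmed = build_exclusion(SUFFIXES, allow=DYNAMIC)

if __name__ == "__main__":
    url = "http://example.com/index.php"
    print(bool(full.search(url)))     # excluded by the full rule
    print(bool(trimmed.search(url)))  # allowed once php is removed
```

Static suffixes such as pdf or zip are still excluded by the trimmed rule; only the URLs ending in the allowed suffixes get through to the parsers.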


Doug
