.pl files are often just Perl CGI scripts. And .xhtml files seem like
they would be parsable by the default HTML parser.
Howie
From: Doug Cutting <[EMAIL PROTECTED]>
Ken Krugler wrote:
For what it's worth, below is the filter list we're using for doing an
HTML-centric crawl (no Word docs, for example). Using the (?i) flag makes
the match case-insensitive, so we don't need both upper- and lower-case
versions of the suffixes.
-(?i)\.(ai|asf|au|avi|bz2|bin|bmp|c|cgi|class|css|dmg|doc|dot|dvi|eps|exe|gif|gz|h|hqx|ico|iso|jar|java|jnlp|jpeg|jpg|js|jsp|lha|md5|mov|mp3|mp4|mpg|msi|ogg|pdf|php|pl|png|pps|ppt|ps|psd|py|ram|rdf|rm|rpm|rss|rtf|sit|swf|tar|tbz|tbz2|tgz|tif|wav|wmf|wmv|xhtml|xls|xml|z|zip)\)?$
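As a rough illustration of how a suffix filter like this behaves, here is a minimal Python sketch. It assumes the leading "-" only marks the line as an exclusion rule and is not part of the regular expression, and that the pattern is searched for anywhere in the URL rather than matched against the whole string. The shortened suffix list and the helper name is_excluded are hypothetical, chosen only to keep the example small.

```python
import re

# Illustrative subset of the suffix list above, not the full list.
SUFFIXES = ["gif", "jpg", "jpeg", "png", "pdf", "zip", "exe", "css", "js"]

# (?i) makes the whole pattern case-insensitive.
# \)? tolerates one trailing ")" after the suffix, and $ anchors the
# match at the end of the URL.
EXCLUDE = re.compile(r"(?i)\.(" + "|".join(SUFFIXES) + r")\)?$")


def is_excluded(url: str) -> bool:
    """Return True if the URL ends in one of the excluded suffixes."""
    return EXCLUDE.search(url) is not None


if __name__ == "__main__":
    for url in (
        "http://example.com/logo.GIF",
        "http://example.com/report.pdf)",
        "http://example.com/index.html",
        "http://example.com/logo.gif?size=large",
    ):
        print(url, is_excluded(url))
```

Because the pattern is anchored with $, a URL that carries a query string after the suffix (the last case above) is not caught by this rule.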
This looks like a more complete suffix list.
Should we use this as the default? By default only html and text parsers
are enabled, so perhaps that's all we should accept.
Why do you exclude .php URLs? These are simply dynamic pages, no?
Similarly, .jsp and .py are suffixes that frequently return HTML. Are
there other suffixes we should remove from this list before we make it the
default exclusion list?
Doug