Agreed - looks like this list is too aggressive. A better one would be:

-(?i)\.(ai|asf|au|avi|bz2|bin|bmp|c|class|css|dmg|doc|dot|dvi|eps|exe|gif|gz|h|hqx|ico|iso|jar|java|jnlp|jpeg|jpg|js|lha|md5|mov|mp3|mp4|mpg|msi|ogg|pdf|png|pps|ppt|ps|psd|ram|rdf|rm|rpm|rss|rtf|sit|swf|tar|tbz|tbz2|tgz|tif|wav|wmf|wmv|xls|z|zip)\)?$

This removes xhtml, xml, php, jsp, py, pl, and cgi.
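
As a rough sanity check of the revised list, here is a minimal Python sketch (not Nutch code). It assumes the leading "-" is the filter file's exclusion marker rather than part of the regex, and that Python's re module treats this pattern the same way the Java regex engine does. The names EXCLUDE and is_excluded and the example.com URLs are made up for illustration.

```python
import re

# Suffix pattern copied from the revised list above, minus the leading "-"
# (assumed to be the filter file's "exclude" marker, not regex syntax).
EXCLUDE = re.compile(
    r"(?i)\.(ai|asf|au|avi|bz2|bin|bmp|c|class|css|dmg|doc|dot|dvi|eps|exe|gif"
    r"|gz|h|hqx|ico|iso|jar|java|jnlp|jpeg|jpg|js|lha|md5|mov|mp3|mp4|mpg|msi"
    r"|ogg|pdf|png|pps|ppt|ps|psd|ram|rdf|rm|rpm|rss|rtf|sit|swf|tar|tbz|tbz2"
    r"|tgz|tif|wav|wmf|wmv|xls|z|zip)\)?$"
)


def is_excluded(url: str) -> bool:
    """Return True if the URL ends in one of the excluded suffixes."""
    return EXCLUDE.search(url) is not None


# Hypothetical URLs, used only to exercise the pattern.
for url in (
    "http://example.com/report.PDF",   # excluded; (?i) ignores case
    "http://example.com/archive.zip)", # excluded; trailing ")" is allowed
    "http://example.com/index.php",    # kept; php was dropped from the list
    "http://example.com/page.xhtml",   # kept; xhtml was dropped from the list
):
    print(url, "excluded" if is_excluded(url) else "kept")
```

The pattern is anchored with $, so it only tests the very end of the URL string; a URL like page.php?file=a.zip would still be excluded by this rule alone.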

We've seen php/jsp/py/pl/cgi URLs in our error logs as unparsable, but it looks like most of those cases occur when the server is misconfigured and winds up returning the source code, as opposed to the result of executing the code.

-- Ken

On Thu, 2005-12-01 at 18:53 +0000, Howie Wang wrote:
 .And .xhtml seem like they
 would be parsable by the default HTML parser.

Ditto for .xml. It is a valid (though seldom used) xhtml extension.

 Howie

 >From: Doug Cutting <[EMAIL PROTECTED]>
 >
 >Ken Krugler wrote:
 >>For what it's worth, below is the filter list we're using for doing an
 >>html-centric crawl (no word docs, for example). Using the (?i) means we
 >>don't need to have upper & lower-case versions of the suffixes.
 >>
 >>
 >>-(?i)\.(ai|asf|au|avi|bz2|bin|bmp|c|cgi|class|css|dmg|doc|dot|dvi|eps|exe|gif|gz|h|hqx|ico|iso|jar|java|jnlp|jpeg|jpg|js|jsp|lha|md5|mov|mp3|mp4|mpg|msi|ogg|pdf|php|pl|png|pps|ppt|ps|psd|py|ram|rdf|rm|rpm|rss|rtf|sit|swf|tar|tbz|tbz2|tgz|tif|wav|wmf|wmv|xhtml|xls|xml|z|zip)\)?$
 >>
 >This looks like a more complete suffix list.
 >
 >Should we use this as the default?  By default only html and text parsers
 >are enabled, so perhaps that's all we should accept.
 >
 >Why do you exclude .php urls?  These are simply dynamic pages, no?
 >Similarly, .jsp and .py are frequently suffixes that return html.  Are
 >there other suffixes we should remove from this list before we make it the
 >default exclusion list?
 >
 >Doug



--
Rod Taylor <[EMAIL PROTECTED]>


--
Ken Krugler
Krugle, Inc.
+1 530-470-9200
