On Thu, 2005-12-01 at 18:53 +0000, Howie Wang wrote:
> .And .xhtml seem like they
> would be parsable by the default HTML parser.

Ditto for .xml. It is a valid (though seldom used) xhtml extension.

> Howie
>
> >From: Doug Cutting <[EMAIL PROTECTED]>
> >
> >Ken Krugler wrote:
> >>For what it's worth, below is the filter list we're using for doing an
> >>html-centric crawl (no word docs, for example). Using the (?i) means we
> >>don't need to have upper & lower-case versions of the suffixes.
> >>
> >>-(?i)\.(ai|asf|au|avi|bz2|bin|bmp|c|cgi|class|css|dmg|doc|dot|dvi|eps|exe|gif|gz|h|hqx|ico|iso|jar|java|jnlp|jpeg|jpg|js|jsp|lha|md5|mov|mp3|mp4|mpg|msi|ogg|pdf|php|pl|png|pps|ppt|ps|psd|py|ram|rdf|rm|rpm|rss|rtf|sit|swf|tar|tbz|tbz2|tgz|tif|wav|wmf|wmv|xhtml|xls|xml|z|zip)\)?$
> >
> >This looks like a more complete suffix list.
> >
> >Should we use this as the default? By default only html and text parsers
> >are enabled, so perhaps that's all we should accept.
> >
> >Why do you exclude .php urls? These are simply dynamic pages, no?
> >Similarly, .jsp and .py are frequently suffixes that return html. Are
> >there other suffixes we should remove from this list before we make it the
> >default exclusion list?
> >
> >Doug
> >
>
-- 
Rod Taylor <[EMAIL PROTECTED]>
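For anyone wanting to see what the quoted filter does to a given URL, here is a rough sketch in Python. It assumes the leading "-" is the filter file's "exclude" marker rather than part of the regex, and it uses Python's re module as a stand-in for the Java regex engine the crawler actually uses, so treat it as an illustration only. The names SUFFIXES, build_filter, is_excluded, and TRIMMED are made up for this example, as are the example.com URLs.

```python
import re

# Suffix alternation copied from Ken Krugler's filter line.
SUFFIXES = (
    "ai|asf|au|avi|bz2|bin|bmp|c|cgi|class|css|dmg|doc|dot|dvi|eps|exe|gif|gz|h|"
    "hqx|ico|iso|jar|java|jnlp|jpeg|jpg|js|jsp|lha|md5|mov|mp3|mp4|mpg|msi|ogg|"
    "pdf|php|pl|png|pps|ppt|ps|psd|py|ram|rdf|rm|rpm|rss|rtf|sit|swf|tar|tbz|"
    "tbz2|tgz|tif|wav|wmf|wmv|xhtml|xls|xml|z|zip"
).split("|")


def build_filter(suffixes):
    """Compile an exclusion regex shaped like the one in the thread.

    (?i) makes the match case-insensitive, and \\)?$ anchors the suffix at
    the end of the URL while tolerating one trailing ")".
    """
    return re.compile(r"(?i)\.(" + "|".join(suffixes) + r")\)?$")


def is_excluded(pattern, url):
    """Return True if the URL would be dropped by the exclusion pattern."""
    return pattern.search(url) is not None


# The list as posted.
ORIGINAL = build_filter(SUFFIXES)

# Hypothetical trimmed list: drop the suffixes the thread suggests usually
# return (X)HTML and so could be left to the default HTML parser.
KEEP_CRAWLABLE = {"php", "jsp", "py", "xhtml", "xml"}
TRIMMED = build_filter([s for s in SUFFIXES if s not in KEEP_CRAWLABLE])

if __name__ == "__main__":
    for url in (
        "http://example.com/index.html",
        "http://example.com/logo.GIF",
        "http://example.com/page.php",
        "http://example.com/page.php?id=3",
    ):
        print(url, is_excluded(ORIGINAL, url), is_excluded(TRIMMED, url))
```

One thing the sketch makes visible: because the pattern is anchored at the end of the URL, a dynamic page with a query string (page.php?id=3) is not caught by the suffix rule either way, so removing php from the list only changes the outcome for URLs that end in .php.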
