Agreed - looks like this list is too aggressive. A better one would be:
-(?i)\.(ai|asf|au|avi|bz2|bin|bmp|c|class|css|dmg|doc|dot|dvi|eps|exe|gif|gz|h|hqx|ico|iso|jar|java|jnlp|jpeg|jpg|js|lha|md5|mov|mp3|mp4|mpg|msi|ogg|pdf|png|pps|ppt|ps|psd|ram|rdf|rm|rpm|rss|rtf|sit|swf|tar|tbz|tbz2|tgz|tif|wav|wmf|wmv|xls|z|zip)\)?$
This removes xhtml, xml, php, jsp, py, pl, and cgi.
We've seen php/jsp/py/pl/cgi in our error logs as unparsable, but it
looks like most cases are when the server is misconfigured and
winds up returning the source code, as opposed to the result of
executing the code.
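
For anyone who wants to sanity-check the revised pattern outside of a crawl, here is a rough Python sketch. It is not how Nutch itself applies the filter: the leading "-" (the exclude marker in the filter file) is dropped, the pattern is assumed to be applied as a search against the whole URL, and is_excluded() plus the example.com URLs are made up for illustration.

```python
import re

# Suffix alternatives copied from the revised list above.
SUFFIXES = (
    "ai|asf|au|avi|bz2|bin|bmp|c|class|css|dmg|doc|dot|dvi|eps|exe|gif|gz|h|"
    "hqx|ico|iso|jar|java|jnlp|jpeg|jpg|js|lha|md5|mov|mp3|mp4|mpg|msi|ogg|"
    "pdf|png|pps|ppt|ps|psd|ram|rdf|rm|rpm|rss|rtf|sit|swf|tar|tbz|tbz2|tgz|"
    "tif|wav|wmf|wmv|xls|z|zip"
)

# (?i) makes the match case-insensitive; \)?$ tolerates one trailing ")".
EXCLUDE = re.compile(r"(?i)\.(" + SUFFIXES + r")\)?$")


def is_excluded(url):
    """Return True if the URL ends in one of the excluded suffixes."""
    return EXCLUDE.search(url) is not None


if __name__ == "__main__":
    # Hypothetical URLs, for illustration only.
    for url in (
        "http://example.com/logo.PNG",
        "http://example.com/index.php",
        "http://example.com/page.xhtml",
        "http://example.com/archive.zip)",
    ):
        print(url, "excluded" if is_excluded(url) else "kept")
```

With the revised list, the .php and .xhtml URLs are kept, while the image and archive URLs are still excluded.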
-- Ken
On Thu, 2005-12-01 at 18:53 +0000, Howie Wang wrote:
.And .xhtml seem like they
would be parsable by the default HTML parser.
Ditto for .xml. It is a valid (though seldom used) xhtml extension.
Howie
>From: Doug Cutting <[EMAIL PROTECTED]>
>
>Ken Krugler wrote:
>>For what it's worth, below is the filter list we're using for doing an
>>html-centric crawl (no word docs, for example). Using the (?i) means we
>>don't need to have upper & lower-case versions of the suffixes.
>>
>
>>-(?i)\.(ai|asf|au|avi|bz2|bin|bmp|c|cgi|class|css|dmg|doc|dot|dvi|eps|exe|gif|gz|h|hqx|ico|iso|jar|java|jnlp|jpeg|jpg|js|jsp|lha|md5|mov|mp3|mp4|mpg|msi|ogg|pdf|php|pl|png|pps|ppt|ps|psd|py|ram|rdf|rm|rpm|rss|rtf|sit|swf|tar|tbz|tbz2|tgz|tif|wav|wmf|wmv|xhtml|xls|xml|z|zip)\)?$
>
>This looks like a more complete suffix list.
>
>Should we use this as the default? By default only html and text parsers
>are enabled, so perhaps that's all we should accept.
>
>Why do you exclude .php urls? These are simply dynamic pages, no?
>Similarly, .jsp and .py are frequently suffixes that return html. Are
>there other suffixes we should remove from this list before we make it the
>default exclusion list?
>
>Doug
--
Rod Taylor <[EMAIL PROTECTED]>
--
Ken Krugler
Krugle, Inc.
+1 530-470-9200
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers