On Mon, 2005-11-28 at 11:44 -0800, Doug Cutting wrote:
> Rod Taylor wrote:
> > Add a few more extensions which I commonly see and cannot be parsed
> > (that I am aware of). ZIP, mso, jar, bz2, XLS, pps, PPS, dot, etc.
>
> [ ... ]
>
> >  # skip image and other suffixes we can't yet parse
> > --\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$
> > +-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|ps|wmf|zip|ZIP|ppt|mpg|xls|XLS||bz2|gz|mso|jar|rpm|tgz|Z|mov|MOV|exe|dot|pps|PPS)$

For what it's worth, below is the filter list we're using for an HTML-centric crawl (no Word docs, for example). The leading (?i) makes the match case-insensitive, so we don't need both upper- and lower-case versions of each suffix.

-- Ken

-(?i)\.(ai|asf|au|avi|bz2|bin|bmp|c|cgi|class|css|dmg|doc|dot|dvi|eps|exe|gif|gz|h|hqx|ico|iso|jar|java|jnlp|jpeg|jpg|js|jsp|lha|md5|mov|mp3|mp4|mpg|msi|ogg|pdf|php|pl|png|pps|ppt|ps|psd|py|ram|rdf|rm|rpm|rss|rtf|sit|swf|tar|tbz|tbz2|tgz|tif|wav|wmf|wmv|xhtml|xls|xml|z|zip)\)?$
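To see what the rule above accepts and rejects, here is a rough sketch in Python rather than the crawler's own code. It assumes the leading "-" is the rule's sign (exclude) and not part of the regular expression, and that the pattern is applied as a search anywhere in the URL; the helper names parse_rule and is_excluded and the example.com URLs are made up for illustration.

```python
import re

# Rule copied verbatim from the message above. Assumption: the first
# character is the rule's sign ("-" = exclude), the rest is the regex.
RULE = (
    r"-(?i)\.(ai|asf|au|avi|bz2|bin|bmp|c|cgi|class|css|dmg|doc|dot|dvi|eps"
    r"|exe|gif|gz|h|hqx|ico|iso|jar|java|jnlp|jpeg|jpg|js|jsp|lha|md5|mov"
    r"|mp3|mp4|mpg|msi|ogg|pdf|php|pl|png|pps|ppt|ps|psd|py|ram|rdf|rm|rpm"
    r"|rss|rtf|sit|swf|tar|tbz|tbz2|tgz|tif|wav|wmf|wmv|xhtml|xls|xml|z|zip)\)?$"
)


def parse_rule(rule):
    """Split a signed rule into (is_include, compiled_pattern)."""
    sign, pattern = rule[:1], rule[1:]
    if sign not in ("+", "-"):
        raise ValueError("rule must start with '+' or '-'")
    # (?i) sits at the very start of the pattern, so the whole
    # expression is matched case-insensitively.
    return sign == "+", re.compile(pattern)


def is_excluded(url, rule=RULE):
    """Return True if an exclude rule matches somewhere in the URL."""
    is_include, pattern = parse_rule(rule)
    return (not is_include) and pattern.search(url) is not None


if __name__ == "__main__":
    for url in (
        "http://example.com/logo.GIF",       # upper-case suffix
        "http://example.com/index.html",     # not in the list
        "http://example.com/paper.pdf)",     # trailing ")" allowed by \)?
        "http://example.com/page.php?id=1",  # suffix is not at the end
    ):
        print(url, is_excluded(url))
```

Two things stand out when running this: the optional \)? lets the rule catch URLs that were extracted with a stray closing parenthesis, and because the pattern is anchored with $, a URL such as page.php?id=1 is not excluded, since the suffix is followed by a query string.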


--
Ken Krugler
Krugle, Inc.
+1 530-470-9200
