Here's a better way:

+^http://([a-z0-9]*\.)*(com|org|net|biz|edu|mil|us|info|cc)/
FYI, this will not remove non-English sites as such, only international sites that follow the two-letter country-code convention.

CC-

-----Original Message-----
From: Jay Pound [mailto:[EMAIL PROTECTED]
Sent: Monday, August 08, 2005 2:37 PM
To: [email protected]; [email protected]
Subject: regex-url filter

I would like confirmation from someone that this will work. I've edited the regex filter in hopes of weeding out non-English sites from my search results. I'll be testing pruning on my current 40mil index to see if it works there. Maybe there is a way to set the search to return only English results, but I'm trying it this way for now. Is this the right way to add just extensions without sites? I'll try it soon, but I didn't want to waste my time if it's not correct!

Thanks,
-Jay Pound

# The default url filter.
# Better for whole-internet crawling.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$

# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]

# accept US only sites
+^http://([a-z0-9]*\.)*.com/
+^http://([a-z0-9]*\.)*.org/
+^http://([a-z0-9]*\.)*.edu/
+^http://([a-z0-9]*\.)*.net/
+^http://([a-z0-9]*\.)*.mil/
+^http://([a-z0-9]*\.)*.us/
+^http://([a-z0-9]*\.)*.info/
+^http://([a-z0-9]*\.)*.cc/
+^http://([a-z0-9]*\.)*.biz/
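
One way to check rules like these before running them against a large index is to emulate the "first matching pattern wins" behaviour described in the file's comments. The sketch below is a rough Python approximation, not the crawler's own code. It assumes patterns are applied with search (not full-match) semantics. The `accepts` helper and the sample URLs are made up for illustration, and the rule containing `[EMAIL PROTECTED]` is left out because its pattern is not recoverable from the message. The per-TLD lines in the quoted file put an unescaped `.` before the TLD (`*.com/`), so the sketch also compiles one of them as written to compare against a single combined rule without that dot.

```python
import re

# Rules in file order as (sign, pattern): '+' accepts, '-' rejects.
# The combined accept rule drops the unescaped dot that precedes the
# TLD in the quoted per-TLD lines; that is an assumption about intent.
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt"
          r"|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$"),
    ("+", r"^http://([a-z0-9]*\.)*(com|org|net|biz|edu|mil|us|info|cc)/"),
]
COMPILED = [(sign, re.compile(pattern)) for sign, pattern in RULES]

# One of the quoted per-TLD patterns, copied as written, for comparison.
QUOTED_COM = re.compile(r"^http://([a-z0-9]*\.)*.com/")


def accepts(url):
    """Return True if the first rule that matches the URL is a '+' rule."""
    for sign, pattern in COMPILED:
        if pattern.search(url):
            return sign == "+"
    return False  # no rule matched, so the URL is ignored


if __name__ == "__main__":
    samples = [
        "http://www.example.com/",
        "http://www.example.de/",
        "http://www.example.com/logo.gif",
        "ftp://example.com/",
    ]
    for url in samples:
        print(url, accepts(url), bool(QUOTED_COM.search(url)))
```

In this emulation the quoted `*.com/` form does not match `http://www.example.com/`, because the extra `.` has to consume one character right before `com`. It does match a host such as `http://xcom/`. Note also that `[a-z0-9]*` excludes hyphens and uppercase letters, so a host like `my-site.com` is rejected by either form.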
