RE: regex-url filter

Chirag Chaman Mon, 08 Aug 2005 12:06:17 -0700

Here's a better way

+^http://([a-z0-9]*\.)*(com|org|net|biz|edu|mil|us|info|cc)/


FYI, this will not remove every non-English site, only international sites
that follow the two-letter country-code convention.
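
A quick way to check a pattern like this before pruning a large index is to run it against a few sample URLs. The sketch below uses Python's re module rather than the Java regex engine Nutch runs on; the two agree for the syntax used here. The sample URLs and the names ACCEPT, DOTTED and accepted are made up for illustration. The ACCEPT pattern has no `.` between the host-label group and the TLD group. With an extra unescaped `.` in that position (the DOTTED pattern), one more arbitrary character is required before the TLD, so http://example.com/ does not match.

```python
import re

# Single accept rule, written without the leading '+' that the filter file uses.
ACCEPT = re.compile(r"^http://([a-z0-9]*\.)*(com|org|net|biz|edu|mil|us|info|cc)/")

# Same idea with an unescaped '.' before the TLD, for comparison.
DOTTED = re.compile(r"^http://([a-z0-9]*\.)*.com/")


def accepted(url):
    """Return True if the URL matches the accept pattern."""
    return ACCEPT.search(url) is not None


if __name__ == "__main__":
    samples = [
        "http://example.com/",
        "http://www.example.org/page.html",
        "http://example.de/",
        "http://www.example.co.uk/",
        "http://my-site.com/",  # hyphen is outside [a-z0-9], so this is rejected
    ]
    for url in samples:
        print(url, accepted(url))
    print("dotted variant on example.com:", DOTTED.search("http://example.com/") is not None)
```

Hosts containing a hyphen are rejected because the character class is [a-z0-9]; widening it to [a-z0-9-] would let them through.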

CC-
 
-----Original Message-----
From: Jay Pound [mailto:[EMAIL PROTECTED]]
Sent: Monday, August 08, 2005 2:37 PM
To: [email protected]; [email protected]
Subject: regex-url filter

I would like confirmation from someone that this will work. I've edited
the regex filter in the hope of weeding out non-English sites from my search
results. I'll be testing pruning on my current 40-million index to see if it
works there. Maybe there is a way to set the search to return only
English results, but I'm trying it this way for now. Is this the right way to
add just extensions without sites?
I'll try it soon, but I didn't want to waste my time if it's not correct.
Thanks,
-Jay Pound
# The default url filter.
# Better for whole-internet crawling.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$

# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]

# accept US only sites
+^http://([a-z0-9]*\.)*.com/
+^http://([a-z0-9]*\.)*.org/
+^http://([a-z0-9]*\.)*.edu/
+^http://([a-z0-9]*\.)*.net/
+^http://([a-z0-9]*\.)*.mil/
+^http://([a-z0-9]*\.)*.us/
+^http://([a-z0-9]*\.)*.info/
+^http://([a-z0-9]*\.)*.cc/
+^http://([a-z0-9]*\.)*.biz/
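
The first-match rule described in the header comments of the filter file can be sketched roughly as follows. This is a Python illustration, not Nutch's own code: the function names load_rules and url_allowed and the shortened rule set are hypothetical, and it assumes a pattern counts as matching when it is found anywhere in the URL.

```python
import re

# Shortened, hypothetical rule set in the same '+'/'-' format as the filter file.
RULES_TEXT = r"""
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
# skip a few suffixes
-\.(gif|jpg|zip|exe)$
# accept selected TLDs
+^http://([a-z0-9]*\.)*(com|org)/
"""


def load_rules(text):
    """Parse rule lines into (include, compiled_regex) pairs, in file order."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def url_allowed(rules, url):
    """First matching pattern decides; a URL matching no pattern is ignored."""
    for include, regex in rules:
        if regex.search(url):  # assumption: match anywhere in the URL
            return include
    return False


if __name__ == "__main__":
    rules = load_rules(RULES_TEXT)
    for url in ("http://example.com/index.html",
                "http://example.com/logo.gif",
                "ftp://example.com/",
                "http://example.de/"):
        print(url, url_allowed(rules, url))
```

Because the first match wins, the order of the lines matters: the exclusion rules must come before the accept rules, or a .com URL ending in .gif would be accepted before the suffix rule is ever consulted.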
