Is there any way to filter results to English at search time, so I can set up
a multi-language search? I thought I saw somewhere that you could put
something into the HTML form, a switch sent when submitting the form, that
would use a plugin to filter the results. I know I had seen some benchmarks
on a plugin made to do this.
-Jay Pound

----- Original Message ----- 
From: "Chirag Chaman" <[EMAIL PROTECTED]>
To: <[email protected]>; <[email protected]>
Sent: Monday, August 08, 2005 3:02 PM
Subject: RE: regex-url filter


> Here's a better way
>
> http://([a-z0-9]*\.)*.(com|org|net|biz|edu|biz|mil|us|info|cc)/
>
> FYI, this will not remove non-English sites -- but international sites that
> follow the two-letter convention.
>
> CC-
>
> -----Original Message-----
> From: Jay Pound [mailto:[EMAIL PROTECTED]
> Sent: Monday, August 08, 2005 2:37 PM
> To: [email protected]; [email protected]
> Subject: regex-url filter
>
> I would like confirmation from someone that this will work. I've edited
> the regex filter in hopes of weeding out non-English sites from my search
> results. I'll be testing pruning on my current 40mil index to see if it
> works there. Maybe there is a way to set the search to return only
> English results, but I'm trying it this way for now. Is this the right way
> to add just extensions without sites?
> I'll try it soon but just wanted to not waste my time if it's not correct!!!
> Thanks,
> -Jay Pound
> # The default url filter.
> # Better for whole-internet crawling.
>
> # Each non-comment, non-blank line contains a regular expression
> # prefixed by '+' or '-'. The first matching pattern in the file
> # determines whether a URL is included or ignored. If no pattern
> # matches, the URL is ignored.
>
> # skip file: ftp: and mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
> # accept US only sites
> +^http://([a-z0-9]*\.)*.com/
> +^http://([a-z0-9]*\.)*.org/
> +^http://([a-z0-9]*\.)*.edu/
> +^http://([a-z0-9]*\.)*.net/
> +^http://([a-z0-9]*\.)*.mil/
> +^http://([a-z0-9]*\.)*.us/
> +^http://([a-z0-9]*\.)*.info/
> +^http://([a-z0-9]*\.)*.cc/
> +^http://([a-z0-9]*\.)*.biz/
>
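
A quick way to sanity-check the patterns quoted above before pruning a large index is to run them against a few sample URLs. The sketch below uses Python's re module rather than the crawler's own regex engine, on the assumption that both treat these simple patterns the same way. The names POSTED, TIGHTENED and accepts are made up for this illustration, and TIGHTENED is only one possible rewrite, not something proposed in the thread.

```python
import re

# Pattern copied from the reply quoted above. The "." just before the
# TLD group is unescaped, so it consumes one arbitrary character.
POSTED = re.compile(
    r"http://([a-z0-9]*\.)*.(com|org|net|biz|edu|biz|mil|us|info|cc)/"
)

# Hypothetical tightened variant (an assumption, not from the thread):
# one or more dot-terminated host labels, then the TLD, then "/".
TIGHTENED = re.compile(
    r"^http://([a-z0-9-]+\.)+(com|org|net|biz|edu|mil|us|info|cc)/"
)


def accepts(pattern, url):
    """Return True if the pattern is found anywhere in the URL."""
    return pattern.search(url) is not None


if __name__ == "__main__":
    samples = [
        "http://www.example.com/",
        "http://www.example.xcom/",
        "http://www.example.de/",
        "http://www.example.co.uk/",
        "http://example.com.cn/page",
    ]
    for url in samples:
        print(url, accepts(POSTED, url), accepts(TIGHTENED, url))
```

With these samples, the posted pattern rejects http://www.example.com/ because the unescaped dot needs an extra character between the last label's dot and the TLD, while the tightened variant accepts it and rejects the country-code hosts. Either way this filters on top-level domain only, not on language, as noted in the reply above.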

