Is there any way to filter results to English at search time, so I can set up a multi-language search? I thought I saw somewhere that you could put something into the HTML form, a switch sent when submitting the form, that would use a plugin to filter the results. I know I have seen some benchmarks on a plugin made to do this.
-Jay Pound
----- Original Message -----
From: "Chirag Chaman" <[EMAIL PROTECTED]>
To: <[email protected]>; <[email protected]>
Sent: Monday, August 08, 2005 3:02 PM
Subject: RE: regex-url filter

> Here's a better way
>
> http://([a-z0-9]*\.)*.(com|org|net|biz|edu|biz|mil|us|info|cc)/
>
> FYI, this will not remove non-English sites -- but international sites that
> follow the two-letter convention.
>
> CC-
>
> -----Original Message-----
> From: Jay Pound [mailto:[EMAIL PROTECTED]
> Sent: Monday, August 08, 2005 2:37 PM
> To: [email protected]; [email protected]
> Subject: regex-url filter
>
> I would like a confirmation from someone that this will work, I've edited
> the regex filter in hopes to weed out non-english sites from my search
> results, I'll be testing pruning on my current 40mil index to see if it
> works there, or maybe there is a way to set the search to return only
> english results, but I'm trying it this way now, is this the right way to
> add just extensions without sites?
> I'll try it soon but just wanted to not waste my time if its not correct!!!
> Thanks,
> -Jay Pound
>
> # The default url filter.
> # Better for whole-internet crawling.
>
> # Each non-comment, non-blank line contains a regular expression
> # prefixed by '+' or '-'. The first matching pattern in the file
> # determines whether a URL is included or ignored. If no pattern
> # matches, the URL is ignored.
>
> # skip file: ftp: and mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
> -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$
>
> # skip URLs containing certain characters as probable queries, etc.
> [EMAIL PROTECTED]
>
> # accept US only sites
> +^http://([a-z0-9]*\.)*.com/
> +^http://([a-z0-9]*\.)*.org/
> +^http://([a-z0-9]*\.)*.edu/
> +^http://([a-z0-9]*\.)*.net/
> +^http://([a-z0-9]*\.)*.mil/
> +^http://([a-z0-9]*\.)*.us/
> +^http://([a-z0-9]*\.)*.info/
> +^http://([a-z0-9]*\.)*.cc/
> +^http://([a-z0-9]*\.)*.biz/
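
For anyone wanting to sanity-check rules like the ones quoted above before pruning a large index, here is a rough Python sketch of the "first matching pattern wins, no match means ignored" behaviour described in the file's comments. It is not the filter's actual code. The function name accepts and the RULES list are made up for illustration, and it assumes patterns are applied as an unanchored search. The accept rule is the single-line form Chirag suggested, with two changes that are my guess at the intent: the dot before the extension group is escaped (unescaped, it matches any character), and a leading ^ is added. The query-character rule is left out because it is garbled in the archive. As Chirag notes, this only filters by domain extension, not by language.

```python
import re

# Hypothetical rule list in the same spirit as the quoted filter file:
# each entry is (sign, pattern); '+' means include, '-' means ignore.
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("-", r"\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg|xls"
          r"|gz|rpm|tgz|mov|MOV|exe)$"),
    # Assumption: one combined rule with the dot escaped and an optional port.
    ("+", r"^http://([a-z0-9-]+\.)+(com|org|net|edu|mil|us|info|cc|biz)"
          r"(:[0-9]+)?/"),
]

COMPILED = [(sign == "+", re.compile(pattern)) for sign, pattern in RULES]


def accepts(url):
    """Return True if the first matching rule is a '+' rule.

    A URL that matches no rule at all is ignored, as the file comments say.
    """
    for keep, pattern in COMPILED:
        if pattern.search(url):  # assumption: unanchored search semantics
            return keep
    return False


if __name__ == "__main__":
    for candidate in (
        "http://www.example.com/",
        "http://www.example.de/",
        "http://www.example.com/logo.gif",
        "ftp://example.com/",
    ):
        print(candidate, accepts(candidate))
```

Because the skip rules come first, a .com URL ending in .gif is still dropped, which is why the order of lines in the file matters.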
