Hi Marc

I suppose you should focus on crawl-urlfilter.txt rather than
regex-urlfilter.txt when you run the "crawl" command.
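
As far as I remember, the one-step "crawl" tool loads conf/crawl-tool.xml, which
points the regex URL filter at crawl-urlfilter.txt (via the urlfilter.regex.file
property), so edits to regex-urlfilter.txt are ignored in that mode; the file and
property names here are from my memory of the default Nutch configuration, so
please double-check them. A minimal crawl-urlfilter.txt for your site could look
like the sketch below, which just reuses the linux62.org pattern from your own
file:

# skip file:, ftp: and mailto: urls
-^(file|ftp|mailto):

# accept hosts in linux62.org
+^http://([a-z0-9]*\.)*linux62.org/

# reject everything else
-.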

Regards
/Jack

On 5/11/05, Marc DELERUE <[EMAIL PROTECTED]> wrote:
> The default url filter.
> # Better for whole-internet crawling.
> 
> # Each non-comment, non-blank line contains a regular expression
> # prefixed by '+' or '-'.  The first matching pattern in the file
> # determines whether a URL is included or ignored.  If no pattern
> # matches, the URL is ignored.
> 
> # skip file: ftp: and mailto: urls
> -^(file|ftp|mailto):
> 
> # skip image and other suffixes we can't yet parse
> -\.(ico|ICO|css|sit|eps|wmf|zip|mpg|gz|rpm|tgz|mov|MOV|exe|gif|GIF|JPEG|jpeg|jpg|JPG)$
> +\.(pdf|rtf|xls|doc|txt|htm|html)$
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
> 
> # accept anything else
> +^http://([a-z0-9]*\.)*linux62.org/
> -.
> 
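To illustrate the first-match rule described in the comments above, here is how
I read a few sample URLs (the example URLs are my own) against your rules:

http://www.linux62.org/logo.gif    rejected by the image-suffix line
http://www.linux62.org/page.html   accepted by the +\.(pdf|rtf|...|html)$ line
http://example.com/                matches nothing earlier, rejected by the final "-."
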
> It finds regex-urlfilter.txt when it sets up the crawl.
> 
> If you have any ideas...
> 
> Regards
> 
> Marc
> -----Original Message-----
> From: Matthias Jaekle [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, 11 May 2005 10:27
> To: [email protected]
> Subject: Re: url filters
> 
> Hi,
> have you chosen the regex-urlfilter in the conf file?
> Could you please post your whole regex file?
> Matthias
> --
> http://www.eventax.com - eventax GmbH
> http://www.umkreisfinder.de - The search engine for local content and events
>
