RE: url filters

Marc DELERUE Wed, 11 May 2005 01:36:31 -0700

# The default url filter.
# Better for whole-internet crawling.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.


# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
.\(ico|ICO|css|sit|eps|wmf|zip|mpg|gz|rpm|tgz|mov|MOV|exe|gif|GIF|JPEG|jpeg|jpg|JPG)$
+\.(pdf|rtf|xls|doc|txt|htm|html)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# accept anything else
+^http://([a-z0-9]*\.)*linux62.org/
-.
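
To make the first-match rule described in the file's comments concrete, here is a minimal Python sketch. It is not Nutch's own code: the helper names parse_rules and accepts are hypothetical, the rule list is only a subset of the file above, and it assumes each pattern is searched for anywhere in the URL and that a line without a leading '+' or '-' is rejected.

```python
import re

def parse_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blank lines and '#' comments."""
    rules = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            # Assumption: a rule without a '+' or '-' prefix is treated as an error.
            raise ValueError(f"rule must start with '+' or '-': {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """The first rule whose regex is found in the URL decides; no match means ignore."""
    for include, regex in rules:
        if regex.search(url):
            return include
    return False

# A subset of the rules posted above, in the same order.
RULES_TEXT = r"""
-^(file|ftp|mailto):
+\.(pdf|rtf|xls|doc|txt|htm|html)$
+^http://([a-z0-9]*\.)*linux62.org/
-.
"""

RULES = parse_rules(RULES_TEXT)

if __name__ == "__main__":
    for url in ("http://www.linux62.org/",
                "http://example.com/index.html",
                "http://example.com/",
                "ftp://linux62.org/a.html"):
        print(url, accepts(RULES, url))
```

Under these assumptions, order matters: a URL ending in .html on any host is accepted by the suffix rule before the linux62.org rule is ever consulted, and only URLs that reach the final "-." line are dropped.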

It finds regex-urlfilter.txt when it sets up the crawl.

If you have any ideas...

Regards,

Marc
-----Original Message-----
From: Matthias Jaekle [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, 11 May 2005 10:27
To: [email protected]
Subject: Re: url filters

Hi,
Have you chosen the regex-urlfilter in the conf file?
Could you please post your whole regex file?
Matthias
-- 
http://www.eventax.com - eventax GmbH
http://www.umkreisfinder.de - Die Suchmaschine für Lokales und Events
