Hello,
Ok, it works fine.
Thanks
**********
Alvaro Cabrerizo:
Try writing your excluding patterns before your accepting pattern. If I'm not
wrong, Nutch follows the order of the patterns, so it checks
+^http://toto.web-site.net first and accepts every URL on that host,
including all the URLs you wanted to skip with
-^http://toto.web-site.net/de/([a-z0-9]*) and the other exclusions.
Then your crawl-urlfilter.txt or regex-urlfilter.txt should look like this:
...
# URL to exclude for indexing
-^http://toto.web-site.net/de/([a-z0-9]*)
-^http://toto.web-site.net/en/([a-z0-9]*)
-^http://toto.web-site.net/fr/mv/([a-z0-9]*)
# Website hostname for indexing
+^http://toto.web-site.net
# skip everything else
-.
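To see why the ordering matters, here is a minimal sketch of the
first-match-wins rule the filter files follow (plain Java written just for
this message, not Nutch's actual RegexURLFilter code; the class name and
sample URLs are made up): the first pattern whose regex matches a URL decides
whether it is kept ('+') or dropped ('-').

import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Illustration of first-match-wins filtering, not Nutch's real implementation.
public class FirstMatchSketch {

    // Returns true if the URL is accepted, false if it is skipped.
    static boolean accepts(String url, List<String> rules) {
        for (String rule : rules) {
            char sign = rule.charAt(0);                      // '+' or '-'
            Pattern p = Pattern.compile(rule.substring(1));  // the regex part
            if (p.matcher(url).find()) {
                return sign == '+';    // the FIRST matching pattern decides
            }
        }
        return false;                  // no pattern matched -> URL is ignored
    }

    public static void main(String[] args) {
        // Exclusions listed before the accepting host pattern, as above.
        List<String> rules = Arrays.asList(
            "-^http://toto.web-site.net/de/([a-z0-9]*)",
            "-^http://toto.web-site.net/en/([a-z0-9]*)",
            "-^http://toto.web-site.net/fr/mv/([a-z0-9]*)",
            "+^http://toto.web-site.net",
            "-.");
        System.out.println(accepts("http://toto.web-site.net/fr/index.html", rules)); // true
        System.out.println(accepts("http://toto.web-site.net/de/page1.html", rules)); // false
    }
}

With the + host pattern first (as in the original file quoted below), the same
/de/ URL would be accepted before the exclusion line is ever consulted.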
Hope it helps.
2007/1/12, yleny @ ifrance. com :
>
> Hello,
>
> I want to exclude some subdirectories of a website from indexing,
> but I have not found the right parameters.
> I use Nutch-0.7.2 because it is impossible
> for me to index with Nutch-0.8.1 (it crashes).
>
> I want to exclude in my website the subdirectories :
> /de/*
> /en/*
> /fr/mv/*
>
> I tried the lines
> -^http://toto.web-site.net/de/([a-z0-9]*)
> and
> -^http://toto.web-site.net/de/*
> in my crawl-urlfilter.txt file, but
> they don't work: Nutch still indexes these URLs, which I don't want.
> Any idea?
>
> I have the default regex-urlfilter.txt
> and my personal crawl-urlfilter.txt is:
>
> # The url filter file used by the crawl command.
>
> # Better for intranet crawling.
> # Be sure to change MY.DOMAIN.NAME to your domain name.
>
> # Each non-comment, non-blank line contains a regular expression
> # prefixed by '+' or '-'. The first matching pattern in the file
> # determines whether a URL is included or ignored. If no pattern
> # matches, the URL is ignored.
>
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
>
> -.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
> # accept hosts in MY.DOMAIN.NAME
> #+^http://([a-z0-9]*.)*MY.DOMAIN.NAME/
>
> # Website hostname for indexing
> +^http://toto.web-site.net
>
> # URL to exclude for indexing
> -^http://toto.web-site.net/de/([a-z0-9]*)
> -^http://toto.web-site.net/en/([a-z0-9]*)
> -^http://toto.web-site.net/fr/mv/([a-z0-9]*)
>
> # skip everything else
> -.
>
>
> *********** my default regex-urlfilter.txt file is **************
>
> # The default url filter.
> # Better for whole-internet crawling.
>
> # Each non-comment, non-blank line contains a regular expression
> # prefixed by '+' or '-'. The first matching pattern in the file
> # determines whether a URL is included or ignored. If no pattern
> # matches, the URL is ignored.
>
> # skip file: ftp: and mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
>
> -.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
> # accept anything else
> +.