Try to write your excluding patterns before your accepting patterns. If I'm
not wrong, Nutch follows the order of the patterns: the first matching
pattern wins. So if +^http://toto.web-site.net comes first, it accepts every
URL on that host, and your -^http://toto.web-site.net/de/... exclusions are
never reached. Put the exclusions first.

Then your crawl-urlfilter.txt or regex-urlfilter.txt should look like this:

...
# URL to exclude for indexing
-^http://toto.web-site.net/de/([a-z0-9]*)
-^http://toto.web-site.net/en/([a-z0-9]*)
-^http://toto.web-site.net/fr/mv/([a-z0-9]*)

# Website hostname for indexing
+^http://toto.web-site.net

# skip everything else
-.
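To see why the order matters, here is a minimal Python sketch of the
first-match-wins rule (this is not Nutch code; `accepts` is a hypothetical
helper that just mimics how crawl-urlfilter.txt patterns are applied):

```python
import re

def accepts(url, rules):
    """Apply (sign, pattern) rules in order; the first pattern that
    matches decides. '+' means accept, '-' means reject. If nothing
    matches, the URL is ignored, as in Nutch's filter files."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == '+'
    return False

# Exclusions first, then the host accept, then the catch-all reject
rules = [
    ('-', r'^http://toto\.web-site\.net/de/([a-z0-9]*)'),
    ('-', r'^http://toto\.web-site\.net/en/([a-z0-9]*)'),
    ('-', r'^http://toto\.web-site\.net/fr/mv/([a-z0-9]*)'),
    ('+', r'^http://toto\.web-site\.net'),
    ('-', r'.'),
]

print(accepts('http://toto.web-site.net/de/page1', rules))    # False: excluded
print(accepts('http://toto.web-site.net/index.html', rules))  # True: host accepted

# With the '+' host rule first, the exclusions are never reached:
wrong_order = [rules[3]] + rules[0:3] + [rules[4]]
print(accepts('http://toto.web-site.net/de/page1', wrong_order))  # True: wrongly accepted
```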


Hope it helps.

2007/1/12, yleny@ifrance.com <[EMAIL PROTECTED]>:

Hello,

I want to exclude some subdirectories of a website from indexing,
and I have not found the right parameters.
I use Nutch-0.7.2 because it is impossible
for me to index with Nutch-0.8.1 (it crashes).

I want to exclude these subdirectories of my website:
/de/*
/en/*
/fr/mv/*

I tried the lines
-^http://toto.web-site.net/de/([a-z0-9]*)
and
-^http://toto.web-site.net/de/*
in my crawl-urlfilter.txt file, but
they don't work: Nutch still indexes these URLs, which I don't want.
Any idea ?

I have the default regex-urlfilter.txt
and my personal crawl-urlfilter.txt is:

# The url filter file used by the crawl command.

# Better for intranet crawling.
# Be sure to change MY.DOMAIN.NAME to your domain name.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.

# skip file:, ftp:, & mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# accept hosts in MY.DOMAIN.NAME
#+^http://([a-z0-9]*.)*MY.DOMAIN.NAME/

# Website hostname for indexing
+^http://toto.web-site.net

# URL to exclude for indexing
-^http://toto.web-site.net/de/([a-z0-9]*)
-^http://toto.web-site.net/en/([a-z0-9]*)
-^http://toto.web-site.net/fr/mv/([a-z0-9]*)

# skip everything else
-.


*********** my default regex-urlfilter.txt file is **************

# The default url filter.
# Better for whole-internet crawling.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# accept anything else
+.
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
