Hello,
Ok, it works fine.
Thanks
**********
Alvaro Cabrerizo:
Try writing your excluding patterns before your accepting pattern. If I'm not
wrong, Nutch follows the order of the patterns, so it checks
+^http://toto.web-site.net first and accepts every URL on that host,
including all the URLs you wanted to skip with
-^http://toto.web-site.net/de/([a-z0-9]*) and the other exclusions.
Then your crawl-urlfilter.txt or regex-urlfilter.txt should look like this:
...
# URL to exclude for indexing
-^http://toto.web-site.net/de/([a-z0-9]*)
-^http://toto.web-site.net/en/([a-z0-9]*)
-^http://toto.web-site.net/fr/mv/([a-z0-9]*)
# Website hostname for indexing
+^http://toto.web-site.net
# skip everything else
-.
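To see why the ordering matters, here is a minimal sketch of the
first-match-wins rule the filter files follow (plain Java written just for
this message, not Nutch's actual RegexURLFilter code; the class name and
sample URLs are made up): the first pattern whose regex matches a URL decides
whether it is kept ('+') or dropped ('-').

import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Illustration of first-match-wins filtering, not Nutch's real implementation.
public class FirstMatchSketch {

    // Returns true if the URL is accepted, false if it is skipped.
    static boolean accepts(String url, List<String> rules) {
        for (String rule : rules) {
            char sign = rule.charAt(0);                      // '+' or '-'
            Pattern p = Pattern.compile(rule.substring(1));  // the regex part
            if (p.matcher(url).find()) {
                return sign == '+';    // the FIRST matching pattern decides
            }
        }
        return false;                  // no pattern matched -> URL is ignored
    }

    public static void main(String[] args) {
        // Exclusions listed before the accepting host pattern, as above.
        List<String> rules = Arrays.asList(
            "-^http://toto.web-site.net/de/([a-z0-9]*)",
            "-^http://toto.web-site.net/en/([a-z0-9]*)",
            "-^http://toto.web-site.net/fr/mv/([a-z0-9]*)",
            "+^http://toto.web-site.net",
            "-.");
        System.out.println(accepts("http://toto.web-site.net/fr/index.html", rules)); // true
        System.out.println(accepts("http://toto.web-site.net/de/page1.html", rules)); // false
    }
}

With the + host pattern first (as in the original file quoted below), the same
/de/ URL would be accepted before the exclusion line is ever consulted.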
Hope it helps.
2007/1/12, yleny @ ifrance. com :
>
> Hello,
>
> I want to exclude some subdirectories of a website from indexing,
> but I have not found the right parameters.
> I use Nutch-0.7.2 because it is impossible
> for me to index with Nutch-0.8.1 (it crashes).
>
> I want to exclude in my website the subdirectories :
> /de/*
> /en/*
> /fr/mv/*
>
> I tried the lines
> -^http://toto.web-site.net/de/([a-z0-9]*)
> and
> -^http://toto.web-site.net/de/*
> in my crawl-urlfilter.txt file, but
> they don't work: Nutch still indexes these URLs, which I don't want.
> Any idea?
>
> I have the default regex-urlfilter.txt
> and my personal crawl-urlfilter.txt is:
>
> # The url filter file used by the crawl command.
>
> # Better for intranet crawling.
> # Be sure to change MY.DOMAIN.NAME to your domain name.
>
> # Each non-comment, non-blank line contains a regular expression
> # prefixed by '+' or '-'. The first matching pattern in the file
> # determines whether a URL is included or ignored. If no pattern
> # matches, the URL is ignored.
>
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
>
> -.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|PNG)$
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
> # accept hosts in MY.DOMAIN.NAME
> #+^http://([a-z0-9]*.)*MY.DOMAIN.NAME/
>
> # Website hostname for indexing
> +^http://toto.web-site.net
>
> # URL to exclude for indexing
> -^http://toto.web-site.net/de/([a-z0-9]*)
> -^http://toto.web-site.net/en/([a-z0-9]*)
> -^http://toto.web-site.net/fr/mv/([a-z0-9]*)
>
> # skip everything else
> -.
>
>
> *********** my default regex-urlfilter.txt file is **************
>
> # The default url filter.
> # Better for whole-internet crawling.
>
> # Each non-comment, non-blank line contains a regular expression
> # prefixed by '+' or '-'. The first matching pattern in the file
> # determines whether a URL is included or ignored. If no pattern
> # matches, the URL is ignored.
>
> # skip file: ftp: and mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
>
> -.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
> # accept anything else
> +.