Did you make sure to include the filters in your plugin settings
(conf/nutch-site.xml)?

I must admit, I haven't checked whether the plugin is used during the
fetch, only when you update the DB, or both.
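
Separately from the plugin question, it may be worth checking the
pattern itself. Below is a rough Python sketch (not Nutch code) that
tries the regex from your filter file against a few made-up URLs. It
assumes the filter behaves like an ordinary anchored regex search, which
I have not verified against the Nutch source. The STRICT variant, the
accepts() helper and the sample hostnames are my own illustrations.

```python
import re

# Pattern copied from the quoted message, minus the leading "+"
# (assumed here to be only the "include" marker of the filter file).
PATTERN = re.compile(r"^http://([a-z0-9]*\.)*.mk/")

# Hypothetical variant without the extra "." in front of "mk".
STRICT = re.compile(r"^http://([a-z0-9]*\.)*mk/")


def accepts(pattern, url):
    """Return True if the pattern is found in the URL (search semantics)."""
    return pattern.search(url) is not None


# Made-up sample URLs, for illustration only.
SAMPLES = [
    "http://www.example.mk/",
    "http://example.mk/",
    "http://www.xmk/",
    "http://www.example.com/",
]

for url in SAMPLES:
    print(url, "original:", accepts(PATTERN, url), "strict:", accepts(STRICT, url))
```

If that assumption holds, the unescaped "." in front of "mk" demands one
extra character before "mk", so an ordinary host such as www.example.mk
does not match the rule as written, while the variant without it does.
Neither pattern matches the .com sample, so this would not by itself
explain the .com pages you are seeing.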

-----Original Message-----
From: EM <[EMAIL PROTECTED]>
To: nutch-user@incubator.apache.org
Date: Thu, 14 Apr 2005 12:35:09 -0400
Subject: How can I limit my fetching process?

> Hi,
> 
> I cannot make Nutch obey my preferences about what to fetch from the
> internet.
> 
> In the regex-urlfilter.txt and crawl-urlfilter.txt I have a line
> stating:
> 
> +^http://([a-z0-9]*\.)*.mk/
> 
> With this I hope to get all (and only) pages from the .mk domain.
> 
> However, when I try to run my fetch.sh script:
> --------
> bin/nutch generate db segments
> s3=`ls -d segments/2* | tail -1`
> echo $s3
> 
> bin/nutch fetch $s3
> bin/nutch updatedb db $s3
> bin/nutch analyze db 2
> bin/nutch index $s3
> bin/nutch dedup segments dedup.tmp
> --------
> 
> I can see the fetcher returning .com domains also.
> 
> How can I limit my fetching process? Am I missing the obvious (did I do
> something wrong with the fetch script)?
> 
> Emilijan
> 
