Solved.
1. The regex-urlfilter.txt file is loaded by default when fetching
2. This regex worked for me (it selects everything in *.mk and *.mk/*):
+^http://([a-z0-9]*\.)+mk
3. The regex can be tested without running the fetcher:
cat file-with-test-urls | nutch net/nutch/net/RegexURLFilter
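
As a rough cross-check outside Nutch, the same pattern can be tried with grep. This is only a sketch: the sample URLs are made up, the leading "+" is dropped because it marks an accept rule in the filter file rather than being part of the regex, and grep -E is assumed to treat this particular pattern the same way the Nutch filter does.

```shell
# Pattern from item 2, without the leading "+" accept marker.
pattern='^http://([a-z0-9]*\.)+mk'

# Hypothetical test URLs; only the .mk hosts should be printed.
matches=$(printf '%s\n' \
  'http://www.example.mk/' \
  'http://example.mk/index.html' \
  'http://www.example.com/' \
  | grep -E "$pattern")

echo "$matches"
```

Under grep -E the pattern is not anchored after "mk", so a host such as www.mkexample.com would also pass. If that matters, the tail of the pattern may need tightening.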
Cheers, Emilijan
Byron Miller wrote:
Did you make sure to include the filters in your plugin settings (conf/nutch-site.xml)?
I must admit, I haven't checked whether the plugin is used during the fetch, only when you update the DB, or both.
-----Original Message-----
From: EM <[EMAIL PROTECTED]>
To: [email protected]
Date: Thu, 14 Apr 2005 12:35:09 -0400
Subject: How can I limit my fetching process?
Hi,
I cannot make Nutch obey my preferences about what to fetch from the internet.
In the regex-urlfilter.txt and crawl-urlfilter.txt I have a line stating:
+^http://([a-z0-9]*\.)*.mk/
With this I hope to fetch all (and only) pages from the .mk domain.
However, when I try to run my fetch.sh script:
--------
bin/nutch generate db segments
s3=`ls -d segments/2* | tail -1`
echo $s3
bin/nutch fetch $s3
bin/nutch updatedb db $s3
bin/nutch analyze db 2
bin/nutch index $s3
bin/nutch dedup segments dedup.tmp
--------
I can see the fetcher returning pages from .com domains as well.
How can I limit my fetching process? Am I missing the obvious (did I do something wrong in the fetch script)?
Emilijan
