Hi,

I cannot get Nutch to obey my settings for what to fetch from the internet.

In both regex-urlfilter.txt and crawl-urlfilter.txt I have this line:

+^http://([a-z0-9]*\.)*.mk/

With this I hope to fetch all (and only) pages from the .mk domain.
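
One way to check the pattern on its own, outside Nutch, is to run a few URLs through it. This is only a rough sketch: grep -E stands in for the Java regex engine Nutch uses (the two should agree for a pattern this simple), the leading "+" is dropped because it is the filter's include marker rather than part of the regex, and the sample URLs and the matches helper are made up for illustration.

```shell
#!/bin/sh
# The regex from the filter line, without the leading "+" include marker.
pattern='^http://([a-z0-9]*\.)*.mk/'

# Hypothetical helper: succeeds if the URL matches the pattern.
matches() {
  printf '%s\n' "$1" | grep -Eq "$pattern"
}

# Made-up sample URLs; replace with URLs from your own crawl.
for url in http://www.example.mk/ http://www.example.com/ http://www.xmk/; do
  if matches "$url"; then
    echo "accept $url"
  else
    echo "reject $url"
  fi
done
```

If a URL is rejected here but still shows up in the fetcher output, the regex itself is probably not where the URL is getting through.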

However, when I try to run my fetch.sh script:
--------
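# Generate a new segment (fetchlist) from the web db
bin/nutch generate db segments
# Pick the most recently created segment
s3=$(ls -d segments/2* | tail -1)
echo "$s3"

# Fetch the segment, then update the db with the results
bin/nutch fetch "$s3"
bin/nutch updatedb db "$s3"
bin/nutch analyze db 2
bin/nutch index "$s3"
bin/nutch dedup segments dedup.tmp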
--------

I can see the fetcher fetching pages from .com domains as well.

How can I limit what gets fetched? Am I missing something obvious (did I do something wrong in the fetch script)?

Emilijan
