Solved.

1. The regex-urlfilter.txt file is loaded by default when fetching.
2. This regular expression worked for me (it matches everything under *.mk and *.mk/*):


+^http://([a-z0-9]*\.)+mk

3. The regex can be tested without running the fetcher:

cat file-with-test-urls | nutch net/nutch/net/RegexURLFilter
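
For a quick check outside Nutch, the two patterns from this thread can also be compared in a small Python sketch. It assumes the filter applies each pattern as an unanchored search, drops the leading "+" (the include marker in the filter file), and uses made-up example URLs. Python's re module is not the engine Nutch uses, but patterns this simple should behave the same way.

```python
import re

# Patterns copied from this thread, minus the leading "+" include marker.
OLD = re.compile(r"^http://([a-z0-9]*\.)*.mk/")   # original attempt
NEW = re.compile(r"^http://([a-z0-9]*\.)+mk")     # pattern that worked

def accepted(pattern, url):
    """Return True if the pattern is found in the URL (search, not full match)."""
    return pattern.search(url) is not None

# Hypothetical test URLs, chosen only for illustration.
SAMPLE_URLS = [
    "http://www.example.mk/",
    "http://example.mk",
    "http://example.mk/index.html",
    "http://www.example.com/",
]

if __name__ == "__main__":
    for url in SAMPLE_URLS:
        print(f"{url:32} old={accepted(OLD, url)} new={accepted(NEW, url)}")
```

In this sketch the original pattern rejects even http://www.example.mk/, because the escaped dot inside the group consumes the dot before "mk" and the unescaped "." then needs one more character. The new pattern accepts the .mk URLs and rejects the .com one. Note that nothing anchors the end of "mk", so a host such as example.mkt.com would also be accepted.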

Cheers,
Emilijan


Byron Miller wrote:

Did you make sure to include the filters in your plugin settings
(conf/nutch-site.xml)?

I must admit, I haven't checked whether the plugin is used during the
fetch or only when you update the DB (or both).

-----Original Message-----
From: EM <[EMAIL PROTECTED]>
To: [email protected]
Date: Thu, 14 Apr 2005 12:35:09 -0400
Subject: How can I limit my fetching process?



Hi,

I cannot make Nutch obey my preferences about what to fetch from the internet.

In the regex-urlfilter.txt and crawl-urlfilter.txt I have a line
stating:

+^http://([a-z0-9]*\.)*.mk/

With it I hope to fetch all (and only) pages from the .mk domain.

However, when I try to run my fetch.sh script:
--------
bin/nutch generate db segments
s3=`ls -d segments/2* | tail -1`
echo $s3

bin/nutch fetch $s3
bin/nutch updatedb db $s3
bin/nutch analyze db 2
bin/nutch index $s3
bin/nutch dedup segments dedup.tmp
--------

I can see the fetcher returning .com domains also.

How can I limit my fetching process? Am I missing the obvious (did I do something wrong in the fetch script)?

Emilijan







