I am not configuring crawl-urlfilter.txt because I am not using the "bin/nutch crawl" tool. Instead I am calling "bin/nutch generate", fetch, updatedb, etc. from a script.
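For context, such a script is essentially the standard generate/fetch/updatedb cycle. A minimal sketch, assuming a Nutch 0.8-style layout with an already injected crawldb (the paths and the depth value here are illustrative, not from this thread):

  #!/bin/sh
  # Repeated generate/fetch/updatedb rounds; crawl/crawldb is assumed
  # to have been populated with "bin/nutch inject" beforehand.
  CRAWLDB=crawl/crawldb
  SEGMENTS=crawl/segments
  DEPTH=3

  i=0
  while [ $i -lt $DEPTH ]; do
    bin/nutch generate $CRAWLDB $SEGMENTS    # select URLs due for fetching
    SEGMENT=`ls -d $SEGMENTS/* | tail -1`    # newest segment just created
    bin/nutch fetch $SEGMENT                 # fetch the selected URLs
    bin/nutch updatedb $CRAWLDB $SEGMENT     # fold results back into crawldb
    i=`expr $i + 1`
  done

As far as I know, the URL filters (including regex-urlfilter.txt) are applied at least during the generate step, so editing the wrong filter file shows up exactly as unexpected URLs being fetched.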
In that case I should be configuring regex-urlfilter.txt instead of crawl-urlfilter.txt. Am I right?

On 5/30/07, Naess, Ronny <[EMAIL PROTECTED]> wrote:
If you are unsure of your regex you might want to try this regex applet:
http://jakarta.apache.org/regexp/applet.html

Also, I do all my filtering in crawl-urlfilter.txt. I guess you must too, unless you have configured crawl-tool.xml to use your other file:

  <property>
    <name>urlfilter.regex.file</name>
    <value>crawl-urlfilter.txt</value>
  </property>

-Ronny

-----Original Message-----
From: Manoharam Reddy [mailto:[EMAIL PROTECTED]]
Sent: 30 May 2007 13:42
To: [email protected]
Subject: I don't want to crawl internet sites

This is my regex-urlfilter.txt file:

  -^http://([a-z0-9]*\.)+
  +^http://([0-9]+\.)+
  +.

I want to allow only IP addresses and internal sites to be crawled and fetched. This means:

  http://www.google.com should be ignored
  http://shoppingcenter should be crawled
  http://192.168.101.5 should be crawled

But when I see the logs, I find that http://someone.blogspot.com/ has also been crawled. How is it possible? Is my regex-urlfilter.txt wrong? Are there other URL filters? If so, in what order are the filters called?
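For what it's worth, the regex filter evaluates rules from top to bottom and the first matching rule wins (+ means keep, - means drop), so rule order matters. Here is a sketch of a regex-urlfilter.txt that expresses the policy above; the patterns are illustrative rather than taken from the mail, and the IP rule is placed first so that dotted IP addresses are not caught by the reject rule:

  # Accept numeric IP hosts first, before any reject rule can match them.
  +^http://([0-9]{1,3}\.){3}[0-9]{1,3}

  # Reject any host that contains a dot, e.g. www.google.com or
  # someone.blogspot.com.
  -^http://[^/]*\.

  # Accept everything else, e.g. dot-less intranet hosts like
  # http://shoppingcenter.
  +.

Note also that whether this file is consulted at all depends on the urlfilter.regex.file property and on which URL filter plugins are enabled through plugin.includes, which is one way a URL can slip past an edited file.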
