We are using Nutch version nutch-2008-07-22_04-01-29. We have a crawldb with over 1 million URLs, and we need to remove (filter out) 17000 of them.
We have created a new Nutch configuration directory. The only difference between this configuration directory and the normal one is the automaton-urlfilter.txt, crawl-urlfilter.txt, and regex-urlfilter.txt files. We have added the URLs we would like removed, listed before the normal patterns in the URL filter files.

Here is how we list the URLs to be removed (in the regex-urlfilter.txt file):

-^http://www.domain.com/path1/path2/file1$
-^http://www.domain.com/path1/path2/file2$

The normal patterns are listed as follows:

+^http://www.domain.com/path3/
+^http://www.domain.com/.*fileending1$

We run the CrawlDbMerger command as follows:

bin/nutch mergedb /full/path/newcrawldb /full/path/crawldb -filter

I modified the log4j.properties entry for CrawlDbMerger as follows:

log4j.logger.org.apache.nutch.crawl.CrawlDbMerger=DEBUG,cmdstdout

The job takes less than a couple of minutes to run and does not output any debug statements. When I run bin/nutch readdb <crawldb> -stats against both the original crawldb (/full/path/crawldb) and the new crawldb (/full/path/newcrawldb), the stats are identical. It appears the merge is doing a straight copy with no filtering.

I will continue trying different things and will post again when I determine the problem. I am hoping it is just something stupid I am doing. Please let me know if there is anything specific I should be looking at first.

Thanks in advance for any guidance or ideas.

JohnM

--
john mendenhall
[EMAIL PROTECTED]
surf utopia
internet services
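For reference, below is a minimal, self-contained sketch of how I understand the +/- patterns above to be evaluated, first match wins. This is my own approximation and not Nutch's actual RegexURLFilter code; the rule parsing and the find()-based matching are assumptions on my part.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Sketch of first-match-wins regex URL filtering, in the spirit of
// regex-urlfilter.txt: '+' lines accept a URL, '-' lines reject it,
// and a URL matching no rule is dropped. Not the real Nutch code.
public class RegexFilterSketch {

    private static class Rule {
        final boolean accept;      // '+' rule accepts, '-' rule rejects
        final Pattern pattern;
        Rule(boolean accept, String regex) {
            this.accept = accept;
            this.pattern = Pattern.compile(regex);
        }
    }

    private final List<Rule> rules = new ArrayList<Rule>();

    // Add one filter-file line, e.g. "-^http://www.domain.com/path1/path2/file1$"
    public void addRule(String line) {
        char sign = line.charAt(0);
        rules.add(new Rule(sign == '+', line.substring(1)));
    }

    // Returns the URL if the first matching rule accepts it,
    // or null if it is rejected or matches no rule at all.
    public String filter(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept ? url : null;   // first match wins
            }
        }
        return null;   // no rule matched: URL is filtered out
    }

    public static void main(String[] args) {
        RegexFilterSketch f = new RegexFilterSketch();
        // exclusions first, then the normal accept patterns, as in the files above
        f.addRule("-^http://www.domain.com/path1/path2/file1$");
        f.addRule("+^http://www.domain.com/path3/");
        System.out.println(f.filter("http://www.domain.com/path1/path2/file1")); // null (removed)
        System.out.println(f.filter("http://www.domain.com/path3/page.html"));   // kept
    }
}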
