> We are using nutch version nutch-2008-07-22_04-01-29.
> We have a crawldb with over 1 million urls.
> We need to remove (filter) 17000 urls.
>
> We have created a new nutch configuration directory.
> The only difference between this configuration directory
> and the normal one is the automaton-urlfilter.txt,
> crawl-urlfilter.txt, and regex-urlfilter.txt files.
>
> We have added the urls we would like removed listed
> before the normal patterns in the url filter files.
>
> Here is how we list the urls to be removed (in the
> regex-urlfilter.txt file):
>
> -^http://www.domain.com/path1/path2/file1$
> -^http://www.domain.com/path1/path2/file2$
>
> The normal patterns are listed as follows:
>
> +^http://www.domain.com/path3/
> +^http://www.domain.com/.*fileending1$
>
> We run CrawlDbMerger command as follows:
>
> bin/nutch mergedb /full/patch/newcrawldb /full/patch/crawldb -filter
>
> I modified the log4j.properties file entry for CrawlDbMerger as follows:
>
> log4j.logger.org.apache.nutch.crawl.CrawlDbMerger=DEBUG,cmdstdout
>
> It takes less than a couple minutes to run. It does not output any debug
> statements.
>
> When I run bin/nutch readdb <crawldb> -stats for the original
> crawldb (/full/patch/crawldb) and the new crawldb (/full/patch/newcrawldb),
> the stats for both crawldbs are the same.
>
> It appears it is doing a copy with no filtering.
>
> I will continue trying different things. I will post when I determine
> the problem. I am hoping it is just something stupid I am doing.
>
> Please let me know if there is anything specific I should be looking
> at first. Thanks in advance for any guidance or ideas provided.

I found the problem, but I do not know how to fix it.

To use the new nutch configuration directory, we set the env var
NUTCH_CONF_DIR. I know the nutch script is getting this value: I put
in some debug statements, and I can see the directory added properly
to CLASSPATH. I also set HADOOP_CONF_DIR, but this does not have any
effect either.

I am checking the access times on the regex-urlfilter.txt files.
The new regex-urlfilter.txt is never accessed. The process only
accesses the regex-urlfilter.txt file in the $NUTCH_HOME/conf directory.
It does not appear to be using NUTCH_CONF_DIR.

Does anyone have any thoughts or ideas for what we can do to get
this to work with NUTCH_CONF_DIR?

Thank you in advance for any pointers.

JohnM

--
john mendenhall
[EMAIL PROTECTED]
surf utopia internet services
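
P.S. One possible check, in case it helps anyone reproduce this: as far
as I understand, the JVM loads a resource such as regex-urlfilter.txt
from the first classpath entry that contains it, so the order of the
entries may matter more than whether NUTCH_CONF_DIR is present at all.
The sketch below is only a rough diagnostic under that assumption. The
function name first_conf_hit is made up for illustration, and it only
looks at directory entries, not inside jar files.

```shell
#!/bin/sh
# Hypothetical helper: print the first directory on a colon-separated
# classpath that contains the named file.
# Assumption: the JVM resolves a classpath resource from the first
# entry that has it. Jar entries are skipped here, so a copy of the
# file packed inside a jar would not be detected by this check.
first_conf_hit() {
    classpath=$1
    resource=$2
    old_ifs=$IFS
    IFS=:
    # The word list is split once, here, so IFS can be restored
    # safely inside the loop body.
    for entry in $classpath; do
        IFS=$old_ifs
        if [ -d "$entry" ] && [ -f "$entry/$resource" ]; then
            printf '%s\n' "$entry"
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}

# Example use: paste in the CLASSPATH value printed by the debug
# statement added to bin/nutch, then run
#   first_conf_hit "$CLASSPATH_FROM_DEBUG" regex-urlfilter.txt
```

If this prints $NUTCH_HOME/conf rather than the NUTCH_CONF_DIR directory,
that would at least be consistent with the access times I am seeing.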
