Re: nutch mergedb filter does not appear to be filtering

John Mendenhall Tue, 14 Oct 2008 15:28:42 -0700

> We are using nutch version nutch-2008-07-22_04-01-29.
> We have a crawldb with over 1 million urls.
> We need to remove (filter) 17000 urls.
> 
> We have created a new nutch configuration directory.
> The only difference between this configuration directory
> and the normal one is the automaton-urlfilter.txt,
> crawl-urlfilter.txt, and regex-urlfilter.txt files.
> 
> We have added the urls we would like removed listed
> before the normal patterns in the url filter files.
> 
> Here is how we list the urls to be removed (in the
> regex-urlfilter.txt file):
> 
>   -^http://www.domain.com/path1/path2/file1$
>   -^http://www.domain.com/path1/path2/file2$
> 
> The normal patterns are listed as follows:
> 
>   +^http://www.domain.com/path3/
>   +^http://www.domain.com/.*fileending1$
> 
> We run CrawlDbMerger command as follows:
> 
>   bin/nutch mergedb /full/patch/newcrawldb /full/patch/crawldb -filter
> 
> I modified the log4j.properties file entry for CrawlDbMerger as follows:
> 
>   log4j.logger.org.apache.nutch.crawl.CrawlDbMerger=DEBUG,cmdstdout
> 
> It takes less than a couple minutes to run.  It does not output any debug
> statements.  
> 
> When I run bin/nutch readdb <crawldb> -stats for the original
> crawldb (/full/patch/crawldb) and the new crawldb (/full/patch/newcrawldb),
> the stats for both crawldbs are the same.
> 
> It appears it is doing a copy with no filtering.
> 
> I will continue trying different things.  I will post when I determine
> the problem.  I am hoping it is just something stupid I am doing.
> 
> Please let me know if there is anything specific I should be looking
> at first.  Thanks in advance for any guidance or ideas provided.
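
For reference, the regex-urlfilter.txt rules quoted above are order-sensitive: the first matching pattern decides whether a URL is kept, and a URL that matches no pattern is dropped. A minimal shell sketch of that first-match logic follows. The function name url_passes is hypothetical, and grep -E only approximates the Java regex syntax nutch uses, so treat this as a way to sanity-check simple rules rather than a replica of the filter plugin.

```shell
# url_passes RULES_FILE URL
# Exit status 0 if the URL is accepted, 1 if it is rejected.
# Assumption: one rule per line, starting in column one with + or -.
url_passes() {
  rules=$1
  url=$2
  while IFS= read -r line || [ -n "$line" ]; do
    case $line in
      ''|'#'*) continue ;;      # skip blank lines and comments
    esac
    sign=${line%"${line#?}"}    # first character of the rule
    pattern=${line#?}           # the rest of the line is the regex
    if printf '%s\n' "$url" | grep -Eq -- "$pattern"; then
      # The first matching rule decides the outcome.
      case $sign in
        +) return 0 ;;
        *) return 1 ;;
      esac
    fi
  done < "$rules"
  return 1                      # no rule matched: rejected
}

# Example call (path is a placeholder):
#   url_passes conf/regex-urlfilter.txt 'http://www.domain.com/path3/page' && echo kept
```

Because the first match wins, the "-" lines for the urls to be removed have to appear before any "+" line that would also match them, which is how the file above is laid out.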


I found the problem, but I do not know how to fix it.

With the new nutch configuration directory, we set the env
var NUTCH_CONF_DIR.  I know the nutch script is getting this
value: I put in some debug statements, and I can see it added
properly to CLASSPATH.  I also set HADOOP_CONF_DIR, but that
does not have any effect either.

I am checking the access times on the regex-urlfilter.txt file.
The new regex-urlfilter.txt is not accessed.  The process is
only accessing the regex-urlfilter.txt file in the $NUTCH_HOME/conf
directory.  It does not appear to be using the NUTCH_CONF_DIR.
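
One way to narrow this down is to walk the CLASSPATH in order and see which copy of the file comes first. The sketch below assumes the filter file is loaded as a classpath resource and that the first match in classpath order is the one used. The function name first_on_classpath is hypothetical, and the check only looks at plain directories, so a copy packaged inside a jar or job file would not be seen by it.

```shell
# first_on_classpath FILE CLASSPATH
# Print the first FILE found in the colon-separated CLASSPATH directories.
# Exit status 1 if no directory entry contains FILE.
# The body runs in a subshell so the IFS change stays local.
first_on_classpath() (
  file=$1
  IFS=:
  for entry in $2; do
    if [ -f "$entry/$file" ]; then
      printf '%s\n' "$entry/$file"
      exit 0
    fi
  done
  exit 1
)

# Example call, using the CLASSPATH value printed by the debug statements:
#   first_on_classpath regex-urlfilter.txt "$CLASSPATH"
```

If this prints the copy under $NUTCH_HOME/conf, the ordering of the CLASSPATH entries is the likely cause. If it prints the copy under NUTCH_CONF_DIR, the file being read is probably coming from somewhere this check cannot see.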

Does anyone have any thoughts or ideas for what we can do to
get this to work with the NUTCH_CONF_DIR?  Thank you in
advance for any pointers.

JohnM

-- 
john mendenhall
[EMAIL PROTECTED]
surf utopia
internet services
