We are using Nutch version nutch-2008-07-22_04-01-29. We have a crawldb with over 1 million URLs, and we need to remove (filter out) 17000 of them.
We have created a new Nutch configuration directory. The only difference between this configuration directory and the normal one is the automaton-urlfilter.txt, crawl-urlfilter.txt, and regex-urlfilter.txt files. We have added the URLs we would like removed, listed before the normal patterns in the URL filter files.

Here is how we list the URLs to be removed (in the regex-urlfilter.txt file):

-^http://www.domain.com/path1/path2/file1$
-^http://www.domain.com/path1/path2/file2$

The normal patterns are listed as follows:

+^http://www.domain.com/path3/
+^http://www.domain.com/.*fileending1$

We run the CrawlDbMerger command as follows:

bin/nutch mergedb /full/path/newcrawldb /full/path/crawldb -filter

I modified the log4j.properties entry for CrawlDbMerger as follows:

log4j.logger.org.apache.nutch.crawl.CrawlDbMerger=DEBUG,cmdstdout

The job takes less than a couple of minutes to run and does not output any debug statements. When I run bin/nutch readdb <crawldb> -stats against both the original crawldb (/full/path/crawldb) and the new crawldb (/full/path/newcrawldb), the stats are identical. It appears the merge is doing a straight copy with no filtering.

I will continue trying different things and will post again when I determine the problem. I am hoping it is just something stupid I am doing. Please let me know if there is anything specific I should be looking at first.

Thanks in advance for any guidance or ideas.

JohnM

--
john mendenhall
[EMAIL PROTECTED]
surf utopia
internet services
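For reference, below is a minimal, self-contained sketch of how I understand the +/- patterns above to be evaluated, first match wins. This is my own approximation and not Nutch's actual RegexURLFilter code; the rule parsing and the find()-based matching are assumptions on my part.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Sketch of first-match-wins regex URL filtering, in the spirit of
// regex-urlfilter.txt: '+' lines accept a URL, '-' lines reject it,
// and a URL matching no rule is dropped. Not the real Nutch code.
public class RegexFilterSketch {

    private static class Rule {
        final boolean accept;      // '+' rule accepts, '-' rule rejects
        final Pattern pattern;
        Rule(boolean accept, String regex) {
            this.accept = accept;
            this.pattern = Pattern.compile(regex);
        }
    }

    private final List<Rule> rules = new ArrayList<Rule>();

    // Add one filter-file line, e.g. "-^http://www.domain.com/path1/path2/file1$"
    public void addRule(String line) {
        char sign = line.charAt(0);
        rules.add(new Rule(sign == '+', line.substring(1)));
    }

    // Returns the URL if the first matching rule accepts it,
    // or null if it is rejected or matches no rule at all.
    public String filter(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept ? url : null;   // first match wins
            }
        }
        return null;   // no rule matched: URL is filtered out
    }

    public static void main(String[] args) {
        RegexFilterSketch f = new RegexFilterSketch();
        // exclusions first, then the normal accept patterns, as in the files above
        f.addRule("-^http://www.domain.com/path1/path2/file1$");
        f.addRule("+^http://www.domain.com/path3/");
        System.out.println(f.filter("http://www.domain.com/path1/path2/file1")); // null (removed)
        System.out.println(f.filter("http://www.domain.com/path3/page.html"));   // kept
    }
}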
