From earlier messages here, it seems best to avoid the regular
expression filters and to use prefix-urlfilter instead.  (I had been
having issues with crawls stalling in the fetch phase - the maps would
finish, but the reduce never got past 16% or so).

I tried just listing the sites I cared about in prefix-urlfilter.txt like so:

http://example.com
http://anotherexample.com

I also completely removed all references to the regular expression
filters.  But that ends up fetching nothing:

rootUrlDir = /urls
threads = 1000
depth = 5
topN = 1000
Injector: starting
Injector: crawlDb: dipiti_crawl/crawldb
Injector: urlDir: /urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: dipiti_crawl/segments/20080421144048
Generator: filtering: true
Generator: topN: 1000
Generator: 0 records selected for fetching, exiting ...
segment is null
Exception in thread "main" java.lang.NullPointerException
        at org.apache.nutch.fetcher.Fetcher2.fetch(Fetcher2.java:924)
        at org.apache.nutch.crawl.DipitiCrawl.main(DipitiCrawl.java:117)

Am I missing something in how people have suggested using prefixes
instead of regular expressions for matching?
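For what it's worth, my understanding (which may be wrong) is that the
prefix filter just keeps a URL if it starts with one of the listed
lines.  A quick sketch of that logic, with a hypothetical helper (not
the actual Nutch code), shows one way a prefix list like mine could
silently filter everything out - e.g. if the injected URLs are "www."
variants of the listed prefixes:

```python
# Sketch of prefix-filter matching as I understand it (assumption:
# urlfilter-prefix keeps a URL iff it starts with a listed prefix).
# passes_prefix_filter is a hypothetical helper, not Nutch's API.
def passes_prefix_filter(url, prefixes):
    return any(url.startswith(p) for p in prefixes)

prefixes = ["http://example.com", "http://anotherexample.com"]

# A seed URL with a trailing slash still matches the bare prefix:
print(passes_prefix_filter("http://example.com/", prefixes))      # True

# ...but a "www." variant (say, after normalization or a redirect)
# does not, so it would be dropped at generate time:
print(passes_prefix_filter("http://www.example.com/", prefixes))  # False
```

If that's right, then "0 records selected for fetching" would just
mean none of my injected URLs started with any listed prefix.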

-- 
James Moore | [EMAIL PROTECTED]
blog.restphone.com
