From earlier messages here, it seems best to avoid the regular-expression filters and to use prefix-urlfilter instead. (I had been having issues with crawls stalling in the fetch phase - the maps would finish, but the reduce never got past 16% or so.)
I tried just listing the sites I cared about in prefix-urlfilter.txt, like so:

http://example.com
http://anotherexample.com

and completely removing references to the regex filter. But that ends up fetching nothing:

rootUrlDir = /urls
threads = 1000
depth = 5
topN = 1000
Injector: starting
Injector: crawlDb: dipiti_crawl/crawldb
Injector: urlDir: /urls
Injector: Converting injected urls to crawl db entries.
Injector: Merging injected urls into crawl db.
Injector: done
Generator: Selecting best-scoring urls due for fetch.
Generator: starting
Generator: segment: dipiti_crawl/segments/20080421144048
Generator: filtering: true
Generator: topN: 1000
Generator: 0 records selected for fetching, exiting ...
segment is null
Exception in thread "main" java.lang.NullPointerException
        at org.apache.nutch.fetcher.Fetcher2.fetch(Fetcher2.java:924)
        at org.apache.nutch.crawl.DipitiCrawl.main(DipitiCrawl.java:117)

Am I missing something when people have suggested using prefixes instead of regular expressions for matching?

-- 
James Moore | [EMAIL PROTECTED]
blog.restphone.com
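For reference, here is roughly what I understand the setup should look like - this is a sketch based on the default Nutch config, not my exact files, and the plugin list will differ per install. The prefix filter only takes effect if urlfilter-prefix is listed in plugin.includes (and urlfilter-regex dropped):

```xml
<!-- conf/nutch-site.xml (sketch; plugin list other than urlfilter-prefix
     is just the stock default and should be adjusted to your setup) -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-prefix|parse-(text|html)|index-basic|query-(basic|site|url)</value>
</property>
```

And the prefix file itself, one prefix per line:

```
# conf/prefix-urlfilter.txt
http://example.com
http://anotherexample.com
```

If the plugin isn't included, or the prefixes don't match the injected URLs exactly from the start of the string, the generator would select 0 records, which looks like what happens above.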
