I *think* that format is correct. I assume you injected matching URLs in the inject step; you can -dump the crawldb (via bin/nutch readdb) and double-check that they are in there. Note that limiting by, say, http://example.com will miss http://www.example.com, or at least that's what I'd think without trying.
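Untested, off the top of my head, but something like this should show what actually got injected (assuming your crawldb is at dipiti_crawl/crawldb as in your log; crawldb-dump is just an arbitrary output dir):

  bin/nutch readdb dipiti_crawl/crawldb -dump crawldb-dump
  less crawldb-dump/part-00000

And to cover the www variant, I'd simply list both prefixes:

  # prefix-urlfilter.txt -- one prefix per line; URLs matching none of them are dropped
  http://example.com
  http://www.example.com
  http://anotherexample.com
  http://www.anotherexample.com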
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: James Moore <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Monday, April 21, 2008 7:05:14 PM
> Subject: using prefix-urlfilter instead of regular expressions
>
> From earlier messages here, it seems best to avoid the regular
> expression filters and to use prefix-urlfilter instead. (I had been
> having issues with crawls stalling in the fetch phase - the maps would
> finish, but the reduce never got past 16% or so.)
>
> I tried just listing the sites I cared about in prefix-urlfilter.txt like so:
>
> http://example.com
> http://anotherexample.com
>
> And completely removing references to the regex filter.
> But that ends up fetching nothing:
>
> rootUrlDir = /urls
> threads = 1000
> depth = 5
> topN = 1000
> Injector: starting
> Injector: crawlDb: dipiti_crawl/crawldb
> Injector: urlDir: /urls
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: dipiti_crawl/segments/20080421144048
> Generator: filtering: true
> Generator: topN: 1000
> Generator: 0 records selected for fetching, exiting ...
> segment is null
> Exception in thread "main" java.lang.NullPointerException
>         at org.apache.nutch.fetcher.Fetcher2.fetch(Fetcher2.java:924)
>         at org.apache.nutch.crawl.DipitiCrawl.main(DipitiCrawl.java:117)
>
> Am I missing something when people have suggested using prefixes
> instead of regular expressions for matching?
>
> --
> James Moore | [EMAIL PROTECTED]
> blog.restphone.com
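One more thought, since you mention removing the regex filter entirely: if that meant taking urlfilter-regex out of plugin.includes in nutch-site.xml, double-check that urlfilter-prefix went in in its place, so the prefix filter actually gets loaded. I haven't tested this exact snippet, and the rest of the plugin list here is just the stock one from nutch-default.xml, but I'd expect something like:

  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-prefix|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
  <!-- where the prefix filter looks for its list; this is already the default -->
  <property>
    <name>urlfilter.prefix.file</name>
    <value>prefix-urlfilter.txt</value>
  </property>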
