I *think* that format is correct.  I assume you injected matching URLs in the 
inject step... you can dump the DB (e.g. with bin/nutch readdb <crawldb> -dump 
<outdir>) and double-check they are in there.  Note that limiting by, say, 
http://example.com will miss http://www.example.com, or at least that's what 
I'd expect without trying.
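To illustrate the point (a minimal sketch of prefix-filter semantics, not Nutch's actual implementation -- the class and method names below are made up): a prefix filter is just a plain string-prefix check against each line of prefix-urlfilter.txt, so "http://example.com" does not cover "http://www.example.com":

```java
// Sketch of prefix-based URL filtering: a URL passes only if it
// literally starts with one of the configured prefixes.
public class PrefixFilterSketch {

    // Return true if the URL starts with any of the given prefixes.
    static boolean accepts(String[] prefixes, String url) {
        for (String p : prefixes) {
            if (url.startsWith(p)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] prefixes = {
            "http://example.com",
            "http://anotherexample.com"
        };
        // Matches: shares the exact prefix.
        System.out.println(accepts(prefixes, "http://example.com/page.html"));
        // Does not match: "http://www." is not the same prefix.
        System.out.println(accepts(prefixes, "http://www.example.com/page.html"));
    }
}
```

So if your injected seed URLs use the www. form of the host, you'd want both variants listed in prefix-urlfilter.txt.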

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


----- Original Message ----
> From: James Moore <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Monday, April 21, 2008 7:05:14 PM
> Subject: using prefix-urlfilter instead of regular expressions
> 
> From earlier messages here, it seems best to avoid the regular
> expression filters and to use prefix-urlfilter instead.  (I had been
> having issues with crawls stalling in the fetch phase - the maps would
> finish, but the reduce never got past 16% or so).
> 
> I tried just listing the sites I cared about in prefix-urlfilter.txt like so:
> 
> http://example.com
> http://anotherexample.com
> 
> And completely removing references to the regular expression filters.
> But that ends up fetching nothing:
> 
> rootUrlDir = /urls
> threads = 1000
> depth = 5
> topN = 1000
> Injector: starting
> Injector: crawlDb: dipiti_crawl/crawldb
> Injector: urlDir: /urls
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: dipiti_crawl/segments/20080421144048
> Generator: filtering: true
> Generator: topN: 1000
> Generator: 0 records selected for fetching, exiting ...
> segment is null
> Exception in thread "main" java.lang.NullPointerException
>         at org.apache.nutch.fetcher.Fetcher2.fetch(Fetcher2.java:924)
>         at org.apache.nutch.crawl.DipitiCrawl.main(DipitiCrawl.java:117)
> 
> Am I missing something when people have suggested using prefixes
> instead of regular expressions for matching?
> 
> -- 
> James Moore | [EMAIL PROTECTED]
> blog.restphone.com
