Hello,
 
I am trying to crawl a number of news sites. I would like to index only
specific pages based on the URL, e.g.
http://www.volkskrant.nl/[a-z]+/article[0-9]+.ece/.+ . It seems that
when I configure this pattern in the crawl URL filter, Nutch is unable
to crawl the complete site when there are no links between the pages
that match the pattern. Is there another configuration option that lets
Nutch crawl the complete site but index only specific pages?
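
To make the pattern concrete, here is a minimal sketch in plain Python (not Nutch configuration) of the decision I would like applied at index time only, while the crawl itself still follows every link. The helper name should_index and the two example URLs are made up for illustration, and escaping the literal dots is my assumption about what the pattern should mean.

```python
import re

# Pattern from the post, with the literal dots escaped (an assumption about intent).
ARTICLE_URL = re.compile(
    r"http://www\.volkskrant\.nl/[a-z]+/article[0-9]+\.ece/.+"
)


def should_index(url: str) -> bool:
    """Return True only for article pages; other pages are crawled but not indexed."""
    return ARTICLE_URL.fullmatch(url) is not None


# Hypothetical URLs, invented for illustration only.
print(should_index("http://www.volkskrant.nl/binnenland/article123456.ece/Some_title"))
print(should_index("http://www.volkskrant.nl/binnenland/"))
```

A section page such as the second URL would still need to be fetched so that its outgoing links are followed; it just would not end up in the index.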
 
Sebastiaan
 
