indexing subset of documents based on regex

Sebastiaan Raaphorst Thu, 05 Jun 2008 04:59:44 -0700

Hello,
 
I am trying to crawl a number of news sites. I would like to
index only specific pages based on the URL, e.g.
http://www.volkskrant.nl/[a-z]+/article[0-9]+.ece/.+ . It seems that
when I configure this pattern in the crawl-url filter, Nutch is unable
to crawl the complete site (at least when there are no links between the
pages that match the pattern). Is there another configuration option that
lets Nutch crawl the complete site but index only specific pages?
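
To illustrate the behaviour I am after, here is a minimal sketch in
Python. It is not Nutch code or configuration: the names should_crawl and
should_index are made up for this example, the sample URLs are
hypothetical, and I have escaped the literal dots in the pattern.

```python
import re

# Pages to index: the pattern from this message, with literal dots escaped.
INDEX_PATTERN = re.compile(
    r"http://www\.volkskrant\.nl/[a-z]+/article[0-9]+\.ece/.+"
)

# Pages to fetch and follow links from: anything on the site (assumed scope).
CRAWL_PATTERN = re.compile(r"http://www\.volkskrant\.nl/.*")


def should_crawl(url: str) -> bool:
    """Return True if the crawler may fetch this URL and follow its links."""
    return CRAWL_PATTERN.fullmatch(url) is not None


def should_index(url: str) -> bool:
    """Return True if the fetched page should end up in the index."""
    return INDEX_PATTERN.fullmatch(url) is not None


if __name__ == "__main__":
    # Hypothetical URLs, for illustration only.
    for url in (
        "http://www.volkskrant.nl/binnenland/",
        "http://www.volkskrant.nl/binnenland/article123456.ece/Some_title",
    ):
        print(url, should_crawl(url), should_index(url))
```

So a section page would be crawled for its links but not indexed, while an
article page would be both crawled and indexed.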
 
Sebastiaan
