Matthew Holt wrote:
One more question. I'm using nutch-0.8.0 to index a domain and want to exclude a certain directory from the crawl. In crawl-urlfilter.txt I have defined the following:

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*wwwapps.mywebsite.com*/
-^http://([a-z0-9]*\.)*wwwapps.mywebsite.com*/yummy

However, the /yummy directory is still crawled. Any ideas as to what is going on? Thanks..

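Rules are processed in order, and processing stops at the first rule that matches. Your first rule accepts every URL on that host, including everything under /yummy, so the second rule is never reached. Swap the two rules, so that the more specific exclusion comes first, and all should be ok.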
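
To see the effect of rule order, here is a rough Python sketch of first-match-wins filtering. It is not Nutch's actual code: the helper names parse_rules and accepts are made up for illustration, and it assumes that a URL matching no rule is rejected and that patterns are applied as a search rather than a full match. The two rule sets are your original rules and the same rules swapped.

```python
import re


def parse_rules(text):
    """Parse '+regex' / '-regex' lines; skip blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # First character is the verdict, the rest is the pattern.
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules


def accepts(rules, url):
    """Return the verdict of the first matching rule (assumed: reject if none match)."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False


# Rules exactly as posted: the broad accept rule comes first.
ORIGINAL = r"""
+^http://([a-z0-9]*\.)*wwwapps.mywebsite.com*/
-^http://([a-z0-9]*\.)*wwwapps.mywebsite.com*/yummy
"""

# Same rules, with the more specific exclusion first.
SWAPPED = r"""
-^http://([a-z0-9]*\.)*wwwapps.mywebsite.com*/yummy
+^http://([a-z0-9]*\.)*wwwapps.mywebsite.com*/
"""

url = "http://wwwapps.mywebsite.com/yummy/page.html"
print(accepts(parse_rules(ORIGINAL), url))  # True: the accept rule matches first
print(accepts(parse_rules(SWAPPED), url))   # False: the exclusion matches first
```

With the swapped order, URLs outside /yummy still fall through to the accept rule, so the rest of the host is crawled as before.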

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

