Re: Not crawling certain directories.

Andrzej Bialecki Sat, 15 Jul 2006 14:38:58 -0700

Matthew Holt wrote:

One more question.. I'm using nutch-0.8.0 and trying to index a domain, and I want to exclude a certain directory from the crawl. In crawl-urlfilter.txt I have defined the following:
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*wwwapps.mywebsite.com*/
-^http://([a-z0-9]*\.)*wwwapps.mywebsite.com*/yummy
However, the /yummy directory is still crawled. Any ideas as to what is going on? Thanks..

Rules are processed in order, and processing terminates as soon as a rule matches. Your first rule allows all subdirs, so the /yummy exclusion is never reached. Just swap these two rules and all should be ok.
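
To illustrate the first-match-wins behaviour, here is a minimal sketch in Python. It is not Nutch code: the function name first_match_accepts and the sample URL are made up for illustration, Python's re module stands in for the Java regex engine, and it assumes that a URL matching no rule is rejected. The two patterns are copied unchanged from the question.

```python
import re

def first_match_accepts(rules, url):
    """Apply rules in order; the first rule whose pattern matches decides.

    Each rule is '+' (accept) or '-' (reject) followed by a regex.
    Assumption: a URL that matches no rule at all is rejected.
    """
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False

# Rules in the order given in the question.
original = [
    r"+^http://([a-z0-9]*\.)*wwwapps.mywebsite.com*/",
    r"-^http://([a-z0-9]*\.)*wwwapps.mywebsite.com*/yummy",
]
# Same rules with the exclusion placed first.
swapped = list(reversed(original))

# Hypothetical URL inside the directory that should be skipped.
url = "http://wwwapps.mywebsite.com/yummy/page.html"

print(first_match_accepts(original, url))  # accept rule matches first
print(first_match_accepts(swapped, url))   # reject rule matches first
```

With the original order the broad accept rule matches before the /yummy rule is ever consulted; with the swapped order the /yummy rule is tested first, and other URLs on the host still fall through to the accept rule.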


--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
