Matthew Holt wrote:
One more question. I'm using nutch-0.8.0 to index a domain and want to
exclude a certain directory from the crawl. In crawl-urlfilter.txt I
have defined the following:
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*wwwapps.mywebsite.com*/
-^http://([a-z0-9]*\.)*wwwapps.mywebsite.com*/yummy
However, the /yummy directory is still crawled. Any idea what is going
on? Thanks.
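Rules are processed in order, and processing stops at the first rule
that matches. Your first rule (+) already matches every URL on that
host, including everything under /yummy, so the exclusion rule (-) is
never reached. Swap the two rules so that the exclusion comes first,
and it should work.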
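
To see why the order matters, here is a small Python sketch of
first-match-wins filtering. It is only an illustration under stated
assumptions: parse_rules and accepts are hypothetical helper names, not
part of Nutch; Python's re module stands in for the regex engine Nutch
uses; and a URL that matches no rule is assumed to be rejected.

```python
import re

# The two rules from the question, in their original order.
ORIGINAL = r"""
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*wwwapps.mywebsite.com*/
-^http://([a-z0-9]*\.)*wwwapps.mywebsite.com*/yummy
"""

# The same two rules with the exclusion moved ahead of the accept rule.
SWAPPED = r"""
# skip the /yummy directory, then accept the rest of the host
-^http://([a-z0-9]*\.)*wwwapps.mywebsite.com*/yummy
+^http://([a-z0-9]*\.)*wwwapps.mywebsite.com*/
"""


def parse_rules(text):
    """Turn filter text into an ordered list of (accept, pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the verdict of the first matching rule (assumed behaviour)."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept  # first match wins; later rules are not consulted
    return False  # assumption: reject when nothing matches


if __name__ == "__main__":
    url = "http://wwwapps.mywebsite.com/yummy/page.html"
    print("original order:", accepts(parse_rules(ORIGINAL), url))
    print("swapped order: ", accepts(parse_rules(SWAPPED), url))
```

With the original order the /yummy URL is accepted by the first rule;
with the swapped order the exclusion rule matches first and the URL is
rejected, while other pages on the host are still accepted.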
--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com