[EMAIL PROTECTED] wrote:
> Hi,
> I'm trying to run Nutch in our medical center and I have a small problem.
> We have a few intranet servers and I want Nutch to skip a few
> directories.
> For example:
> 
> http://sapdoku.ukl.uni-freiburg.de/abteilung/pvs/dokus/
> 
> I put these URLs in crawl-urlfilter.txt, for example:
> 
> -^http://([a-z0-9]*\.)*sapdoku.ukl.uni-freiburg.de/abteilung/pvs/dokus
> 
> But nothing happens: Nutch doesn't skip these URLs, and I don't know why...
> 
> :( Can anyone help me?
> 
> I'm crawling with this command:
> 
> bin/nutch crawl urls -dir crawl060621 -depth 15 &> crawl060621.log &
> 
> I'm using release 0.7.1.

Hi David,

Do you have regex-urlfilter enabled in your crawler-site config file or
nutch-site config file? I suspect that the plugin might not be loaded
yet. Also, is there perhaps another "allow all URLs" line above the one
you mentioned? The rules are checked from top to bottom and the first
match wins, so an earlier "+" rule would override your "-" rule.
I don't think the ([a-z0-9]*\.)* prefix should lead to problems (it uses
* and not +, so it also matches when nothing precedes the host name). But
if your URL does not have anything in front of sapdoku, maybe try
dropping that part.
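
To illustrate the rule-order point, here is a minimal Python sketch of
first-match-wins filtering. It is not Nutch code: load_rules and accepts
are hypothetical helpers, the "+" allow rule and the index.html page are
made up for the example, and Python's re module only approximates the
regex engine Nutch uses. Only the "-" rule is taken from your mail.

```python
import re

def load_rules(text):
    """Parse '+regex' / '-regex' lines; skip blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """Return the verdict of the first matching rule (first match wins)."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False  # assumption: a URL that matches no rule is ignored

# The exclude rule from the original mail.
EXCLUDE = r"-^http://([a-z0-9]*\.)*sapdoku.ukl.uni-freiburg.de/abteilung/pvs/dokus"
# Hypothetical "allow the whole domain" rule, made up for this example.
ALLOW = r"+^http://([a-z0-9]*\.)*ukl.uni-freiburg.de/"
# index.html is a hypothetical page below the directory to be skipped.
URL = "http://sapdoku.ukl.uni-freiburg.de/abteilung/pvs/dokus/index.html"

allow_first = load_rules(ALLOW + "\n" + EXCLUDE)
exclude_first = load_rules(EXCLUDE + "\n" + ALLOW)

print(accepts(allow_first, URL))    # the "+" rule matches first, URL is kept
print(accepts(exclude_first, URL))  # the "-" rule matches first, URL is skipped
```

With the exclude rule listed first the URL is skipped; with the allow
rule first it is still crawled, even though the exclude pattern itself
matches the URL.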


Good luck,
 Stefan


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general