[EMAIL PROTECTED] wrote:
> Hi,
>
> I'm trying to run Nutch in our medical center and I have a small problem.
> We have a few intranet servers and I want Nutch to skip a few
> directories, for example:
>
> http://sapdoku.ukl.uni-freiburg.de/abteilung/pvs/dokus/
>
> I wrote these URLs in crawl-urlfilter.txt, for example:
>
> -^http://([a-z0-9]*\.)*sapdoku.ukl.uni-freiburg.de/abteilung/pvs/dokus
>
> But nothing happens. Nutch doesn't skip these URLs, and I don't know why...
>
> :( Can anyone help me?
>
> I'm crawling with this command:
>
> bin/nutch crawl urls -dir crawl060621 -depth 15 &> crawl060621.log &
>
> I'm using release 0.7.1.

Hi David,

do you have regex-urlfilter in your crawler-site config file or nutch-site config file? I suspect that the plugin might not be loaded yet.

Also, is there perhaps another "allow all URLs" line above the one you mentioned?

I don't think the ([a-z0-9]*\.)* should cause problems (it is * and not +, so it can also match nothing). But if your URL does not have anything in front of sapdoku, maybe try dropping that part.

Good luck,
Stefan

_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
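
To illustrate the rule-order question raised in the reply, here is a minimal Python sketch. It is not Nutch code: the `accepts` helper and the two rule lists are hypothetical, and it assumes that the filter works top to bottom, that the first matching pattern decides, and that a URL matching no rule is rejected. Python's `re.search` stands in for whatever regex engine the filter actually uses, so treat it as an approximation.

```python
import re

# Hypothetical stand-in for a crawl-urlfilter.txt rule list:
# each rule is (sign, pattern), evaluated from top to bottom.
def accepts(rules, url):
    """Return True if the URL would be crawled under first-match-wins."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    # Assumption: a URL that matches no rule is skipped.
    return False

# The exclusion pattern from the original question, without the leading "-".
EXCLUDE = r"^http://([a-z0-9]*\.)*sapdoku.ukl.uni-freiburg.de/abteilung/pvs/dokus"
# A catch-all "allow everything" pattern (assumed for illustration).
ALLOW_ALL = r"."

url = "http://sapdoku.ukl.uni-freiburg.de/abteilung/pvs/dokus/index.html"

# Catch-all first: the exclusion rule is never reached.
shadowed = [("+", ALLOW_ALL), ("-", EXCLUDE)]
# Exclusion first: the directory is skipped, everything else is allowed.
correct = [("-", EXCLUDE), ("+", ALLOW_ALL)]

print(accepts(shadowed, url))  # True
print(accepts(correct, url))   # False
```

The pattern itself matches the example URL even with nothing in front of sapdoku, which fits the guess that the regex is not the culprit. If the real file has an accept rule above the exclusion line, moving the exclusion line to the top would be the first thing to try.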
