You can also stop Nutch from crawling those pages via robots.txt, provided Nutch is configured to respect those rules; it does so by default. If you haven't modified the http.robots.agents setting in nutch-default.xml/nutch-site.xml, the following robots.txt rule should work:

User-agent: NutchCVS
Disallow: /abteilung/pvs/dokus/
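If you have changed the agent name, the robots.txt entry has to match whatever http.robots.agents lists. A minimal sketch of the nutch-site.xml override, assuming "NutchCVS" is still your agent name (adjust the value to your own agent string):

  <property>
    <name>http.robots.agents</name>
    <value>NutchCVS,*</value>
    <description>Agent strings to check robots.txt against, in order.</description>
  </property>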
Cheers,
Jayant

On 6/21/06, Stefan Neufeind <[EMAIL PROTECTED]> wrote:
> [EMAIL PROTECTED] wrote:
> > hi,
> > I'm trying to run Nutch at our clinical center and I have a small problem.
> > We have a few intranet servers, and I want Nutch to skip a few
> > directories, for example:
> >
> > http://sapdoku.ukl.uni-freiburg.de/abteilung/pvs/dokus/
> >
> > I put this URL in crawl-urlfilter.txt, for example:
> >
> > -^http://([a-z0-9]*\.)*sapdoku.ukl.uni-freiburg.de/abteilung/pvs/dokus
> >
> > but nothing happens. Nutch doesn't skip these URLs, and I don't know why...
> > :( can anyone help me?
> >
> > I'm crawling with this command:
> >
> > bin/nutch crawl urls -dir crawl060621 -depth 15 &> crawl060621.log &
> >
> > I'm using release 0.7.1
>
> Hi David,
>
> do you have regex-urlfilter enabled in your crawler-site config file or
> nutch-site config file? I suspect that the plugin might not be loaded
> yet. Also, do you maybe have another "allow all URLs" line above the
> one you mentioned?
> I don't think the ([a-z0-9]*\.)* should cause problems (it is * and
> not +, so that should be fine). But if your URL never has anything in
> front of sapdoku, maybe try dropping that part.
>
> Good luck,
> Stefan
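To illustrate Stefan's ordering point: the rules in crawl-urlfilter.txt are evaluated top to bottom, and the first matching pattern decides whether a URL is kept, so a skip rule placed below a catch-all accept line never fires. A sketch of a working ordering, assuming the standard regex-urlfilter semantics (the host pattern mirrors the one from the original mail; escaping the dots is optional but keeps the regex strict):

  # skip the dokus directory (must come before any accept-all rule)
  -^http://([a-z0-9]*\.)*sapdoku\.ukl\.uni-freiburg\.de/abteilung/pvs/dokus
  # accept everything else on that host
  +^http://([a-z0-9]*\.)*sapdoku\.ukl\.uni-freiburg\.de/
  # reject anything not matched above
  -.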
