You can also stop Nutch from crawling those pages via robots.txt,
provided Nutch is set to respect those rules (which it does by
default). If you haven't modified the http.robots.agents setting in
nutch-default.xml/nutch-site.xml, the following robots.txt rule should
work:

User-agent: NutchCVS
Disallow: /abteilung/pvs/dokus/
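
For reference, this is roughly what that property looks like if you
ever want to set it explicitly in nutch-site.xml (a sketch only; the
agent names below are just the usual 0.7.x defaults, so check the
value shipped in your own nutch-default.xml):

<property>
  <name>http.robots.agents</name>
  <value>NutchCVS,Nutch,*</value>
  <description>Agent strings to look for in robots.txt,
  comma-separated, in decreasing order of precedence.</description>
</property>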

Cheers,
Jayant

On 6/21/06, Stefan Neufeind <[EMAIL PROTECTED]> wrote:
> [EMAIL PROTECTED] wrote:
> > Hi,
> > I'm trying to run Nutch at our clinical center, and I have a small
> > problem. We have a few intranet servers, and I want Nutch to skip
> > a few directories, for example:
> >
> > http://sapdoku.ukl.uni-freiburg.de/abteilung/pvs/dokus/
> >
> > I added these URLs to crawl-urlfilter.txt, for example:
> >
> > -^http://([a-z0-9]*\.)*sapdoku.ukl.uni-freiburg.de/abteilung/pvs/dokus
> >
> > but nothing happens. Nutch doesn't skip these URLs, and I don't know why...
> >
> > :( can anyone help me?
> >
> > I'm crawling with this command:
> >
> > bin/nutch crawl urls -dir crawl060621 -depth 15 &> crawl060621.log &
> >
> > I'm using release 0.7.1
>
> Hi David,
>
> do you have regex-urlfilter listed in your crawl config file or in
> your nutch-site config file? I suspect the plugin might not be
> loaded yet. Also, do you perhaps have an "allow all URLs" line above
> the one you mentioned?
> I don't think the ([a-z0-9]*\.)* should cause problems (it is * and
> not +, so that part should be fine). But if your URL has nothing in
> front of sapdoku, you could try dropping that part.
>
>
> Good luck,
>  Stefan
>
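
Two things worth checking, following up on Stefan's questions above
(sketches only, not tested against your setup):

1) That the urlfilter-regex plugin is actually loaded, e.g. via
plugin.includes in nutch-site.xml (the list below is only
illustrative; start from the plugin.includes value in your own
nutch-default.xml):

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
</property>

2) Rule order in crawl-urlfilter.txt: the regex filter applies its
rules top to bottom and the first match wins, so the exclusion has to
come before any catch-all accept line, roughly like this (dots escaped
for precision):

# skip the dokus directory on sapdoku
-^http://([a-z0-9]*\.)*sapdoku\.ukl\.uni-freiburg\.de/abteilung/pvs/dokus

# accept everything else
+.

If the +. (or a broad +^http://...) line comes first, it matches first
and the exclusion is never reached.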


-- 
www.jkg.in | http://www.jkg.in/contact-me/
Jayant Kr. Gandhi | +91-9871412929
M.Tech. Computer Tech. Class of 2007,
D-38, Aravali Hostel, IIT Delhi,
Hauz Khas, Delhi-110016


