You can also stop Nutch from crawling those pages via robots.txt,
provided Nutch is set to respect those rules (it does by default). If
you haven't modified the http.robots.agents setting in
nutch-default.xml/nutch-site.xml, the following robots.txt rule should
work:

User-agent: NutchCVS
Disallow: /abteilung/pvs/dokus/
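
If you do override http.robots.agents in nutch-site.xml, the agent name
in robots.txt has to match one of the names in that list. A minimal
sketch of the property (the value here is just an example, not
necessarily your actual config):

<property>
  <name>http.robots.agents</name>
  <value>NutchCVS,*</value>
  <description>Comma-separated agent strings, checked against
  robots.txt in decreasing order of precedence.</description>
</property>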

Cheers,
Jayant

On 6/21/06, Stefan Neufeind <[EMAIL PROTECTED]> wrote:
[EMAIL PROTECTED] wrote:
> hi,
> i'm trying to run nutch at our clinical center and i have a little problem.
> we have a few intranet servers and i want nutch to skip a few
> directories.
> for example:
>
> http://sapdoku.ukl.uni-freiburg.de/abteilung/pvs/dokus/
>
> i wrote these urls in crawl-urlfilter.txt. for example:
>
> -^http://([a-z0-9]*\.)*sapdoku.ukl.uni-freiburg.de/abteilung/pvs/dokus
>
> but nothing happens. nutch doesn't skip these urls, and i don't know why...
>
> :( can anyone help me?
>
> i'm crawling with this command:
>
> bin/nutch crawl urls -dir crawl060621 -depth 15 &> crawl060621.log &
>
> i'm using release 0.7.1

Hi David,

do you have regex-urlfilter enabled in your crawl config file or in
nutch-site.xml? I suspect the plugin might not be loaded at all. Also,
do you perhaps have an "allow all URLs" line above the one you
mentioned?
I don't think the ([a-z0-9]*\.)* part should cause problems (it is *
and not +, so an empty prefix still matches). But if your URLs never
have anything in front of sapdoku, you could try dropping that part.
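
For reference, here is a sketch of a crawl-urlfilter.txt ordering that
should skip that directory (the escaped dots are my addition; rules are
applied top to bottom and the first match wins, so the exclude rule has
to come before any catch-all accept):

# skip the pvs/dokus directory (must come before the accept rule)
-^http://([a-z0-9]*\.)*sapdoku\.ukl\.uni-freiburg\.de/abteilung/pvs/dokus

# accept everything else on the host
+^http://([a-z0-9]*\.)*sapdoku\.ukl\.uni-freiburg\.de/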


Good luck,
 Stefan



--
www.jkg.in | http://www.jkg.in/contact-me/
Jayant Kr. Gandhi | +91-9871412929
M.Tech. Computer Tech. Class of 2007,
D-38, Aravali Hostel, IIT Delhi,
Hauz Khas, Delhi-110016
