[Nutch-general] robots.txt

David Wojciechowski Fri, 30 Jun 2006 01:30:15 -0700

Hi,

I use Nutch 0.7.1 to crawl a few intranet servers.
Yesterday I tried to exclude some directories with a robots.txt file,
but nothing changed.
I copied this robots.txt to the server:


User-agent: NutchCVS
Disallow: /cgi-bin/
Disallow: /manuals/

The User-agent "NutchCVS" is the same as the robots agent name in
nutch-default.
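
A minimal sketch for checking that the rules above parse the way they are meant to, assuming Python's standard urllib.robotparser module. The host name intranet.example and the helper is_allowed are hypothetical, and this only tests the rules themselves, not how Nutch's own robots.txt parser reads them:

```python
from urllib.robotparser import RobotFileParser

# The robots.txt content from this message.
ROBOTS_TXT = """\
User-agent: NutchCVS
Disallow: /cgi-bin/
Disallow: /manuals/
"""


def is_allowed(agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if `agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # intranet.example is a placeholder host, not the real server.
    for path in ("/cgi-bin/test.cgi", "/manuals/index.html", "/index.html"):
        url = "http://intranet.example" + path
        print(path, "allowed" if is_allowed("NutchCVS", url) else "blocked")
```

If the two excluded directories are reported as blocked here, the rules themselves are probably fine, and the cause is more likely elsewhere, for example where the file is placed on the server or which agent name the crawler actually sends.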

Can anyone help me with this problem?

I'm crawling with this command:

bin/nutch crawl urls -dir crawl060621 -depth 15 &> crawl060621.log &

Greetings,
David

==========================================================

David Wojciechowski
Universitätsklinikum Freiburg
Klinikrechenzentrum
Agnesenstrasse 6-8
D-79106 Freiburg

Phone: 0761 / 270 - 1842
Fax: 0761 / 270 - 2276
E-Mail: [EMAIL PROTECTED]

==========================================================


