Hi, all. I've run into a problem with robots.txt directives not being applied
properly. All of our sites have robots.txt files that allow htdig full access (an
empty Disallow: line) and that may or may not place restrictions on other robots.
Here's htdig -v -v -v -v -v output from a site that has no restrictions:
Parsing robots.txt file using myname = htdig
Robots.txt line: # robots.txt for Environmental Assessment
Robots.txt line: User-agent: htdig
Found 'user-agent' line: htdig
Robots.txt line: Disallow:
Found 'disallow' line:
Robots.txt line: # Rest of world:
Robots.txt line: User-agent: *
Found 'user-agent' line: *
Robots.txt line: Disallow:
Pattern:
1 - Closing previous connection with the remote host
pushed
Rejected: forbidden by server robots.txt!
pick: eadev.acponline.org, # servers = 1
> eadev.acponline.org supports HTTP persistent connections (infinite)
ht://dig End Time: Thu Nov 13 09:52:02 2003
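
For reference, here's the complete robots.txt in question, pieced together from the
trace above (the blank line between the two records is my assumption):

    # robots.txt for Environmental Assessment
    User-agent: htdig
    Disallow:

    # Rest of world:
    User-agent: *
    Disallow: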
The access log confirms that htdig is identifying itself with the user agent 'htdig':
172.19.31.12 - - [13/Nov/2003:09:40:24 -0500] "HEAD /robots.txt HTTP/1.1" 200 0 "-"
"htdig"
172.19.31.12 - - [13/Nov/2003:09:40:24 -0500] "GET /robots.txt HTTP/1.1" 200 138 "-"
"htdig"
Removing the robots.txt file results in a normal run. Any ideas on what's causing this?
Neil Kohl
Manager, ACP Online
American College of Physicians
[EMAIL PROTECTED] 215.351.2638, 800.523.1546 x2638