Hi folks,

I just installed htdig-3.2.0b4-20030622, and discovered that it's not correctly handling Disallow: patterns from my robots.txt file. (I'm hoping this is the correct list to post this to!)

I have these lines in my robots.txt:
User-agent: *
Disallow: /WebObjects/

In my config file, I do NOT exclude /cgi-bin/ via exclude_urls. However, when I run rundig -vvv, it tells me that URLs like the following are rejected due to being "forbidden by server robots.txt":
href: http://www.mysite.edu/cgi-bin/WebObjects/blah/blah/blah


This shouldn't happen. It should only be rejecting URLs *starting* with "/WebObjects/" (at least, that's my interpretation of what I read at http://www.robotstxt.org/wc/norobots.html).
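
To make the behaviour I expect concrete, here is a minimal sketch in Python of prefix-only matching as I read that document. It is not taken from htdig's source; the function name is_disallowed and the rules list are made up purely for illustration.

```python
from urllib.parse import urlsplit

def is_disallowed(url, disallow_prefixes):
    """Return True if the URL's path starts with any Disallow prefix."""
    # Only the path part of the URL is compared against the rules.
    path = urlsplit(url).path or "/"
    # An empty Disallow value means "allow everything", so skip it.
    return any(p and path.startswith(p) for p in disallow_prefixes)

# Hypothetical rule set mirroring the robots.txt above.
rules = ["/WebObjects/"]

# The pattern appears in the middle of the path, not at the start.
print(is_disallowed("http://www.mysite.edu/cgi-bin/WebObjects/blah/blah/blah", rules))
# The path begins with the pattern.
print(is_disallowed("http://www.mysite.edu/WebObjects/blah", rules))
```

Under this reading, only the second URL would be rejected; the /cgi-bin/WebObjects/ URL would still be indexed.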

If I remove the "Disallow: /WebObjects/" line from robots.txt and rerun rundig, it now indexes those URLs.

I never had this problem in 3.1.6. Has something changed?

--
Patrick Robinson
AHNR Info Technology, Virginia Tech
[EMAIL PROTECTED]



_______________________________________________
htdig-dev mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/htdig-dev
