I just installed htdig-3.2.0b4-20030622 and discovered that it's not correctly handling Disallow: patterns from my robots.txt file. (I'm hoping this is the correct list to post this to!)
I have these lines in my robots.txt:

    User-agent: *
    Disallow: /WebObjects/
In my config file, I do NOT exclude /cgi-bin/ via exclude_urls. However, when I run rundig -vvv, it tells me that URLs like the following are rejected as "forbidden by server robots.txt":
href: http://www.mysite.edu/cgi-bin/WebObjects/blah/blah/blah
This shouldn't happen. It should only be rejecting URLs *starting* with "/WebObjects/" (at least, that's my interpretation of what I read at http://www.robotstxt.org/wc/norobots.html).
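To make the difference concrete, here is a minimal Python sketch of the prefix interpretation described above, next to a substring match that would produce the rejection I'm seeing. The function names are made up for illustration, and the substring version is only a guess at the kind of matching that could cause this symptom, not a statement about what htdig actually does internally.

```python
from urllib.parse import urlsplit


def disallowed_by_prefix(url, disallow_paths):
    """True if the URL's path STARTS WITH one of the Disallow values.

    This is my reading of the robots.txt convention: an empty
    Disallow value disallows nothing, so empty strings are skipped.
    """
    path = urlsplit(url).path or "/"
    return any(p and path.startswith(p) for p in disallow_paths)


def disallowed_by_substring(url, disallow_paths):
    """True if a Disallow value appears ANYWHERE in the URL's path.

    Hypothetical: shown only because it reproduces the symptom.
    """
    path = urlsplit(url).path or "/"
    return any(p and p in path for p in disallow_paths)


rules = ["/WebObjects/"]
cgi_url = "http://www.mysite.edu/cgi-bin/WebObjects/blah/blah/blah"
top_url = "http://www.mysite.edu/WebObjects/blah"

print(disallowed_by_prefix(cgi_url, rules))     # False: should be indexed
print(disallowed_by_prefix(top_url, rules))     # True: correctly rejected
print(disallowed_by_substring(cgi_url, rules))  # True: the behavior I'm seeing
```

With prefix matching, only the second URL is excluded, which is what I would expect from my robots.txt.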
If I remove the "Disallow: /WebObjects/" line from robots.txt and rerun rundig, it now indexes those URLs.
I never had this problem in 3.1.6. Has something changed?
-- Patrick Robinson AHNR Info Technology, Virginia Tech [EMAIL PROTECTED]
_______________________________________________ htdig-dev mailing list [EMAIL PROTECTED] https://lists.sourceforge.net/lists/listinfo/htdig-dev