According to Jamie Anstice:
> I've been having a bit of a look at the robots.txt handling code, and
> while it works in most situations, it's not quite perfect. The HTML
> 4.0 spec says that the Disallow lines in the robots.txt file should be
> applied to the start of the URL's path. However, the current htDig
> code matches at any point in the URL. While this is not usually a
> problem, when you have a line like "Disallow: /www" this tends to stop
> htdig from doing much at all. Look for a non-patch in a day or so
> (it's a non-patch because my source has rather diverged from the CVS
> version, and that's before I ran everything through astyle to apply
> consistent formatting).
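
To make the difference concrete, here is a minimal Python sketch of the two matching behaviours described above. It is not the htDig code (which is C++); the function names and the example URL are hypothetical, chosen only for illustration. It also suggests why a rule like "/www" is so damaging when matched anywhere: presumably every URL whose host starts with "www" contains "/www" right after the scheme.

```python
from urllib.parse import urlsplit


def blocked_anywhere(url, disallow):
    """Behaviour reported in the bug: the rule matches at any point in the URL."""
    if not disallow:
        return False  # an empty Disallow value excludes nothing
    return disallow in url


def blocked_prefix(url, disallow):
    """Intended behaviour: the rule applies only to the start of the URL's path."""
    if not disallow:
        return False
    path = urlsplit(url).path or "/"
    return path.startswith(disallow)


url = "http://www.example.com/docs/index.html"  # hypothetical URL
rule = "/www"

print(blocked_anywhere(url, rule))  # True: "/www" occurs inside "//www.example.com"
print(blocked_prefix(url, rule))    # False: the path is "/docs/index.html"
```

With prefix matching, the same rule still excludes a URL such as http://www.example.com/www/page.html, whose path really does begin with "/www".
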
Thanks for the bug report and the non-patch. It looks like an easy fix. I was concerned that the problem existed in 3.1.6 as well, but I've just confirmed, by carefully examining the code, that 3.1.6 does the right thing. The bug was introduced in 3.2 during the move to regex support for robots.txt.

Of course, this raises the question of why regex support is being added for this at all. Doesn't the original standard require exact matches? The FAQ states that wildcards are not supported. (info.webcrawler.com seems to be gone, but the standard now appears to be hosted at www.robotstxt.org. We need to update our links!) I did find a Draft version 2.0 standard for robots.txt, but it doesn't clearly define the format to use for regular expressions, and it says that, pending discussion, version 2.0 semantics may not be implemented.

So, why are we doing it, and are we even doing it properly?

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

