I've been having a look at the robots.txt handling code, and while it works in most situations, it's not quite right. The HTML 4.0 spec says that the Disallow lines in the robots.txt file should be matched against the start of the URL's path. However, the current htdig code matches them at any point in the URL. This is not usually a problem, but when you have a line like "Disallow: /www", it tends to stop htdig from doing much at all, since "/www" also matches inside any URL that starts with "http://www". Look for a non-patch in a day or so (it's a non-patch because my source has rather diverged from the CVS version, and that was before I ran everything through astyle to apply consistent formatting).
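To make the difference concrete, here is a rough Python sketch of the two matching strategies. It is not the htdig code (which is C++), just an illustration: the function names and the example.com URLs are made up, and it ignores User-agent sections and everything else a real robots.txt parser has to deal with.

```python
from urllib.parse import urlsplit


def blocked_by_substring(url, disallow_rules):
    """Match each rule anywhere in the full URL (the behaviour described above)."""
    # An empty Disallow value means "allow everything", so skip empty rules.
    return any(bool(rule) and rule in url for rule in disallow_rules)


def blocked_by_prefix(url, disallow_rules):
    """Match each rule only against the start of the URL's path."""
    path = urlsplit(url).path or "/"
    return any(bool(rule) and path.startswith(rule) for rule in disallow_rules)


if __name__ == "__main__":
    rules = ["/www"]  # from a line like "Disallow: /www"
    for url in ("http://www.example.com/index.html",
                "http://www.example.com/www/page.html"):
        print(url,
              "substring:", blocked_by_substring(url, rules),
              "prefix:", blocked_by_prefix(url, rules))
```

With substring matching both URLs are rejected, because "/www" occurs inside "http://www."; with prefix matching only the second one, whose path really starts with "/www", is rejected.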
Jamie Anstice
Search Scientist
S.L.I. Systems
[EMAIL PROTECTED]
ph: 64 961 3262
mobile: 64 21 264 9347

