[htdig] More robots.txt stuff

Jamie Anstice Thu, 06 Dec 2001 14:35:35 -0800

I've been having a bit of a look at the robots.txt handling code, and
while it works in most situations, it's not quite perfect. The HTML 4.0
spec says that the Disallow lines in the robots.txt file should be
applied to the start of the URL's path. However, the current htdig code
matches at any point in the URL. While this is not usually a problem,
when you have a line like "Disallow: /www" it tends to stop htdig from
doing much at all, since any URL beginning with http://www contains that
string. Look for a non-patch in a day or so (it's a non-patch because my
source has rather diverged from the CVS version, and that's before I ran
everything through astyle to apply consistent formatting).
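
To illustrate the difference, here is a minimal sketch in Python. It is
not the htdig source (which is C++), and the function names and the
example.com URLs are made up for the example. It compares matching a
Disallow value anywhere in the URL with matching it only against the
start of the URL's path.

```python
from urllib.parse import urlsplit


def blocked_by_substring(url: str, disallow: str) -> bool:
    """Match the Disallow value anywhere in the full URL.

    This mimics the behaviour described above; it is an illustration,
    not the actual htdig code.
    """
    if not disallow:
        # An empty Disallow value means nothing is disallowed.
        return False
    return disallow in url


def blocked_by_prefix(url: str, disallow: str) -> bool:
    """Match the Disallow value only against the start of the URL's path."""
    if not disallow:
        return False
    path = urlsplit(url).path or "/"
    return path.startswith(disallow)


if __name__ == "__main__":
    rule = "/www"
    for url in (
        "http://www.example.com/docs/index.html",
        "http://www.example.com/www/page.html",
    ):
        print(url, blocked_by_substring(url, rule), blocked_by_prefix(url, rule))
```

With the rule "/www", the substring check rejects both URLs because
"http://www" itself contains "/www", while the prefix check rejects only
the second one, whose path really starts with /www.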


Jamie Anstice
Search Scientist
S.L.I. Systems
[EMAIL PROTECTED]
ph:  64 961 3262
mobile: 64 21 264 9347

_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a subject of
unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html
