Re: [htdig] More robots.txt stuff

Gilles Detillieux Fri, 07 Dec 2001 11:56:22 -0800

According to Jamie Anstice:
> I've been having a bit of a look at the robots.txt handling code, and
> while it works in most situations, it's not quite perfect. The HTML
> 4.0 spec says that the Disallow lines in the robots.txt file should be
> applied to the start of the URL's path.  However, the current htDig
> code matches at any point in the URL.  While this is not usually a
> problem, when you have a line like "Disallow: /www" this tends to stop
> htdig from doing much at all.  Look for a non-patch in a day or so
> (it's a non-patch because my source has rather diverged from the CVS
> version, and that's before I ran everything through astyle to apply
> consistent formatting).


Thanks for the bug report and the non-patch.  Looks like an easy fix.
I was concerned that the problem existed in 3.1.6 as well, but I've just
confirmed by carefully examining the code that 3.1.6 does the right thing.
The bug was introduced in 3.2 during the move to regex support for
robots.txt.
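
To make the difference concrete, here is a minimal Python sketch of the two
behaviours.  It is not the htdig code: the function names and the example
URL are made up for illustration, and it assumes plain string rules with no
wildcard or regex handling.

```python
from urllib.parse import urlsplit


def path_of(url):
    """Return the path component of a URL, defaulting to "/"."""
    return urlsplit(url).path or "/"


def disallowed_prefix(url, rules):
    """Prefix semantics: a rule blocks a URL only if the URL's path
    starts with the rule's value.  An empty rule blocks nothing."""
    path = path_of(url)
    return any(bool(rule) and path.startswith(rule) for rule in rules)


def disallowed_anywhere(url, rules):
    """The behaviour reported above: the rule matches at any point
    in the whole URL, host name included."""
    return any(bool(rule) and rule in url for rule in rules)


rules = ["/www"]  # from a line like "Disallow: /www"
url = "http://www.example.com/docs/index.html"  # hypothetical URL

print(disallowed_prefix(url, rules))    # False: path does not start with /www
print(disallowed_anywhere(url, rules))  # True: "/www" occurs inside "//www."
```

With matching anywhere in the URL, "/www" is found inside "http://www.",
so every page on a host named www.* would be excluded, which would explain
why htdig ends up doing very little.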

Of course, this raises the question of why regex support is even
being added for this.  Doesn't the original standard require exact
matches?  The FAQ states that wildcards are not supported.
(info.webcrawler.com seems to be gone, but the standard now appears to
be hosted at www.robotstxt.org.  We need to update our links!)

I did find a Draft version 2.0 standard for robots.txt, but it doesn't
clearly define the format to use for regular expressions and says that
pending discussion, version 2.0 semantics may not be implemented.  So,
why are we adding it, and are we even doing it properly?

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a 
subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html
