According to Jim Cole:
> On Friday, June 27, 2003, at 02:00 PM, Patrick Robinson wrote:
> > I just installed htdig-3.2.0b4-20030622, and discovered that it's not 
> > correctly handling Disallow: patterns from my robots.txt file.  (I'm 
> > hoping this is the correct list to post this!)
> >
> > I have these lines in my robots.txt:
> > User-agent: *
> > Disallow: /WebObjects/
> >
> > In my config file, I do NOT exclude /cgi-bin/ via exclude_urls.  
> > However, when I run rundig -vvv, it tells me that URLs like the following 
> > are rejected due to being "forbidden by server robots.txt":
> > href: http://www.mysite.edu/cgi-bin/WebObjects/blah/blah/blah
> 
> I am seeing the same behavior in the current CVS code. As currently 
> implemented, URLs are checked for any occurrence of the disallow 
> string, without regard to its location within the URL.
> 
> > This shouldn't happen.  It should only be rejecting URLs *starting* 
> > with "/WebObjects/" (at least, that's my interpretation of what I read 
> > at http://www.robotstxt.org/wc/norobots.html).
> 
> I agree that this behavior does not seem to match that specified by the 
> standard.

Correct.  This has been reported before, and possible solutions discussed,
but nobody followed through with implementing one.

> > I never had this problem in 3.1.6.  Has something changed?
> 
> I believe some of the related code changed with the introduction of new 
> regex support. As it currently stands, the code is comparing the 
> disallow against the full URL, rather than just the path, and it is not 
> anchoring the comparison.

Correct again.  Either anchoring the comparison or going back to using
StringMatch instead of Regex would solve it, but in either case you
must be sure you're always looking at only the path portion of the URL,
not the full URL as the 3.2 code does now.

> In case you want to give it a try, I am attaching a patch that seems to 
> correct the behavior of the robots code. I won't claim to have any deep 
> insight into this part of the code, so no guarantees and all of that.

The problem with that patch is that it seems to miss the case where
IsDisallowed is called by Server::push(); there it would end up checking
the full URL against the anchored patterns for the path, so you'd never
get a match.  Unless the tests in Retriever::IsValidURL() pre-screen all
cases before attempting a push(), I think some disallowed URLs could
slip through the cracks.

A more self-contained fix is below.  It sidesteps the whole issue by
making a regex pattern that can match the whole URL, so minimal code
changes are needed.  I don't know how efficient this ends up being,
though.  I also haven't tested this beyond making sure the full pattern
works in egrep, so please test this patch carefully before using.
I'll await feedback before committing it.

--- htdig/Server.cc.orig        2003-06-24 15:40:11.000000000 -0500
+++ htdig/Server.cc     2003-07-08 17:16:18.000000000 -0500
@@ -316,9 +316,13 @@ void Server::robotstxt(Document &doc)
            if (*rest)
            {
                if (pattern.length())
-                   pattern << '|' << rest;
-               else
-                   pattern = rest;
+                   pattern << '|';
+               while (*rest)
+               {
+                   if (strchr("^.[$()|*+?{\\", *rest))
+                       pattern << '\\';
+                   pattern << *rest++;
+               }
            }
        }
        //
@@ -332,7 +336,9 @@ void Server::robotstxt(Document &doc)
     if (debug > 1)
        cout << "Pattern: " << pattern << endl;
                
-    _disallow.set(pattern, config->Boolean("case_sensitive"));
+    String     fullpatt = "^[^:]*://[^/]*(";
+    fullpatt << pattern << ')';
+    _disallow.set(fullpatt, config->Boolean("case_sensitive"));
 }
 
 

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)


_______________________________________________
ht://Dig Developer mailing list:
[EMAIL PROTECTED]
List information (subscribe/unsubscribe, etc.)
https://lists.sourceforge.net/lists/listinfo/htdig-dev