On Sat, Oct 26, 2002 at 10:06:12PM -0500, Geoff Hutchison wrote:

> Instead, I'd suggest using the SourceForge bug tracker for ht://Dig
> http://sourceforge.net/tracker/?atid=104593&group_id=4593&func=browse
OK, I've tried to avoid it ;-) If this doesn't get resolved soon, I will
submit it.

> > I found it with wwwoffle cache indexing scripts. htdig 3.1.x worked
> > well but after upgrading to 3.2.0b4-072201 it broke. The cached
> > pages are under the "/search/index" directory and "/index" is
> > disallowed. You can see that 3.2.0b rejects "/search/index" in the
> > debug output:
>
> Yes. I can't see anything in particular that would have solved this
> in the meantime (which surprises me since I seem to remember this
> before). For my own benefit, could you confirm that it fails for you
> on the current snapshot?

Hm, I've got painfully slow and expensive dialup here (Czech Republic,
monopolistic phone operator ... you know), so I would like to avoid
downloading an extra 2 MB ...

Back to the topic - I've got the ht://Dig 3.2.0b4-072201 source code
here, and after a short look at the code I tried to fix it. See the
attachment and please review it, since I'm not too familiar with the
htdig code internals. This is just a quick hack, but it seems to be
working here. It has not been heavily tested, though.

By the way, I think that using regular expressions here is far too big a
hammer for this simple task, i.e. just testing whether one string is
equal to, or an extension of, another (see the examples after the patch
below). robots.txt is not defined to contain regular expressions, but
htdig handles Disallow lines as if they were regexps. Are you sure that
won't cause any problems if somebody puts some "weird" characters in one?

Thanks for your reply and have a nice day

--
Martin Mačok
http://underground.cz/
[EMAIL PROTECTED]
http://Xtrmntr.org/ORBman/
diff -urNp htdig-3.2.0b4-072201.orig/htdig/Retriever.cc htdig-3.2.0b4-072201/htdig/Retriever.cc
--- htdig-3.2.0b4-072201.orig/htdig/Retriever.cc  2001-07-08 09:14:01.000000000 +0200
+++ htdig-3.2.0b4-072201/htdig/Retriever.cc  2002-10-27 14:38:08.000000000 +0100
@@ -961,7 +961,7 @@ Retriever::IsValidURL(const String &u)
     //
     URL testURL((char*)url);
     Server *server = (Server *) servers[testURL.signature()];
-    if (server && server->IsDisallowed(url) != 0)
+    if (server && server->IsDisallowed(testURL.path()) != 0)
     {
         if (debug > 2)
             cout << endl << "   Rejected: forbidden by server robots.txt!";
Binary files htdig-3.2.0b4-072201.orig/htdig/.Retriever.cc.swp and htdig-3.2.0b4-072201/htdig/.Retriever.cc.swp differ
diff -urNp htdig-3.2.0b4-072201.orig/htdig/Server.cc htdig-3.2.0b4-072201/htdig/Server.cc
--- htdig-3.2.0b4-072201.orig/htdig/Server.cc  2001-05-20 09:13:51.000000000 +0200
+++ htdig-3.2.0b4-072201/htdig/Server.cc  2002-10-27 13:57:32.000000000 +0100
@@ -187,7 +187,7 @@ void Server::robotstxt(Document &doc)
     String  contents = doc.Contents();
     int     length;
     int     pay_attention = 0;
-    String  pattern;
+    String  pattern = "";
     String  myname = config->Find("server", _host.get(), "robotstxt_name");
     int     seen_myname = 0;
     char    *name, *rest;
@@ -277,9 +277,9 @@ void Server::robotstxt(Document &doc)
         if (*rest)
         {
             if (pattern.length())
-                pattern << '|' << rest;
+                pattern << "|^" << rest;
             else
-                pattern = rest;
+                pattern << '^' << rest;
         }
     }
     //
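
To make the failure mode concrete, here is a small standalone program
(this is not htdig code; it only assumes plain POSIX regcomp()/regexec()
matching, which is what an unanchored Disallow pattern amounts to). The
unanchored pattern "/index" matches anywhere inside "/search/index",
while the anchored "^/index" does not - which is why the patch both
prepends '^' to each Disallow value and matches against testURL.path()
rather than the full URL, so the anchor lines up with the start of the
path:

// demo.cc - unanchored vs. anchored matching of a Disallow value
#include <regex.h>
#include <cstdio>

static int matches(const char *pattern, const char *path)
{
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return 0;               // treat a bad pattern as "no match"
    int hit = (regexec(&re, path, 0, 0, 0) == 0);
    regfree(&re);
    return hit;
}

int main()
{
    const char *path = "/search/index";
    printf("\"/index\"  vs %s: %s\n", path,
           matches("/index", path)  ? "match (rejected)" : "no match");
    printf("\"^/index\" vs %s: %s\n", path,
           matches("^/index", path) ? "match (rejected)" : "no match");
    return 0;
}

Compiled with g++ and run, the first test reports a (wrong) match and
the second reports no match, i.e. "/search/index" is no longer rejected.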
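
And to illustrate the "too big a hammer" point: a minimal sketch of a
literal prefix test that could replace the regex entirely. Note that
is_disallowed and the 'disallowed' list are made-up names for
illustration, not htdig's actual Server interface:

// prefix_check.cc - sketch: Disallow values as literal path prefixes
#include <string>
#include <vector>

// 'disallowed' would hold the raw Disallow values collected from
// robots.txt for one server, e.g. "/index".
bool is_disallowed(const std::string &path,
                   const std::vector<std::string> &disallowed)
{
    for (std::vector<std::string>::size_type i = 0;
         i < disallowed.size(); ++i)
    {
        const std::string &prefix = disallowed[i];
        // true when path is equal to, or an extension of, the prefix
        if (!prefix.empty() &&
            path.compare(0, prefix.length(), prefix) == 0)
            return true;
    }
    return false;
}

With "/index" in the list, this returns true for "/index" and
"/index/foo" but false for "/search/index", and characters like '.',
'+' or '(' in a Disallow line can never be misread as regex operators.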