[aseek-devel] Another parse.cpp patch

Jens Thoms Toerring
Thu, 04 Sep 2003 06:13:36 -0700

Hi,

  I found a problem with the handling of robots.txt files in the
FindRobots() function in parse.cpp. Unless I completely misunderstand
the robot exclusion standard, an entry in the robots.txt
for http://www.foo.bar like

Disallow: /foo.html

should forbid robots from indexing the URL

http://www.foo.bar/foo.html

but *not*

http://www.foo.bar/xxx/yyy/foo.html

Unfortunately, this is what happens currently, because FindRobots()
searches for each robots.txt entry anywhere in the path instead of
comparing it only against the start of the path. Below is a patch to
rectify the problem.
(I also added a bit to get debug output in case of denied access
due to robots.txt.)
                                 Regards, Jens
-- 
 Freie Universitaet Berlin     Jens Thoms Toerring
 Universitaetsbibliothek
 Webteam                       Tel: 0049 30 838 56055
 Garystrasse 39                Fax: 0049 30 838 53738
 14195 Berlin                  e-mail: [EMAIL PROTECTED]


--- aspseek-orig/src/parse.cpp  2003-08-27 13:06:46.000000000 +0200
+++ aspseek-my/src/parse.cpp    2003-09-04 15:03:21.000000000 +0200
@@ -96,8 +96,10 @@
        sprintf(fpath, "%s%s", path, name);
        for (CStringVector::iterator s = v.begin(); s != v.end(); s++)
        {
-               if (strstr(fpath, s->c_str()))
+               if ( ! strncmp( fpath, s->c_str( ),s->length( ) ) )
                {
+                       logger.log( CAT_ALL, L_DEBUG, "Denying %s in %s (because of %s)\n",
+                                           fpath, host, s->c_str( ) );
                        return 1;
                }
        }
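
For readers who want to see the difference between the two checks in
isolation, here is a minimal standalone sketch. It is not ASPseek code:
the helper names matches_anywhere() and matches_prefix() are made up for
illustration, and a plain std::vector<std::string> stands in for the
CStringVector of Disallow entries used in FindRobots().

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: mimics the old check, which reports a match when
// a Disallow entry occurs anywhere inside the path.
bool matches_anywhere(const std::string& path,
                      const std::vector<std::string>& rules)
{
    for (const std::string& rule : rules)
        if (std::strstr(path.c_str(), rule.c_str()) != nullptr)
            return true;
    return false;
}

// Hypothetical helper: mimics the patched check, which reports a match
// only when the path starts with a Disallow entry.
bool matches_prefix(const std::string& path,
                    const std::vector<std::string>& rules)
{
    for (const std::string& rule : rules)
        if (std::strncmp(path.c_str(), rule.c_str(), rule.length()) == 0)
            return true;
    return false;
}

// Runs both checks on the two paths from the mail above.
void check_examples()
{
    const std::vector<std::string> rules = { "/foo.html" };

    // Both variants deny the page at the top level.
    assert(matches_anywhere("/foo.html", rules));
    assert(matches_prefix("/foo.html", rules));

    // Only the substring variant (wrongly) denies the nested page.
    assert(matches_anywhere("/xxx/yyy/foo.html", rules));
    assert(!matches_prefix("/xxx/yyy/foo.html", rules));
}
```

Calling check_examples() from a main() of your own runs the two cases
from the mail; the prefix variant lets /xxx/yyy/foo.html through while
still denying /foo.html.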