Jens Thoms Toerring
Thu, 04 Sep 2003 06:13:36 -0700
Hi,

I found a problem with the handling of robots.txt files in the
FindRobots() function in parse.cpp. Unless I completely misunderstand
the robot exclusion standard, an entry in the robots.txt for
http://www.foo.bar like

    Disallow: /foo.html

should forbid robots to index the URL

    http://www.foo.bar/foo.html

but *not*

    http://www.foo.bar/xxx/yyy/foo.html

Unfortunately, the second URL is currently excluded as well, because
FindRobots() uses strstr() and therefore matches a robots.txt entry
anywhere in the path instead of only at the start of the path.

Below is a patch to rectify the problem. (I also added a bit to get
debug output in case of denied access due to robots.txt.)

                                  Regards, Jens
-- 
Freie Universitaet Berlin          Jens Thoms Toerring
Universitaetsbibliothek Webteam    Tel: 0049 30 838 56055
Garystrasse 39                     Fax: 0049 30 838 53738
14195 Berlin                       e-mail: [EMAIL PROTECTED]

--- aspseek-orig/src/parse.cpp	2003-08-27 13:06:46.000000000 +0200
+++ aspseek-my/src/parse.cpp	2003-09-04 15:03:21.000000000 +0200
@@ -96,8 +96,10 @@
 	sprintf(fpath, "%s%s", path, name);
 	for (CStringVector::iterator s = v.begin(); s != v.end(); s++)
 	{
-		if (strstr(fpath, s->c_str()))
+		if ( ! strncmp( fpath, s->c_str( ),s->length( ) ) )
 		{
+			logger.log( CAT_ALL, L_DEBUG, "Denying %s in %s (because of %s)\n",
+				    fpath, host, s->c_str( ) );
 			return 1;
 		}
 	}
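
For anyone who wants to check the difference outside of aspseek, here is a minimal standalone sketch of the two matching strategies. The function names matches_anywhere() and matches_prefix() and the use of std::vector<std::string> are my own assumptions for illustration; they are not taken from parse.cpp, which uses CStringVector.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-in for the old check: a rule matches if it occurs
// anywhere inside the path (this is what strstr() does).
bool matches_anywhere(const std::string& path,
                      const std::vector<std::string>& rules)
{
    for (const std::string& rule : rules) {
        if (std::strstr(path.c_str(), rule.c_str()) != nullptr)
            return true;
    }
    return false;
}

// Hypothetical stand-in for the patched check: a rule matches only if
// the path starts with it (strncmp() over the length of the rule).
bool matches_prefix(const std::string& path,
                    const std::vector<std::string>& rules)
{
    for (const std::string& rule : rules) {
        if (std::strncmp(path.c_str(), rule.c_str(), rule.length()) == 0)
            return true;
    }
    return false;
}
```

With the single rule "/foo.html", both functions deny "/foo.html", but only matches_anywhere() also denies "/xxx/yyy/foo.html". A path shorter than the rule is not denied by matches_prefix(), since strncmp() stops at the terminating NUL of the shorter string.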