Hi,

Attached is a patch for two WWW::RobotRules bugs:

1. If I use <http://www.htmlhelp.com:80/robots.txt> as the robot_txt_uri,
WWW::RobotRules will not disallow access to
<http://www.htmlhelp.com/award/>, but it will disallow access to
<http://www.htmlhelp.com:80/award/>.  My patched version compares the host
and port instead of the raw authority string, so that both forms of the
/award/ URI are disallowed.
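To illustrate the mismatch, here is a small sketch using the same URI
module the patch relies on.  authority() keeps the port exactly as
written, while port() falls back to the scheme's default (80 for http)
when no port is given, so host:port yields the same key for both
spellings:

```perl
use strict;
use warnings;
use URI;

# Two spellings of the same resource: one with an explicit port.
my $with_port    = URI->new("http://www.htmlhelp.com:80/robots.txt");
my $without_port = URI->new("http://www.htmlhelp.com/robots.txt");

# authority() preserves the URI as written, so the hash keys differ:
print $with_port->authority,    "\n";   # www.htmlhelp.com:80
print $without_port->authority, "\n";   # www.htmlhelp.com

# host() . ":" . port() normalizes both to the same key:
print $with_port->host . ":" . $with_port->port,       "\n";   # www.htmlhelp.com:80
print $without_port->host . ":" . $without_port->port, "\n";   # www.htmlhelp.com:80
```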

2. If a robots.txt file contains

User-agent: WDG_SiteValidator
Disallow: /foo

and my robot uses

User-Agent: WDG_SiteValidator/1.2.5

then /foo is not disallowed.  The substring comparison in the is_me method
is looking for "WDG_SiteValidator/1.2.5" within "WDG_SiteValidator"
instead of the other way around.
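A minimal demonstration of the argument-order bug (the hypothetical
variable names stand in for $ua, the name from robots.txt, and $me, the
robot's own agent string): index(HAYSTACK, NEEDLE) searches for NEEDLE
inside HAYSTACK, so putting the short robots.txt name in the haystack
position can never match a longer agent string.

```perl
use strict;
use warnings;

my $rules_name = lc "WDG_SiteValidator";        # User-agent line in robots.txt
my $full_ua    = lc "WDG_SiteValidator/1.2.5";  # the robot's own agent string

# Buggy order: look for the long string inside the short one -- never found.
print index($rules_name, $full_ua), "\n";   # -1

# Fixed order: look for the robots.txt name inside the full agent string.
print index($full_ua, $rules_name), "\n";   # 0
```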

The patch is against

# $Id: RobotRules.pm,v 1.21 2000/04/07 20:17:54 gisle Exp $

-- 
Liam Quinn
--- RobotRules.pm.orig  Sat Apr 22 22:43:38 2000
+++ RobotRules.pm       Fri Apr 20 12:28:29 2001
@@ -83,7 +83,7 @@
 sub parse {
     my($self, $robot_txt_uri, $txt, $fresh_until) = @_;
     $robot_txt_uri = URI->new("$robot_txt_uri");
-    my $netloc = $robot_txt_uri->authority;
+    my $netloc = $robot_txt_uri->host . ":" . $robot_txt_uri->port;
 
     $self->clear_rules($netloc);
     $self->fresh_until($netloc, $fresh_until || (time + 365*24*3600));
@@ -173,7 +173,7 @@
 sub is_me {
     my($self, $ua) = @_;
     my $me = $self->agent;
-    return index(lc($ua), lc($me)) >= 0;
+    return index(lc($me), lc($ua)) >= 0;
 }
 
 =item $rules->allowed($uri)
@@ -185,7 +185,7 @@
 sub allowed {
     my($self, $uri) = @_;
     $uri = URI->new("$uri");
-    my $netloc = $uri->authority;
+    my $netloc = $uri->host . ":" . $uri->port;
 
     my $fresh_until = $self->fresh_until($netloc);
     return -1 if !defined($fresh_until) || $fresh_until < time;
