Hi,
Attached is a patch for two WWW::RobotRules bugs:
1. If I use <http://www.htmlhelp.com:80/robots.txt> as the robot_txt_uri,
WWW::RobotRules will not disallow access to
<http://www.htmlhelp.com/award/>, but it will disallow access to
<http://www.htmlhelp.com:80/award/>. My patched version compares the host
and port instead of the raw authority string, so that both /award/ URIs
are disallowed.
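For illustration only (this snippet is not part of the patch), the
mismatch is easy to see with the URI module: authority() keeps the ":80"
only where it was written explicitly, while port() falls back to the
scheme's default, so host:port compares equal for both spellings:

```perl
use strict;
use warnings;
use URI;

my $robots = URI->new("http://www.htmlhelp.com:80/robots.txt");
my $page   = URI->new("http://www.htmlhelp.com/award/");

# authority() preserves the explicit ":80", so the two strings differ:
print $robots->authority, "\n";   # www.htmlhelp.com:80
print $page->authority, "\n";     # www.htmlhelp.com

# host() . ":" . port() normalizes, since port() returns the default:
print $robots->host . ":" . $robots->port, "\n";  # www.htmlhelp.com:80
print $page->host . ":" . $page->port, "\n";      # www.htmlhelp.com:80
```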
2. If a robots.txt has

       User-agent: WDG_SiteValidator
       Disallow: /foo

   and my robot uses

       User-Agent: WDG_SiteValidator/1.2.5

then /foo is not disallowed. The substring comparison in the is_me method
looks for "WDG_SiteValidator/1.2.5" within "WDG_SiteValidator" instead of
the other way around.
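A quick illustration of the argument order (again not part of the patch;
the variables mirror those in is_me): index(STR, SUBSTR) searches for
SUBSTR inside STR, so the operands must be given in the right order:

```perl
use strict;
use warnings;

my $me = "WDG_SiteValidator/1.2.5";  # $self->agent -- my robot's full agent string
my $ua = "WDG_SiteValidator";        # name from the robots.txt User-agent line

# Old code: looks for the long agent string inside the short name -- never matches.
print index(lc($ua), lc($me)), "\n";   # -1
# Patched: looks for the robots.txt name inside my agent string -- matches at 0.
print index(lc($me), lc($ua)), "\n";   # 0
```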
The patch is against
# $Id: RobotRules.pm,v 1.21 2000/04/07 20:17:54 gisle Exp $
--
Liam Quinn
--- RobotRules.pm.orig	Sat Apr 22 22:43:38 2000
+++ RobotRules.pm	Fri Apr 20 12:28:29 2001
@@ -83,7 +83,7 @@
 sub parse {
     my($self, $robot_txt_uri, $txt, $fresh_until) = @_;
     $robot_txt_uri = URI->new("$robot_txt_uri");
-    my $netloc = $robot_txt_uri->authority;
+    my $netloc = $robot_txt_uri->host . ":" . $robot_txt_uri->port;
 
     $self->clear_rules($netloc);
     $self->fresh_until($netloc, $fresh_until || (time + 365*24*3600));
@@ -173,7 +173,7 @@
 sub is_me {
     my($self, $ua) = @_;
     my $me = $self->agent;
-    return index(lc($ua), lc($me)) >= 0;
+    return index(lc($me), lc($ua)) >= 0;
 }
 
 =item $rules->allowed($uri)
@@ -185,7 +185,7 @@
 sub allowed {
     my($self, $uri) = @_;
     $uri = URI->new("$uri");
-    my $netloc = $uri->authority;
+    my $netloc = $uri->host . ":" . $uri->port;
 
     my $fresh_until = $self->fresh_until($netloc);
     return -1 if !defined($fresh_until) || $fresh_until < time;