I recently came across something that didn't seem right to me. I'm using "WWW::RobotRules::AnyDBM_File", but the sample script below returns the same result.
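For completeness, this is roughly how the AnyDBM_File variant is set up on my end (the cache file name here is just a placeholder):

use WWW::RobotRules::AnyDBM_File;

# Same parse()/allowed() interface as WWW::RobotRules, but the rules
# are cached in a DBM file between runs.
my $rules = WWW::RobotRules::AnyDBM_File->new('MOMspider/1.0', 'robotrules.cache');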

The URL I tested is:
http://www.midwestoffroad.com/

The robots.txt reads:

User-agent: *
Disallow: admin.php
Disallow: error.php
Disallow: /admin/
Disallow: /images/
Disallow: /includes/
Disallow: /themes/
Disallow: /blocks/
Disallow: /modules/
Disallow: /language/
User-agent: Baidu
Disallow: /

RobotRules reports that the URL is denied by robots.txt, which should not be the case. A stripped-down script is:

use strict;
use warnings;

use WWW::RobotRules;
use LWP::Simple qw(get);

my $rules = WWW::RobotRules->new('MOMspider/1.0');

# Fetch and parse the live robots.txt.
my $url = "http://www.midwestoffroad.com/robots.txt";
my $robots_txt = get $url;
$rules->parse($url, $robots_txt) if defined $robots_txt;

if ($rules->allowed('http://www.midwestoffroad.com/')) {
  print qq!Allowed by robots.txt\n\n!;
} else {
  print qq!Denied by robots.txt\n\n!;
}
exit();

This prints out "Denied by robots.txt".
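In case it helps, here is a self-contained version with the same rules pasted inline, so it can be run without fetching anything; the extra test paths are just examples I picked:

use strict;
use warnings;
use WWW::RobotRules;

# Same rules as the live robots.txt, inlined so no network fetch is needed.
my $robots_txt = <<'ROBOTS';
User-agent: *
Disallow: admin.php
Disallow: error.php
Disallow: /admin/
Disallow: /images/
Disallow: /includes/
Disallow: /themes/
Disallow: /blocks/
Disallow: /modules/
Disallow: /language/
User-agent: Baidu
Disallow: /
ROBOTS

my $rules = WWW::RobotRules->new('MOMspider/1.0');
$rules->parse('http://www.midwestoffroad.com/robots.txt', $robots_txt);

# Check a few paths, not just the root.
for my $path ('/', '/index.php', '/admin/', '/admin.php') {
    my $url = "http://www.midwestoffroad.com$path";
    printf "%-40s %s\n", $url,
        $rules->allowed($url) ? 'allowed' : 'denied';
}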

Thanks