I'd like to use LWP::RobotUA in my software and get the protection it
provides, but I also want to be able to access all sorts of URLs,
including non-HTTP URLs.
Unfortunately, the interface to the WWW::RobotRules module means that any
time a non-HTTP URL is requested through LWP::RobotUA, it is handled in a
weird or occasionally broken way.
If the URL is, for example, an FTP URL or anything else that URI.pm treats
as a server URI, then the robot rules will be applied to it. If the URL is
one which doesn't support the URI host and port methods, then RobotRules
fails outright.
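To make the distinction concrete, here is a small illustration (assuming
stock URI.pm behaviour) of why the two cases differ: mailto URIs simply
don't provide the server-style accessors that WWW::RobotRules calls,
while ftp URIs do, so the rules end up being applied to them.

  use strict;
  use warnings;
  use URI;

  my $mail = URI->new('mailto:test@test');
  my $ftp  = URI->new('ftp://example.com/pub/file');

  # mailto URIs are not server-style URIs, so they lack the accessors
  # that WWW::RobotRules wants to call (host, port, path_query) ...
  for my $method (qw(host port path_query)) {
      printf "mailto %-10s %s\n", $method,
          $mail->can($method) ? 'supported' : 'NOT supported';
  }

  # ... while ftp URIs are server-style, so the robot rules are applied
  # even though no robots.txt will ever be fetched for an FTP server.
  printf "ftp    host       %s\n",
      $ftp->can('host') ? 'supported' : 'NOT supported';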
An example of what I am talking about can be seen by running
  perl -MLWP::RobotUA -MHTTP::Request -MData::Dumper -e \
    '$ua = LWP::RobotUA->new("tst", "tst");
     $rq = HTTP::Request->new(GET => "mailto:test\@test");
     $rs = $ua->request($rq);
     print Dumper($rs)'
which gives
Can't locate object method "path_query" via package "URI::mailto" at
/usr/lib/perl5/site_perl/5.6.0/WWW/RobotRules.pm line 193.
compared to
  perl -MLWP::UserAgent -MHTTP::Request -MData::Dumper -e \
    '$ua = LWP::UserAgent->new();
     $rq = HTTP::Request->new(GET => "mailto:test\@test");
     $rs = $ua->request($rq);
     print Dumper($rs)'
which works fine, giving a 400 response with the message "Library does
not allow method GET for 'mailto:' URLs".
In my opinion RobotRules / RobotUA should correctly handle any URL
whatsoever, possibly with the user able to configure whether or not
access is allowed. This would make it a much better drop-in
replacement for LWP::UserAgent.
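Purely as a hypothetical sketch of what I mean by configurable handling
(none of these options exist in LWP today), it could be as little as a
per-scheme policy on the robot agent:

  # Hypothetical interface only -- LWP::RobotUA has no such option today.
  use LWP::RobotUA;

  my $ua = LWP::RobotUA->new("tst", "tst");

  # One imaginable knob: what to do with URLs robots.txt cannot cover.
  # $ua->non_http_policy('allow');   # hand them straight to LWP::UserAgent
  # $ua->non_http_policy('deny');    # answer them with a 4xx response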
I wrote NoStopRobot, which I am using in my LinkController distribution;
it works around this simply by only passing specific URLs on to the
robot rules (sketched below), but I wonder whether this is the correct
solution?
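For reference, the essence of that workaround is just a subclass which
routes anything the robot rules can't sensibly cover straight to the
plain LWP::UserAgent code path. The following is only a rough sketch of
the idea (the real NoStopRobot differs in detail) and assumes that
LWP::RobotUA performs its rules check in simple_request:

  package My::NonStopRobotUA;   # illustrative name, not the real NoStopRobot
  use strict;
  use warnings;
  use base 'LWP::RobotUA';

  # Only let LWP::RobotUA (and hence WWW::RobotRules) see schemes that
  # robots.txt can actually govern; everything else goes straight to the
  # ordinary LWP::UserAgent request handling.
  sub simple_request {
      my ($self, $request, @args) = @_;
      my $scheme = $request->uri->scheme || '';
      return $self->SUPER::simple_request($request, @args)
          if $scheme eq 'http' || $scheme eq 'https';
      return $self->LWP::UserAgent::simple_request($request, @args);
  }

  1;

With something like that in place the mailto example above should get
the normal 400 "Library does not allow method GET" response instead of
dying inside RobotRules.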
Michael