I'd like to use LWP::RobotUA in my software and get the protection it
provides, but I also want to be able to access all sorts of URLs,
including non-HTTP URLs.

Unfortunately, the interface to the RobotRules module means that any
time a non-HTTP URL is requested through the RobotUA, it is handled in
a weird or occasionally broken way.

If the URL is, for example, an FTP URL or anything else that URI.pm
treats as a server URI, then the robot rules will be applied.  If the
URL is one whose URI class does not support the host and port methods,
then RobotRules fails outright.
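A quick way to see which URI classes provide those methods is to ask
the objects directly (just an illustration; the example.com addresses
are made-up placeholders):

  perl -MURI -e \
    'for my $u ("ftp://example.com/pub/file", "mailto:test\@example.com") {
         my $uri = URI->new($u);
         printf "%-32s host/port methods: %s\n", $u,
             ($uri->can("host") && $uri->can("port")) ? "yes" : "no";
     }'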

An example of what I am talking about can be seen by running

  perl -MLWP::RobotUA -MHTTP::Request -MData::Dumper -e \
    '$ua = LWP::RobotUA->new("tst", "tst");
     $rq = HTTP::Request->new(GET => "mailto:test\@test");
     $rs = $ua->request($rq);
     print Dumper($rs)'

which gives

Can't locate object method "path_query" via package "URI::mailto" at
/usr/lib/perl5/site_perl/5.6.0/WWW/RobotRules.pm line 193.

compared to

  perl -MLWP::UserAgent -MHTTP::Request -MData::Dumper -e \
    '$ua = LWP::UserAgent->new();
     $rq = HTTP::Request->new(GET => "mailto:test\@test");
     $rs = $ua->request($rq);
     print Dumper($rs)'

which works fine, giving a 400 response whose message is "Library does
not allow method GET for 'mailto:' URLs".

In my opinion RobotRules / RobotUA should handle any URL whatsoever
correctly, possibly letting the user configure whether or not access
is allowed.  That would make RobotUA a much better drop-in replacement
for UserAgent.
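For instance, the decision could be made in one place, with a
user-settable default for schemes that robots.txt cannot describe.
The following is only a sketch of that idea, not a tested patch:
allow_non_server_urls is a made-up option name, and the object is
assumed to carry a WWW::RobotRules instance in $self->{rules} the way
LWP::RobotUA does.

  use URI;

  # Sketch: decide whether a URL is allowed, consulting the robot
  # rules only for URLs that actually have a server part.
  sub allowed {
      my ($self, $url) = @_;
      my $uri = URI->new("$url");

      # Schemes with a host and port are the only ones robots.txt
      # can say anything about.
      return $self->{rules}->allowed($uri)
          if $uri->can('host') && $uri->can('port');

      # For everything else fall back to a configurable default,
      # allowing access unless the user said otherwise.
      return defined $self->{allow_non_server_urls}
          ? $self->{allow_non_server_urls}
          : 1;
  }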

I wrote NoStopRobot, which I am using in my LinkController
distribution; it works around this simply by passing only the URLs
that the robot rules can handle on to them (a simplified sketch
follows), but I wonder whether this is the correct solution.
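Roughly, the workaround amounts to a subclass along these lines.  This
is a cut-down sketch to show the idea, not the actual NoStopRobot code
shipped with LinkController:

  package NoStopRobot;
  use strict;
  use base 'LWP::RobotUA';

  # Pass only server-style URLs (those with a host and port) through
  # LWP::RobotUA's robot-rules handling; send everything else straight
  # to the plain LWP::UserAgent code, which already copes with them.
  sub simple_request {
      my ($self, $request, @rest) = @_;
      my $uri = $request->uri;

      return $self->SUPER::simple_request($request, @rest)
          if $uri->can('host') && $uri->can('port');

      return $self->LWP::UserAgent::simple_request($request, @rest);
  }

  1;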

   Michael
