I was running through the mail archives looking for a plug-in solution
for robots.txt handling.  It seems a few libraries exist; what is the
recommended way of plugging one into HttpClient and making sure it gets
invoked for direct and redirected requests (and any other permutation)?
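
To make the question concrete, the shape I had in mind is a request
interceptor, since (as far as I can tell) interceptors registered on
DefaultHttpClient run for every request it executes, including each hop
of a redirect it follows internally. The RobotRulesCache type below is
made up just to sketch the idea, not a real library class:

    import java.io.IOException;

    import org.apache.http.HttpException;
    import org.apache.http.HttpHost;
    import org.apache.http.HttpRequest;
    import org.apache.http.HttpRequestInterceptor;
    import org.apache.http.protocol.ExecutionContext;
    import org.apache.http.protocol.HttpContext;

    // Hypothetical cache of parsed robots.txt rules, keyed by
    // protocol+host+port.
    interface RobotRulesCache {
        boolean isAllowed(String url);
    }

    class RobotsTxtInterceptor implements HttpRequestInterceptor {

        private final RobotRulesCache rules;

        RobotsTxtInterceptor(RobotRulesCache rules) {
            this.rules = rules;
        }

        public void process(HttpRequest request, HttpContext context)
                throws HttpException, IOException {
            // The target host (scheme/host/port) lives in the execution
            // context; the request line may only carry a relative URI.
            HttpHost target = (HttpHost) context.getAttribute(
                    ExecutionContext.HTTP_TARGET_HOST);
            String url = target.toURI() + request.getRequestLine().getUri();
            if (!rules.isAllowed(url)) {
                // Throwing here aborts the fetch before any bytes go out.
                throw new IOException("Blocked by robots.txt: " + url);
            }
        }
    }

    // Wiring it up:
    //   DefaultHttpClient client = new DefaultHttpClient();
    //   client.addRequestInterceptor(new RobotsTxtInterceptor(cache));

Is that roughly the intended approach, or is there a better hook?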


Thanks in advance..

Josh



Ken Krugler wrote:

Hi Douglas,

On Feb 25, 2010, at 8:54am, Douglas Ferguson wrote:


Not sure if this is off topic or not, but does anybody have any
recommendations on respecting robots.txt when using HttpClient?

You could check out the SimpleRobotRules class in the Bixo project. This is
used in conjunction with the SimpleHttpClient class, which wraps HttpClient
4.0.

SimpleRobotRules parses the robots.txt file and generates a rule set that
can be used when filtering URLs.

It also extracts the crawl delay, if present.
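
In rough outline the flow is: fetch the robots.txt with HttpClient, hand
the raw bytes to the parser, then consult the resulting rules before
fetching anything else. The commented-out parse/isAllowed/getCrawlDelay
calls in this sketch are placeholders for whatever parser you pick, not
Bixo's actual signatures:

    import org.apache.http.HttpResponse;
    import org.apache.http.client.HttpClient;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.DefaultHttpClient;
    import org.apache.http.util.EntityUtils;

    public class RobotsFetchSketch {
        public static void main(String[] args) throws Exception {
            HttpClient client = new DefaultHttpClient();

            // robots.txt always lives at the root of protocol+host+port.
            HttpGet get = new HttpGet("http://example.com/robots.txt");
            HttpResponse response = client.execute(get);
            boolean missing = response.getStatusLine().getStatusCode() == 404;
            byte[] content = EntityUtils.toByteArray(response.getEntity());

            // A missing robots.txt is conventionally treated as allow-all;
            // otherwise hand the bytes to whichever parser you settle on.
            // Placeholder calls, not real Bixo signatures:
            // RobotRules rules = missing ? RobotRules.allowAll()
            //                            : parser.parse(content, "my-robot");
            // boolean ok = rules.isAllowed("http://example.com/some/page");
            // long delayMs = rules.getCrawlDelay();
        }
    }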


Actually applying this information correctly is a bit tricky. For
example, you need to look up the robots.txt file using
protocol+hostname+port (e.g. https://subdomain.domain.com:8000, as a
contrived example), but you want to apply the crawl delay per IP
address, in case multiple sub-domains (or even domains) resolve to the
same server.
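
A sketch of that split, with made-up names (none of this is Bixo code):
cache the parsed rules under scheme+host+port, but track the politeness
delay against the resolved IP address:

    import java.net.InetAddress;
    import java.net.URI;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class CrawlPolicyKeys {

        // robots.txt rules get cached per protocol+hostname+port...
        static String robotsKey(URI uri) {
            int port = uri.getPort();
            if (port == -1) {
                port = "https".equals(uri.getScheme()) ? 443 : 80;
            }
            return uri.getScheme() + "://" + uri.getHost() + ":" + port;
        }

        // ...while the crawl delay is tracked per resolved IP address, so
        // sub-domains that point at the same server share one budget.
        private final Map<String, Long> nextFetchByIp =
                new ConcurrentHashMap<String, Long>();

        long millisToWait(URI uri, long crawlDelayMs) throws Exception {
            String ip = InetAddress.getByName(uri.getHost()).getHostAddress();
            long now = System.currentTimeMillis();
            Long next = nextFetchByIp.get(ip);
            long wait = (next == null) ? 0 : Math.max(0, next - now);
            nextFetchByIp.put(ip, now + wait + crawlDelayMs);
            return wait;
        }
    }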

-- Ken


Another option may be the NoRobots parser from Apache Droids:
http://svn.apache.org/repos/asf/incubator/droids/trunk/droids-norobots/

Oleg
