Ken Krugler wrote:
Hi Douglas,

On Feb 25, 2010, at 8:54am, Douglas Ferguson wrote:

Not sure if this is off topic or not, but does anybody have any recommendations on respecting robots.txt when using HttpClient?

You could check out the SimpleRobotRules class in the Bixo project. This is used in conjunction with the SimpleHttpClient class, which wraps HttpClient 4.0.

SimpleRobotRules parses the robots.txt file and generates a rule set that can be used when filtering URLs.

It also extracts the crawl delay, if present.
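
A minimal sketch of how that might be wired up against raw HttpClient 4.0 follows; note that the SimpleRobotRules constructor and method names in the comments are assumptions on my part, so check the Bixo source for the actual API:

import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

public class RobotsCheckSketch {
    public static void main(String[] args) throws Exception {
        HttpClient client = new DefaultHttpClient();
        try {
            // Fetch the robots.txt for the target host (plain HttpClient 4.0).
            HttpGet get = new HttpGet("http://example.com/robots.txt");
            HttpResponse response = client.execute(get);
            byte[] content = EntityUtils.toByteArray(response.getEntity());

            // Hypothetical usage; the real Bixo signatures may differ:
            // SimpleRobotRules rules = new SimpleRobotRules("my-crawler", content);
            // if (rules.isAllowed("http://example.com/some/page.html")) {
            //     ... fetch the page, waiting rules.getCrawlDelay() between requests ...
            // }
            System.out.println(new String(content, "UTF-8"));
        } finally {
            client.getConnectionManager().shutdown();
        }
    }
}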

Actually applying this information correctly is a bit tricky. For example, you need to look up the robots.txt file using the protocol + hostname + port (e.g. https://subdomain.domain.com:8000, as a contrived example), but you want to apply the crawl delay per IP address, in case multiple sub-domains (or even domains) resolve to the same server.
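
Here's a rough sketch of those two lookups using plain JDK classes (no Bixo code involved); the URL is just the contrived example from above:

import java.net.InetAddress;
import java.net.URL;

public class RobotsKeys {

    // robots.txt has to be fetched per protocol + hostname + port combination.
    public static URL robotsTxtUrl(URL pageUrl) throws Exception {
        // getPort() returns -1 for "default port for this protocol",
        // which is fine to pass straight through.
        return new URL(pageUrl.getProtocol(), pageUrl.getHost(), pageUrl.getPort(), "/robots.txt");
    }

    // ...but the crawl delay should be keyed on the resolved IP address, so that
    // multiple sub-domains (or domains) hosted on one server share a single budget.
    public static String crawlDelayKey(URL pageUrl) throws Exception {
        return InetAddress.getByName(pageUrl.getHost()).getHostAddress();
    }

    public static void main(String[] args) throws Exception {
        URL url = new URL("https://subdomain.domain.com:8000/some/page.html");
        System.out.println(robotsTxtUrl(url)); // https://subdomain.domain.com:8000/robots.txt
        // crawlDelayKey(url) would throw UnknownHostException here, since the
        // hostname is contrived; swap in a real URL to see the shared IP key.
    }
}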

-- Ken


Another option may be the NoRobots parser from Apache Droids:

http://svn.apache.org/repos/asf/incubator/droids/trunk/droids-norobots/

Oleg
