Ken Krugler wrote:
Hi Douglas,
On Feb 25, 2010, at 8:54am, Douglas Ferguson wrote:
Not sure if this is off topic or not, but does anybody have any
recommendations on respecting robots.txt when using HttpClient?
You could check out the SimpleRobotRules class in the Bixo project. This
is used in conjunction with the SimpleHttpClient class, which wraps
HttpClient 4.0.
SimpleRobotRules parses the robots.txt file and generates a rule set
that can be used when filtering URLs.
It also extracts the crawl delay, if present.
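If it helps, here's a rough sketch of the fetch side, using plain HttpClient 4.0. The parsing step is left as a comment - check the Bixo source for the actual SimpleRobotRules API, since I'm writing this from memory:

import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

public class RobotsFetcher {

    // Fetches robots.txt and returns the raw bytes, which you'd then hand
    // to the parser (e.g. Bixo's SimpleRobotRules) to build the rule set.
    public byte[] fetchRobotsTxt(String robotsUrl) throws Exception {
        HttpClient client = new DefaultHttpClient();
        try {
            HttpGet get = new HttpGet(robotsUrl);
            HttpResponse response = client.execute(get);
            int status = response.getStatusLine().getStatusCode();
            if (status == 200) {
                return EntityUtils.toByteArray(response.getEntity());
            }

            // A 404 is typically treated as "allow everything"; a 5xx is
            // often treated as "disallow all" until a later re-fetch.
            return null;
        } finally {
            client.getConnectionManager().shutdown();
        }
    }
}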
Actually applying this information correctly is a bit tricky. For example,
you need to fetch the robots.txt file using the protocol+hostname+port
(e.g. https://subdomain.domain.com:8000, as a contrived example), but you
want to use the crawl delay to limit requests by IP address, in case
multiple sub-domains (or even domains) resolve to the same server.
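Concretely, something along these lines (plain java.net, nothing Bixo-specific): the rules lookup is keyed by protocol+host+port, and the crawl-delay throttling is keyed by the resolved IP.

import java.net.InetAddress;
import java.net.URL;

public class CrawlKeys {

    // Key for caching/looking up robots.txt rules: protocol + host + port.
    public static String robotsKey(String pageUrl) throws Exception {
        URL url = new URL(pageUrl);
        int port = (url.getPort() == -1) ? url.getDefaultPort() : url.getPort();
        return url.getProtocol() + "://" + url.getHost() + ":" + port;
    }

    // The robots.txt file itself lives at that same protocol+host+port.
    public static String robotsTxtUrl(String pageUrl) throws Exception {
        return robotsKey(pageUrl) + "/robots.txt";
    }

    // Key for crawl-delay throttling: the resolved IP address, so multiple
    // sub-domains pointing at one box share a single rate limit.
    public static String throttleKey(String pageUrl) throws Exception {
        URL url = new URL(pageUrl);
        return InetAddress.getByName(url.getHost()).getHostAddress();
    }
}

So https://a.domain.com:8000/page and https://b.domain.com:8000/page get separate robots.txt lookups, but if both hosts resolve to the same IP they share one crawl-delay bucket.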
-- Ken
Another option may be the NoRobots parser from Apache Droids
http://svn.apache.org/repos/asf/incubator/droids/trunk/droids-norobots/
Oleg