Ken Krugler wrote:
Hi Douglas,
On Feb 25, 2010, at 8:54am, Douglas Ferguson wrote:
Not sure if this is off topic or not, but does anybody have any
recommendations on respecting robots.txt when using HttpClient?
You could check out the SimpleRobotRules class in the Bixo project. This
is used in conjunction with the SimpleHttpClient class, which wraps
HttpClient 4.0.
SimpleRobotRules parses the robots.txt file and generates a rule set
that can be used when filtering URLs.
It also extracts the crawl delay, if present.
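If it helps, here's a rough sketch of the fetch side, using plain HttpClient 4.0. The parsing step is left as a comment - check the Bixo source for the actual SimpleRobotRules API, since I'm writing this from memory:

import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

public class RobotsFetcher {

    // Fetches robots.txt and returns the raw bytes, which you'd then hand
    // to the parser (e.g. Bixo's SimpleRobotRules) to build the rule set.
    public byte[] fetchRobotsTxt(String robotsUrl) throws Exception {
        HttpClient client = new DefaultHttpClient();
        try {
            HttpGet get = new HttpGet(robotsUrl);
            HttpResponse response = client.execute(get);
            int status = response.getStatusLine().getStatusCode();
            if (status == 200) {
                return EntityUtils.toByteArray(response.getEntity());
            }

            // A 404 is typically treated as "allow everything"; a 5xx is
            // often treated as "disallow all" until a later re-fetch.
            return null;
        } finally {
            client.getConnectionManager().shutdown();
        }
    }
}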
Actually applying this information correctly is a bit tricky. For example,
you need to fetch the robots.txt file using the protocol+hostname+port
(e.g. https://subdomain.domain.com:8000, as a contrived example), but you
want to use the crawl delay to limit requests by IP address, in case
multiple sub-domains (or even domains) resolve to the same server.
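Concretely, something along these lines (plain java.net, nothing Bixo-specific): the rules lookup is keyed by protocol+host+port, and the crawl-delay throttling is keyed by the resolved IP.

import java.net.InetAddress;
import java.net.URL;

public class CrawlKeys {

    // Key for caching/looking up robots.txt rules: protocol + host + port.
    public static String robotsKey(String pageUrl) throws Exception {
        URL url = new URL(pageUrl);
        int port = (url.getPort() == -1) ? url.getDefaultPort() : url.getPort();
        return url.getProtocol() + "://" + url.getHost() + ":" + port;
    }

    // The robots.txt file itself lives at that same protocol+host+port.
    public static String robotsTxtUrl(String pageUrl) throws Exception {
        return robotsKey(pageUrl) + "/robots.txt";
    }

    // Key for crawl-delay throttling: the resolved IP address, so multiple
    // sub-domains pointing at one box share a single rate limit.
    public static String throttleKey(String pageUrl) throws Exception {
        URL url = new URL(pageUrl);
        return InetAddress.getByName(url.getHost()).getHostAddress();
    }
}

So https://a.domain.com:8000/page and https://b.domain.com:8000/page get separate robots.txt lookups, but if both hosts resolve to the same IP they share one crawl-delay bucket.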
-- Ken
Another option may be the NoRobots parser from Apache Droids
http://svn.apache.org/repos/asf/incubator/droids/trunk/droids-norobots/
Oleg