I was running through the mail archives looking for a plug-in solution for robots.txt handling. It seems a few libraries exist; what is the recommended way of plugging one into HttpClient and making sure it gets invoked for direct and redirected requests (and any other permutation)?
Thanks in advance..

Josh

Ken Krugler wrote:

Hi Douglas,

On Feb 25, 2010, at 8:54am, Douglas Ferguson wrote:

> Not sure if this is off topic or not, but does anybody have any
> recommendations on respecting robots.txt when using HttpClient?

You could check out the SimpleRobotRules class in the Bixo project. It is used in conjunction with the SimpleHttpClient class, which wraps HttpClient 4.0.

SimpleRobotRules parses the robots.txt file and generates a rule set that can be used when filtering URLs. It also extracts the crawl delay, if present.

Applying this information correctly is a bit tricky. You need to look up the robots.txt file by protocol + hostname + port (e.g. https://subdomain.domain.com:8000, as a contrived example), but you want to apply the crawl delay per IP address, in case multiple sub-domains (or even domains) resolve to the same server.

-- Ken

Another option may be the NoRobots parser from Apache Droids:
http://svn.apache.org/repos/asf/incubator/droids/trunk/droids-norobots/

Oleg
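One way to wire this in, as a minimal sketch: register an HttpRequestInterceptor on the client so that every outgoing request is checked against cached robots rules keyed by protocol + hostname + port. Protocol interceptors are applied to each request the client sends, which in HttpClient 4.x should also cover the follow-up requests issued when redirects are handled, though that is worth verifying against the exact version in use. The RobotRules and RobotRulesFactory interfaces below are hypothetical placeholders for whichever parser you pick (Bixo's SimpleRobotRules, Droids' NoRobots, etc.), not real library classes.

import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.http.HttpException;
import org.apache.http.HttpHost;
import org.apache.http.HttpRequest;
import org.apache.http.HttpRequestInterceptor;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.protocol.ExecutionContext;
import org.apache.http.protocol.HttpContext;

// Hypothetical abstraction over whichever robots.txt parser you choose.
interface RobotRules {
    boolean isAllowed(String url);
    long getCrawlDelay();
}

// Hypothetical factory that fetches and parses <origin>/robots.txt.
interface RobotRulesFactory {
    RobotRules fetch(String origin) throws IOException;
}

public class RobotsInterceptor implements HttpRequestInterceptor {

    private final RobotRulesFactory factory;
    // Cache rules per protocol+host+port, since that is the scope of robots.txt.
    private final Map<String, RobotRules> cache =
            new ConcurrentHashMap<String, RobotRules>();

    public RobotsInterceptor(RobotRulesFactory factory) {
        this.factory = factory;
    }

    public void process(HttpRequest request, HttpContext context)
            throws HttpException, IOException {
        // Don't check the robots.txt fetch itself.
        String path = request.getRequestLine().getUri();
        if (path.endsWith("/robots.txt")) {
            return;
        }
        // The target host is in the execution context; the request line URI
        // may only be a relative path at this point in the chain.
        HttpHost target = (HttpHost) context.getAttribute(
                ExecutionContext.HTTP_TARGET_HOST);
        if (target == null) {
            return;
        }
        String origin = target.toURI();  // e.g. "http://subdomain.domain.com:8000"
        RobotRules rules = cache.get(origin);
        if (rules == null) {
            rules = factory.fetch(origin);
            cache.put(origin, rules);
        }
        String url = origin + path;
        if (!rules.isAllowed(url)) {
            // Aborts the exchange; this should surface to the caller as a
            // ClientProtocolException wrapping the HttpException.
            throw new HttpException("Blocked by robots.txt: " + url);
        }
    }
}

Usage would be roughly: client.addRequestInterceptor(new RobotsInterceptor(factory)) on a DefaultHttpClient. The factory should fetch robots.txt with a client that does not have this interceptor installed (or rely on the /robots.txt guard above), otherwise the lookup would block itself.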

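On Ken's point about applying the crawl delay per IP address rather than per host name, here is a rough sketch of the idea (assumed class and method names, nothing from Bixo itself): resolve the host, remember the last request time per resolved address, and sleep until the delay has elapsed.

import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.HashMap;
import java.util.Map;

// Several sub-domains (or domains) may resolve to the same server, so the
// crawl delay is tracked per resolved IP rather than per host name.
public class CrawlDelayThrottle {

    // Last request time (millis) per server IP address.
    private final Map<String, Long> lastRequest = new HashMap<String, Long>();

    public synchronized void waitForSlot(String hostname, long crawlDelayMillis)
            throws UnknownHostException, InterruptedException {
        String ip = InetAddress.getByName(hostname).getHostAddress();
        Long last = lastRequest.get(ip);
        if (last != null) {
            long wakeAt = last + crawlDelayMillis;
            long now = System.currentTimeMillis();
            if (wakeAt > now) {
                Thread.sleep(wakeAt - now);
            }
        }
        lastRequest.put(ip, System.currentTimeMillis());
    }
}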