On Fri, 2010-06-04 at 16:03 -0700, Josh Gordineer wrote:
> Was running through the mail archives looking for a plug-in solution
> for the robots.txt. Seems like there are some libraries that exist,
> what is the recommended way of plugging it into HttpClient and making
> sure it gets invoked for direct and redirected requests (and any other
> permutation)?
>
>
> Thanks in advance..
>
> Josh
>
Hi Josh,

Apache Droids (incubating) provides a robots.txt parser and can also
serve as an example of using HttpClient within a web crawler.

http://incubator.apache.org/droids/api/droids-norobots/
http://incubator.apache.org/droids/api/droids-core/

As far as direct requests go, if a resource is excluded by a rule in
robots.txt, the crawler is simply not meant to issue a request for that
resource. For redirected requests you could employ a custom redirect
strategy. I suspect, though, that it might be easier just to disable
automatic redirect handling and handle redirects manually.

Hope this helps

Oleg

>
>
> Ken Krugler wrote:
>
> Hi Douglas,
>
> On Feb 25, 2010, at 8:54am, Douglas Ferguson wrote:
>
>
> Not sure if this is off topic or not, does anybody have any recommendations
> on respecting robots.txt when using HttpClient?
>
> You could check out the SimpleRobotRules class in the Bixo project. This is
> used in conjunction with the SimpleHttpClient class, which wraps HttpClient
> 4.0.
>
> SimpleRobotRules parses the robots.txt file and generates a rule set that
> can be used when filtering URLs.
>
> It also extracts the crawl delay, if present.
>
>
> Actually applying this information correctly is a bit tricky - e.g. you need
> to look for the robots.txt file using the protocol+hostname+port (e.g.
> https://subdomain.domain.com:8000 as a contrived example), but you want to
> use the crawl delay to limit requests by IP address, in case multiple
> sub-domains (or even domains) resolve to the same server.
>
> -- Ken
>
>
> Another option may be the NoRobots parser from Apache Droids
> http://svn.apache.org/repos/asf/incubator/droids/trunk/droids-norobots/
>
> Oleg
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
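
For illustration, the approach suggested in the reply (turn off automatic redirect handling, follow redirects by hand, and consult the robots rules before every request) could look roughly like the sketch below. It is a minimal, library-free outline, not the Droids or Bixo API: `RobotsAwareFetch`, `Response`, `robotsUri` and `fetch` are hypothetical names, the `transport` function stands in for whatever issues a single HttpClient request with redirects disabled, and the `allowed` predicate stands in for a rule set produced by a robots.txt parser. Because a redirect may cross hosts, a real `allowed` would probably look up the rules per `robotsUri(...)` of each hop, following Ken's protocol+hostname+port remark.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

final class RobotsAwareFetch {

    /** Hypothetical response holder: status code plus Location header (null if absent). */
    static final class Response {
        final int status;
        final String location;
        Response(int status, String location) {
            this.status = status;
            this.location = location;
        }
    }

    /** robots.txt location for a target: same scheme, host and port, path /robots.txt. */
    static URI robotsUri(URI target) {
        String scheme = target.getScheme().toLowerCase();
        int port = target.getPort(); // -1 when the URI carries no explicit port
        String authority = target.getHost().toLowerCase() + (port == -1 ? "" : ":" + port);
        return URI.create(scheme + "://" + authority + "/robots.txt");
    }

    static boolean isRedirect(int status) {
        return status == 301 || status == 302 || status == 303 || status == 307 || status == 308;
    }

    /**
     * Follows redirects manually and checks the rules before each request.
     * Returns the URIs that were actually requested; a disallowed URI is never requested.
     */
    static List<URI> fetch(URI start, Predicate<URI> allowed,
                           Function<URI, Response> transport, int maxRedirects) {
        List<URI> requested = new ArrayList<>();
        URI current = start;
        for (int hop = 0; hop <= maxRedirects; hop++) {
            if (!allowed.test(current)) {
                break; // excluded by robots.txt: do not issue the request at all
            }
            requested.add(current);
            Response response = transport.apply(current);
            if (!isRedirect(response.status) || response.location == null) {
                break; // final response reached
            }
            current = current.resolve(response.location); // Location may be relative
        }
        return requested;
    }

    public static void main(String[] args) {
        // Fake transport (assumption): /a redirects to /b, /b redirects into /private/.
        Map<String, Response> canned = Map.of(
                "http://example.com/a", new Response(302, "/b"),
                "http://example.com/b", new Response(301, "http://example.com/private/c"));
        Function<URI, Response> transport =
                uri -> canned.getOrDefault(uri.toString(), new Response(200, null));
        // Stand-in rule set (assumption): everything under /private/ is disallowed.
        Predicate<URI> allowed = uri -> !uri.getPath().startsWith("/private/");

        System.out.println(robotsUri(URI.create("https://subdomain.domain.com:8000/some/page")));
        System.out.println(fetch(URI.create("http://example.com/a"), allowed, transport, 5));
    }
}
```

In this sketch the redirect into /private/ is never requested, which is the behaviour automatic redirect handling would not give you without a custom redirect strategy. Crawl-delay handling per IP address is deliberately left out.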
