On Fri, 2010-06-04 at 16:03 -0700, Josh Gordineer wrote:
> Was running through the mail archives looking for a plug-in solution
> for the robots.txt.  Seems like there are some libraries that exist,
> what is the recommended way of plugging it into HttpClient and making
> sure it gets invoked for direct and redirected requests (and any other
> permutation)?
> 
> 
> Thanks in advance..
> 
> Josh
> 

Hi Josh

Apache Droids (incubating) provides a robots.txt parser and can also
serve as an example of using HttpClient within a web crawler.

http://incubator.apache.org/droids/api/droids-norobots/
http://incubator.apache.org/droids/api/droids-core/

As far as direct requests go, if a resource is excluded by a rule in
robots.txt, the crawler is simply not meant to issue a request for that
resource. For redirected requests you could employ a custom redirect
strategy. I suspect, though, it might be easier just to disable automatic
redirect handling and handle redirects manually.
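
To make the manual approach a bit more concrete, here is a rough sketch in
plain Java. It is not HttpClient or Droids code: the RobotsAwareFetcher
class, the Response holder, the transport function and the map of Disallow
prefixes are all hypothetical names made up for illustration, and the rule
matching is deliberately simplified (prefix match only, no wildcards, no
Allow rules, no user-agent groups). It assumes absolute http/https URIs and
a transport that does not follow redirects itself. The idea is only that
every hop, including each redirect target, is checked against the rules for
its own scheme + host + port before a request is issued.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

/** Illustrative sketch only; none of these names come from HttpClient or Droids. */
final class RobotsAwareFetcher {

    /** Hypothetical response holder: a status code plus the Location header, if any. */
    static final class Response {
        final int status;
        final String location; // null unless the server sent a Location header

        Response(int status, String location) {
            this.status = status;
            this.location = location;
        }
    }

    /** Rules are keyed by scheme + host + port, the scope of a single robots.txt file. */
    static String originKey(URI uri) {
        int port = uri.getPort();
        if (port == -1) {
            // No explicit port: fall back to the default for the scheme.
            port = "https".equalsIgnoreCase(uri.getScheme()) ? 443 : 80;
        }
        return uri.getScheme().toLowerCase(Locale.ROOT) + "://"
                + uri.getHost().toLowerCase(Locale.ROOT) + ":" + port;
    }

    /** Simplified check: plain prefix matching against Disallow paths. */
    static boolean isAllowed(URI uri, Map<String, List<String>> disallowByOrigin) {
        String rawPath = uri.getRawPath();
        String path = (rawPath == null || rawPath.isEmpty()) ? "/" : rawPath;
        List<String> prefixes = disallowByOrigin.get(originKey(uri));
        if (prefixes == null) {
            return true; // no rules known for this origin
        }
        for (String prefix : prefixes) {
            if (!prefix.isEmpty() && path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    /**
     * Follows redirects by hand and returns the URIs that were actually requested.
     * Stops before requesting any URI (initial or redirect target) that is excluded.
     */
    static List<URI> fetch(URI start,
                           Map<String, List<String>> disallowByOrigin,
                           Function<URI, Response> transport,
                           int maxRedirects) {
        List<URI> requested = new ArrayList<>();
        URI current = start;
        for (int hop = 0; hop <= maxRedirects; hop++) {
            if (!isAllowed(current, disallowByOrigin)) {
                break; // excluded by robots.txt: do not issue the request
            }
            requested.add(current);
            Response response = transport.apply(current);
            boolean redirect = response.status == 301 || response.status == 302
                    || response.status == 303 || response.status == 307
                    || response.status == 308;
            if (!redirect || response.location == null) {
                break; // final response reached
            }
            // Location may be relative, so resolve it against the current URI.
            current = current.resolve(response.location);
        }
        return requested;
    }
}
```

In a real crawler the transport function would wrap an HttpClient instance
with automatic redirect handling switched off, and the prefix map would be
filled from whichever robots.txt parser you pick.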

Hope this helps

Oleg    



> 
> 
> Ken Krugler wrote:
> 
> Hi Douglas,
> 
> On Feb 25, 2010, at 8:54am, Douglas Ferguson wrote:
> 
> 
> Not sure if this is off topic or not, does anybody have any recommendations
> on respecting robots.txt when using HttpClient?
> 
> You could check out the SimpleRobotRules class in the Bixo project. This is
> used in conjunction with the SimpleHttpClient class, which wraps HttpClient
> 4.0.
> 
> SimpleRobotRules parses the robots.txt file and generates a rule set that
> can be used when filtering URLs.
> 
> It also extracts the crawl delay, if present.
> 
> 
> Actually applying this information correctly is a bit tricky - e.g. you need
> to look for the robots.txt file using the protocol+hostname+port (e.g.
> https://subdomain.domain.com:8000 as a contrived example), but you want to
> use the crawl delay to limit requests by IP address, in case multiple
> sub-domains (or even domains) resolve to the same server.
> 
> -- Ken
> 
> 
> Another option may be the NoRobots parser from Apache Droids
> http://svn.apache.org/repos/asf/incubator/droids/trunk/droids-norobots/
> 
> Oleg
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]


