On Sat, 2008-11-15 at 14:07 +0100, Oleg Kalnichevski wrote:
> Folks,
> 
> What is the reason for robots exclusion processing being handled at the
> protocol level rather than at the droid level? Presently HttpProtocol
> attempts to retrieve robot.txt for _each_ and _every_ URI it is
> processing and then discards robot.txt rules when finished. This does
> not sound right to me. Am I missing something? 

The principal reason was that robots.txt was originally designed to be
read over HTTP.

http://www.robotstxt.org/orig.html

However, as you point out, that is not the most efficient way to use it.

> I can't help thinking robots.txt processing belongs to the Droid level.
> CrawlingDroid should retrieve robot.txt once at the beginning of the run
> and then re-use it for all subsequent requests for the same URI space.

+1

> It should maintain a notion of a session and cache robots.txt rules for
> all URIs outside the initial URI space for the same run. At the same
> time HttpProtocol should remain stateless (should not maintain any state
> information that could interfere with individual sessions)
> 
> What do you think?

That sounds awesome, Oleg, since we can reduce the processing time for
robots.txt quite a lot.
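
The idea above could look roughly like this. A minimal sketch, assuming a per-host cache that lives for the length of a crawl run; the names (RobotsCache, RobotRules, fetchAndParse) are illustrative only, not the Droids API:

```java
import java.net.URI;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: fetch and parse robots.txt at most once per host for the whole
// run, instead of once per URI, and keep the protocol layer stateless.
class RobotsCache {

    // Minimal stand-in for a parsed robots.txt rule set.
    static final class RobotRules {
        final List<String> disallowedPrefixes;
        RobotRules(List<String> disallowedPrefixes) {
            this.disallowedPrefixes = disallowedPrefixes;
        }
        boolean isAllowed(String path) {
            return disallowedPrefixes.stream().noneMatch(path::startsWith);
        }
    }

    private final Map<String, RobotRules> cache = new ConcurrentHashMap<>();

    boolean isAllowed(URI uri) {
        String host = uri.getScheme() + "://" + uri.getAuthority();
        // computeIfAbsent: robots.txt is fetched only on first contact
        // with a host; every later URI on that host hits the cache.
        RobotRules rules = cache.computeIfAbsent(host, this::fetchAndParse);
        return rules.isAllowed(uri.getPath());
    }

    private RobotRules fetchAndParse(String hostBase) {
        // A real crawler would GET hostBase + "/robots.txt" here;
        // a canned rule set keeps the sketch runnable.
        return new RobotRules(List.of("/private"));
    }

    public static void main(String[] args) {
        RobotsCache cache = new RobotsCache();
        System.out.println(cache.isAllowed(URI.create("http://example.org/index.html")));
        System.out.println(cache.isAllowed(URI.create("http://example.org/private/a")));
    }
}
```

Holding the map in the droid (not in HttpProtocol) keeps the protocol stateless, as Oleg suggests.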

It makes me wonder whether norobots is not really just another filter,
like the url-regex filter for outlinks. We already have TaskValidator;
maybe it makes sense to have a Validator interface as the super type.

salu2

> Oleg
> 
> 
-- 
Thorsten Scherler <thorsten.at.apache.org>
Open Source Java <consulting, training and solutions>
