On Sat, 2008-11-15 at 14:07 +0100, Oleg Kalnichevski wrote:
> Folks,
>
> What is the reason for robots exclusion processing being handled at the
> protocol level rather than at the droid level? Presently HttpProtocol
> attempts to retrieve robots.txt for _each_ and _every_ URI it is
> processing and then discards the robots.txt rules when finished. This
> does not sound right to me. Am I missing something?
The principal reason was that robots.txt was originally designed to be
retrieved over HTTP: http://www.robotstxt.org/orig.html

However, as you point out, that is not the most efficient way to use it.

> I can't help thinking robots.txt processing belongs at the Droid level.
> CrawlingDroid should retrieve robots.txt once at the beginning of the run
> and then re-use it for all subsequent requests for the same URI space.

+1

> It should maintain a notion of a session and cache robots.txt rules for
> all URIs outside the initial URI space for the same run. At the same
> time HttpProtocol should remain stateless (should not maintain any state
> information that could interfere with individual sessions).
>
> What do you think?

That sounds awesome, Oleg, since we can reduce the processing time for
robots.txt quite a lot. It makes me wonder whether Norbert is not just
another filter, like the url-regex filter for outlinks. We already have
TaskValidator; maybe it makes sense to have a Validator interface as its
super.

salu2

> Oleg
>
> --
Thorsten Scherler <thorsten.at.apache.org>
Open Source Java <consulting, training and solutions>
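To make the proposal concrete, here is a minimal sketch of what droid-level
caching could look like: rules are fetched once per host and reused for every
URI in the same run, so HttpProtocol stays stateless. All names here
(RobotRules, RobotRulesCache, fetchRules) are illustrative, not the actual
Droids API, and the fetch is stubbed out rather than doing real HTTP.

```java
import java.net.URI;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical parsed robots.txt rules for a single host. */
class RobotRules {
    private final List<String> disallowed;

    RobotRules(List<String> disallowed) {
        this.disallowed = disallowed;
    }

    boolean isAllowed(URI uri) {
        for (String prefix : disallowed) {
            if (uri.getPath().startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}

/** Per-run cache: one robots.txt fetch per host, reused for all URIs. */
class RobotRulesCache {
    private final Map<String, RobotRules> cache = new ConcurrentHashMap<>();
    private int fetches = 0;

    RobotRules rulesFor(URI uri) {
        // Fetch lazily on first sight of a host; later URIs hit the cache.
        return cache.computeIfAbsent(uri.getHost(), this::fetchRules);
    }

    // Stand-in for retrieving and parsing http://<host>/robots.txt.
    private RobotRules fetchRules(String host) {
        fetches++;
        return new RobotRules(List.of("/private/"));
    }

    int fetchCount() {
        return fetches;
    }
}

public class RobotsCacheDemo {
    public static void main(String[] args) throws Exception {
        RobotRulesCache cache = new RobotRulesCache();
        URI a = new URI("http://example.org/index.html");
        URI b = new URI("http://example.org/private/secret.html");
        System.out.println(cache.rulesFor(a).isAllowed(a));
        System.out.println(cache.rulesFor(b).isAllowed(b));
        System.out.println(cache.fetchCount());
    }
}
```

The point of the sketch is the ownership split: the cache lives in the droid
(session scope), while the protocol layer only asks "is this URI allowed?"
and never holds state of its own.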
