[ https://issues.apache.org/jira/browse/DROIDS-105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935490#action_12935490 ]
Paul Rogalinski commented on DROIDS-105:
----------------------------------------

About real-life cache sizes: I am using an LRU map here, so in theory a cache size of 2 should already be sufficient to prevent frequent hits on the robots.txt file (one robots request per potential URL request). I don't think many applications will benefit from cache sizes beyond 100 when crawling the web, since the TaskQueue will usually take care of filtering already visited URLs. So while the caching implementation offers an easy and generic solution to the described problem, it should actually only be necessary to cache the robots.txt request per unique host (a minimal sketch of such a per-host cache follows at the end of this message).

About the DNS-to-IP JVM caching - I see your point here ... somewhat. If long crawls become a problem due to excessive caching, the implementing side should make use of:

java.security.Security.setProperty("networkaddress.cache.ttl", TTL);

A TTL mechanism might also become necessary for the URL-to-content cache when very large LRU cache sizes are used (with a large enough cache, the robots.txt of a particular domain never gets evicted). Setting the cache size to a value around 100, heck, even 100,000 if you are about to crawl 10,000,000 sites or more (I am), should not become an issue though. But I still agree, the implementing side should be very aware of those implications.

I was also thinking about implementing the cache on a lower level, such as in the HttpClient itself, but that would be a bit more challenging and complicated to implement. The proposed solution above, on the other hand, was good enough for my requirements.

> missing caching for robots.txt
> ------------------------------
>
>                 Key: DROIDS-105
>                 URL: https://issues.apache.org/jira/browse/DROIDS-105
>             Project: Droids
>          Issue Type: Improvement
>          Components: core
>            Reporter: Paul Rogalinski
>         Attachments: Caching-Support-and-Robots_txt-fix.patch, CachingContentLoader.java
>
> The current implementation of the HttpClient will not cache any requests to the robots.txt file. When using the CrawlingWorker this results in 2 requests to the robots.txt (HEAD + GET) per crawled URL, so when crawling 3 URLs the target server would receive 6 requests for the robots.txt.
> Unfortunately the contentLoader is made final in HttpProtocol, so there is no possibility to replace it with a caching protocol like the one you'll find in the attachment.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
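For illustration only - this is not the attached CachingContentLoader.java or the patch, and all names below (RobotsTxtCache, get, put, maxEntries) are made up for this sketch. It just shows the per-host LRU idea described in the comment above, using an access-ordered java.util.LinkedHashMap that evicts the least recently used host once the configured size is exceeded:

import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of a per-host LRU cache for robots.txt bodies. */
public class RobotsTxtCache {

    private final Map<String, byte[]> cache;

    public RobotsTxtCache(final int maxEntries) {
        // An access-ordered LinkedHashMap acts as an LRU map when the
        // eldest entry is evicted once the configured size is exceeded.
        this.cache = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(final Map.Entry<String, byte[]> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Returns the cached robots.txt body for the given host, or null on a miss. */
    public synchronized byte[] get(final String host) {
        return cache.get(host);
    }

    /** Stores the robots.txt body fetched from the given host. */
    public synchronized void put(final String host, final byte[] robotsTxt) {
        cache.put(host, robotsTxt);
    }
}

Because the key is the host rather than the full URL, even a small maximum size keeps repeated requests to the same server from re-fetching robots.txt, which is the effect described in the comment.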