[ https://issues.apache.org/jira/browse/DROIDS-105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935490#action_12935490 ]

Paul Rogalinski commented on DROIDS-105:
----------------------------------------

About real-life cache sizes: 

I am using an LRU map here, so in theory a cache size of 2 should be sufficient 
to prevent frequent hits on the robots.txt file (one robots request per 
potential URL request). I don't think many applications will benefit from cache 
sizes beyond 100 when crawling the web; the TaskQueue will usually take care of 
filtering already-visited URLs. 

So, while the caching implementation offers an easy and generic solution to the 
described problem, strictly speaking it is only necessary to cache the 
robots.txt request once per unique host. 
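
To make that concrete, here is a minimal sketch of a per-host LRU cache (class 
and field names are illustrative assumptions, not code from the attached patch), 
built on an access-ordered LinkedHashMap:

import java.util.LinkedHashMap;
import java.util.Map;

// Minimal per-host LRU cache for raw robots.txt bodies. The map is
// access-ordered, so the least recently used host gets evicted once
// the configured capacity is exceeded.
public class RobotsTxtCache {

    private final Map<String, String> cache;

    public RobotsTxtCache(final int maxEntries) {
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > maxEntries;
            }
        };
    }

    public synchronized String get(String host) {
        return cache.get(host);
    }

    public synchronized void put(String host, String robotsTxtBody) {
        cache.put(host, robotsTxtBody);
    }
}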

About the DNS-to-IP caching in the JVM - I see your point here ... somewhat. If 
long crawls become a problem due to excessive caching, the implementing side 
should make use of:

// TTL is the DNS cache lifetime in seconds, passed as a String, e.g. "60"
java.security.Security.setProperty("networkaddress.cache.ttl", TTL);

A TTL mechanism might also become necessary for the URL-to-content cache when 
very large LRU cache sizes are used (with those, the robots.txt of a particular 
domain never gets removed from the cache). Setting the cache size to a value 
around 100, heck, even 100,000 if you are about to crawl 10,000,000 sites or 
more (I am), should not become an issue though. But I still agree, the 
implementing side should be very aware of those implications. 
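
As a rough sketch of such a TTL mechanism (the entry class and field names are 
assumptions for illustration, not part of the patch), cached values could carry 
an expiry timestamp that is checked on lookup:

// Illustrative TTL wrapper for cached bodies; an expired entry is treated
// as a cache miss, evicted, and re-fetched by the caller.
class TtlEntry {

    final String body;
    final long expiresAtMillis;

    TtlEntry(String body, long ttlMillis) {
        this.body = body;
        this.expiresAtMillis = System.currentTimeMillis() + ttlMillis;
    }

    boolean isExpired() {
        return System.currentTimeMillis() > expiresAtMillis;
    }
}

Storing TtlEntry values instead of plain strings in the LRU map keeps eviction 
in one place while still bounding how stale a cached robots.txt can get.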

I was also thinking about implementing caching at a lower level, such as in the 
HttpClient, which would be a bit more challenging and complicated to implement. 
The solution proposed above was, on the other hand, good enough for my 
requirements.
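
For reference, the general shape of such a caching content loader - purely a 
hypothetical decorator for illustration, not the code from the attachment, and 
the Loader interface below only stands in for whatever the real ContentLoader 
contract looks like - might be:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Stand-in for the real content-loader contract.
interface Loader {
    InputStream load(URI uri) throws IOException;
}

// Decorator that serves repeated requests for the same URI from memory
// and only delegates to the wrapped loader on a cache miss.
class CachingLoader implements Loader {

    private final Loader delegate;
    private final Map<URI, byte[]> cache = new ConcurrentHashMap<URI, byte[]>();

    CachingLoader(Loader delegate) {
        this.delegate = delegate;
    }

    public InputStream load(URI uri) throws IOException {
        byte[] body = cache.get(uri);
        if (body == null) {
            body = toBytes(delegate.load(uri));
            cache.put(uri, body);
        }
        return new ByteArrayInputStream(body);
    }

    private static byte[] toBytes(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
        }
        in.close();
        return out.toByteArray();
    }
}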

> missing caching for robots.txt
> ------------------------------
>
>                 Key: DROIDS-105
>                 URL: https://issues.apache.org/jira/browse/DROIDS-105
>             Project: Droids
>          Issue Type: Improvement
>          Components: core
>            Reporter: Paul Rogalinski
>         Attachments: Caching-Support-and-Robots_txt-fix.patch, 
> CachingContentLoader.java
>
>
> The current implementation of the HttpClient will not cache any requests to 
> the robots.txt file. When using the CrawlingWorker, this results in 2 
> requests to the robots.txt (HEAD + GET) per crawled URL, so crawling 3 
> URLs would hit the target server with 6 requests for the robots.txt.
> Unfortunately the contentLoader is made final in HttpProtocol, so there is no 
> way to replace it with a caching Protocol like the one you'll find 
> in the attachment.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
