jnioche opened a new pull request, #1854:
URL: https://github.com/apache/stormcrawler/pull/1854

   On large crawls with OpenSearch as a backend, the StatusUpdaterBolt has been 
observed to become the bottleneck after a while. This happens because 
DISCOVERED URLs miss the cache once it is full. They are then sent to 
OpenSearch, which can trigger a conflict and cause the whole batch to be 
resent. This adds to the traffic to OpenSearch and also means that the tuples 
are not acked successfully.
   
   Instead of specifying an arbitrary max size for the cache, this PR makes use 
of `softValues` so that entries are removed from the cache if memory is 
needed. 
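
   To illustrate what soft values mean in practice, here is a minimal, 
stdlib-only sketch. It is not the code in this PR, which relies on the cache 
builder's `softValues` option. The class name `SoftValueCache` and its methods 
are hypothetical and exist only for this example. Values are held through 
`SoftReference`, so the garbage collector may clear them when memory runs low 
instead of the cache enforcing a fixed maximum size.

```java
import java.lang.ref.ReferenceQueue;
import java.lang.ref.SoftReference;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical illustration only: a cache whose values are softly referenced. */
final class SoftValueCache<K, V> {

    /** Soft reference that remembers its key so the stale map entry can be removed. */
    private static final class Entry<K, V> extends SoftReference<V> {
        final K key;

        Entry(K key, V value, ReferenceQueue<V> queue) {
            super(value, queue);
            this.key = key;
        }
    }

    private final ConcurrentHashMap<K, Entry<K, V>> map = new ConcurrentHashMap<>();
    // The garbage collector enqueues references here after clearing them.
    private final ReferenceQueue<V> queue = new ReferenceQueue<>();

    void put(K key, V value) {
        purge();
        map.put(key, new Entry<>(key, value, queue));
    }

    /** Returns the cached value, or null if absent or already reclaimed. */
    V getIfPresent(K key) {
        purge();
        Entry<K, V> entry = map.get(key);
        return entry == null ? null : entry.get();
    }

    int size() {
        purge();
        return map.size();
    }

    /** Drops map entries whose values were cleared by the garbage collector. */
    @SuppressWarnings("unchecked")
    private void purge() {
        Object ref;
        while ((ref = queue.poll()) != null) {
            Entry<K, V> stale = (Entry<K, V>) ref;
            // Remove only if the key still maps to this exact (cleared) reference.
            map.remove(stale.key, stale);
        }
    }
}
```

   With such a cache, a lookup like `cache.getIfPresent(url)` returns `null` 
both for URLs never seen and for entries reclaimed under memory pressure, so 
callers have to treat a miss as "unknown" rather than "new".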

