jnioche opened a new pull request, #1854: URL: https://github.com/apache/stormcrawler/pull/1854
It has been observed on large crawls with OpenSearch as a backend that the StatusUpdaterBolt becomes the bottleneck after a while. The cause is DISCOVERED URLs that miss the cache because it is full. They get sent to OpenSearch, which can trigger a conflict, and the whole batch is then resent. This adds to the traffic to OpenSearch and also means that the tuples are not acked successfully. Instead of specifying an arbitrary max size for the cache, this PR makes use of `softValues` so that entries are removed from the cache if memory is needed.
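
To illustrate the idea behind `softValues`, here is a minimal sketch that uses only the JDK. It is not the code in this PR: `SoftValueCache` and its methods are hypothetical names made up for the example, and it assumes `softValues` refers to the cache builder option of that name (as offered by Caffeine and Guava), which wraps values in soft references. The cache has no size limit; the garbage collector is free to clear values when memory runs low, and cleared entries are dropped on the next access.

```java
import java.lang.ref.ReferenceQueue;
import java.lang.ref.SoftReference;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for a cache whose values are softly referenced. */
class SoftValueCache<K, V> {

    /** Soft reference that remembers its key so the map entry can be removed later. */
    private static final class Entry<K, V> extends SoftReference<V> {
        final K key;

        Entry(K key, V value, ReferenceQueue<V> queue) {
            super(value, queue);
            this.key = key;
        }
    }

    private final ConcurrentHashMap<K, Entry<K, V>> map = new ConcurrentHashMap<>();
    private final ReferenceQueue<V> queue = new ReferenceQueue<>();

    /** Stores the value behind a soft reference; no maximum size is enforced. */
    void put(K key, V value) {
        purge();
        map.put(key, new Entry<>(key, value, queue));
    }

    /** Returns the cached value, or null if it is absent or was cleared by the GC. */
    V getIfPresent(K key) {
        purge();
        Entry<K, V> entry = map.get(key);
        return entry == null ? null : entry.get();
    }

    /** Number of entries still held after dropping cleared ones. */
    int size() {
        purge();
        return map.size();
    }

    /** Removes map entries whose values the GC has already cleared. */
    @SuppressWarnings("unchecked")
    private void purge() {
        Entry<K, V> stale;
        while ((stale = (Entry<K, V>) queue.poll()) != null) {
            // Remove only if the key still maps to this exact (stale) reference.
            map.remove(stale.key, stale);
        }
    }
}
```

In a status updater, such a cache would be consulted before sending a DISCOVERED URL to the backend: a hit means the URL is already known and the update can be skipped. The trade-off is that eviction is driven by memory pressure rather than by a configured size, so the number of entries retained depends on the heap available to the worker.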
