[ https://issues.apache.org/jira/browse/SOLR-13790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16941213#comment-16941213 ]

Andrzej Bialecki commented on SOLR-13790:
-----------------------------------------

This patch is a work in progress. It fixes the error described above, and it 
also tries to fix a fundamental problem in LRUStatsCache: as it stands, it 
always sends a request to fetch stats (adding a round-trip to every query), 
even for repeated queries, which defeats the point of LRU caching.

Changes in this patch:
* consistently use the shard name instead of the full list of shard URLs as 
the caching key, both in SolrCloud mode and in standalone distributed mode
* optimized serialization of stats in order to minimize the size of data and to 
prevent serialization errors when terms contain separators or url-unsafe 
characters
* added SolrCloud unit tests, which still need much improvement
* added some logic in LRUStatsCache that tries to avoid sending a stats request 
if all global data is already available in the cache. This part is a little 
shaky, but I don't have a better idea at the moment for how to address this 
problem. Basically, it rewrites the query locally to see whether any stats 
are missing and need to be fetched. However, the answer "none" is not 100% 
fool-proof, because a query may be rewritten differently depending on the 
terms and fields available in the local vs. the remote index. The code tries 
to correct this after the fact by detecting missing global stats and forcing 
a fetch+cache of the missing stats with the next request.
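
To illustrate why the choice of caching key matters, here is a minimal 
standalone sketch. It is not code from the patch. It assumes the {{shard}} 
value is a '|'-separated list of replica URLs in arbitrary order, and it 
uses sorting as a stand-in for extracting the shard name (the patch itself 
uses the shard name). The class name, the helper {{canonicalKey}} and the 
URLs are all hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class ShardKeyDemo {

    // Hypothetical helper (not part of Solr): builds an order-independent key
    // by sorting the '|'-separated replica URLs. The actual patch uses the
    // shard name as the key; sorting is only a stand-in for this sketch.
    static String canonicalKey(String shardParam) {
        String[] replicas = shardParam.split("\\|");
        Arrays.sort(replicas);
        return String.join("|", replicas);
    }

    public static void main(String[] args) {
        // Made-up replica URLs for a single shard.
        String a = "http://h1:8983/solr/c_shard1_replica_n1/";
        String b = "http://h2:8983/solr/c_shard1_replica_n2/";
        String c = "http://h3:8983/solr/c_shard1_replica_n3/";

        // All 3! = 6 orderings in which the same replicas could be listed.
        String[] orderings = {
            a + "|" + b + "|" + c, a + "|" + c + "|" + b,
            b + "|" + a + "|" + c, b + "|" + c + "|" + a,
            c + "|" + a + "|" + b, c + "|" + b + "|" + a
        };

        // Using the raw string as the key yields one cache entry per ordering.
        Set<String> rawKeys = new HashSet<>(Arrays.asList(orderings));

        // Using a canonical key collapses every ordering into one entry.
        Set<String> canonicalKeys = new HashSet<>();
        for (String ordering : orderings) {
            canonicalKeys.add(canonicalKey(ordering));
        }

        System.out.println("raw keys: " + rawKeys.size());
        System.out.println("canonical keys: " + canonicalKeys.size());
    }
}
```

With three replicas the raw strings already produce six distinct keys for 
one shard, while the order-independent key produces one. The number of raw 
keys grows factorially with the replica count.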

> LRUStatsCache size explosion
> ----------------------------
>
>                 Key: SOLR-13790
>                 URL: https://issues.apache.org/jira/browse/SOLR-13790
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>    Affects Versions: 7.7.2, 8.2, 8.3
>            Reporter: Andrzej Bialecki
>            Assignee: Andrzej Bialecki
>            Priority: Critical
>             Fix For: 7.7.3, 8.3
>
>         Attachments: SOLR-13790.patch
>
>
> On a sizeable cluster with multi-shard multi-replica collections, when 
> {{LRUStatsCache}} was in use we encountered excessive memory usage, which 
> consequently led to severe performance problems.
> On closer examination of the heap dumps it became apparent that when 
> {{LRUStatsCache.addToPerShardTermStats}} is called it creates instances of 
> {{FastLRUCache}} using the passed {{shard}} argument - however, the value of 
> this argument is not a simple shard name but instead it's a randomly ordered 
> list of ALL replica URLs for this shard.
> As a result, due to the combinatorial number of possible keys, over time the 
> map in {{LRUStatsCache.perShardTermStats}} grew to contain ~2 million entries...
> The fix seems to be simply to extract the shard name and cache using this 
> name instead of the full string value of the {{shard}} parameter. Existing 
> unit tests also need much improvement.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

