[ https://issues.apache.org/jira/browse/SOLR-12743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16756284#comment-16756284 ]

Michael Gibney commented on SOLR-12743:
---------------------------------------

The patch just attached is a shot in the dark (I can't directly reproduce this 
problem). But I think it's probably a good patch either way, because:

I _was_ able to induce some weird behavior by artificially simulating a 
high-turnover cache environment (lots of inserts) while simultaneously 
executing the {{get[Oldest|Latest]AccessedItems()}} methods on 
{{ConcurrentLRUCache}}. This is akin to what happens to an old cache that 
remains under heavy load (lots of inserts) while a new cache/searcher is being 
warmed (queries for autowarm are retrieved from the old cache via the 
{{get[Oldest|Latest]AccessedItems()}} methods).
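
A rough sketch of that simulation (not the exact code I ran; the two-argument 
{{ConcurrentLRUCache}} constructor and the generics here are written from memory 
and may need adjusting against the actual class):

{code:java}
import java.util.Random;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

import org.apache.solr.util.ConcurrentLRUCache;

// Rough sketch of the simulation; constructor and signatures are from memory.
public class CacheContentionSketch {
  public static void main(String[] args) throws InterruptedException {
    ConcurrentLRUCache<Integer, Integer> cache = new ConcurrentLRUCache<>(10_000, 9_000);
    AtomicBoolean stop = new AtomicBoolean(false);

    // High-turnover inserts: puts past the upper watermark keep triggering markAndSweep().
    for (int t = 0; t < 8; t++) {
      Thread inserter = new Thread(() -> {
        Random rnd = new Random();
        while (!stop.get()) {
          cache.put(rnd.nextInt(), 1);
        }
      });
      inserter.setDaemon(true);
      inserter.start();
    }

    Thread.sleep(1000); // let the cache fill well past its watermarks

    // What autowarming does against the "old" cache: fetch the most recently accessed entries.
    long start = System.nanoTime();
    cache.getLatestAccessedItems(512);
    System.out.println("getLatestAccessedItems(512) took "
        + TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start) + " ms");
    stop.set(true);
  }
}
{code}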

The heart of the issue I observed is that the {{get[Oldest|Latest]AccessedItems()}} 
methods use {{ReentrantLock.lock()}}, while {{markAndSweep()}} (which cleans up 
overflow entries) uses {{ReentrantLock.tryLock()}}. The latter is evidently much 
quicker to (re)acquire the lock, and by design it does not respect the 
{{fairness=true}} setting on {{markAndSweepLock}}. So I was able to create a 
situation where, with heavy enough turnover, {{markAndSweep()}} was called 
frequently enough that it monopolized the lock, starving 
{{get[Oldest|Latest]AccessedItems()}}.
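
To make the barging concrete, here is a minimal, standalone sketch (plain JDK 
code, not Solr code; thread counts and hold times are arbitrary): a fair 
{{ReentrantLock}} hammered by {{tryLock()}} callers (standing in for 
{{markAndSweep()}}) can starve a thread waiting in {{lock()}} (standing in for 
{{get[Oldest|Latest]AccessedItems()}}):

{code:java}
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantLock;

// Minimal sketch (not Solr code) of tryLock() barging past a fair lock() caller.
public class TryLockStarvationSketch {
  public static void main(String[] args) throws InterruptedException {
    final ReentrantLock markAndSweepLock = new ReentrantLock(true); // fairness=true
    final AtomicBoolean stop = new AtomicBoolean(false);

    // "markAndSweep" threads: grab the lock opportunistically, hold it briefly, repeat.
    for (int i = 0; i < 4; i++) {
      Thread sweeper = new Thread(() -> {
        while (!stop.get()) {
          if (markAndSweepLock.tryLock()) { // barges; ignores fairness by design
            try {
              busyWork();                   // stand-in for sweeping overflow entries
            } finally {
              markAndSweepLock.unlock();
            }
          }
        }
      });
      sweeper.setDaemon(true);
      sweeper.start();
    }

    // "autowarm" thread: a plain lock() call that queues fairly, but can be starved
    // because each release is usually re-grabbed by a barging tryLock() first.
    final long start = System.nanoTime();
    Thread warmer = new Thread(() -> {
      markAndSweepLock.lock();
      try {
        System.out.println("lock() acquired after "
            + TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start) + " ms");
      } finally {
        markAndSweepLock.unlock();
      }
    });
    warmer.start();
    warmer.join(10_000); // give it 10 seconds; depending on timing it may still be waiting
    System.out.println("warmer still waiting? " + warmer.isAlive());
    stop.set(true);
  }

  private static void busyWork() {
    long x = 0;
    for (int i = 0; i < 100_000; i++) { x += i; }
    if (x == 42) System.out.println(); // keep the loop from being optimized away
  }
}
{code}

On my understanding of the lock implementation, each {{unlock()}} does wake the 
queued {{lock()}} caller, but a looping {{tryLock()}} will usually re-acquire the 
lock before that thread gets scheduled, so the wait can persist for as long as 
the barging continues.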

FWIW, I noticed that the official solr docker image moved from using openjdk 8 
to openjdk 11 in the version interval that seems to have triggered this issue.

I realize that this might fall short as an explanation for this issue, because 
the line of reasoning I'm following here would suggest that autowarming should 
block (and never complete), which should \(?\) trigger "Overlapping onDeckSearcher" 
warnings. Also, it seems unlikely (though certainly not impossible) that load 
could be sustained consistently enough to permanently monopolize the lock.

Re: autowarming ... earlier comments are ambiguous with respect to autowarm 
counts. _If_ the underlying issue is lock contention, then the _exact_ autowarm 
count should not matter, but I would expect that _disabling_ autowarm (setting 
it to 0) would in fact be an effective workaround (see the config sketch below).
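
For reference, the workaround would be something like the following in 
{{solrconfig.xml}} (cache sizes here are placeholders, not recommendations):

{code:xml}
<!-- Workaround sketch: autowarmCount="0" means the new searcher never calls
     get[Oldest|Latest]AccessedItems() on the old cache during warming. -->
<filterCache class="solr.FastLRUCache"
             size="512"
             initialSize="512"
             autowarmCount="0"/>
{code}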

> Memory leak introduced in Solr 7.3.0
> ------------------------------------
>
>                 Key: SOLR-12743
>                 URL: https://issues.apache.org/jira/browse/SOLR-12743
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>    Affects Versions: 7.3, 7.3.1, 7.4
>            Reporter: Tomás Fernández Löbbe
>            Priority: Critical
>         Attachments: SOLR-12743.patch
>
>
> Reported initially by [~markus17]([1], [2]), but other users have had the 
> same issue [3]. Some of the key parts:
> {noformat}
> Some facts:
> * problem started after upgrading from 7.2.1 to 7.3.0;
> * it occurs only in our main text search collection, all other collections 
> are unaffected;
> * despite what i said earlier, it is so far unreproducible outside 
> production, even when mimicking production as good as we can;
> * SortedIntDocSet instances and ConcurrentLRUCache$CacheEntry instances are 
> both leaked on commit;
> * filterCache is enabled using FastLRUCache;
> * filter queries are simple field:value using strings, and three filter query 
> for time range using [NOW/DAY TO NOW+1DAY/DAY] syntax for 'today', 'last 
> week' and 'last month', but rarely used;
> * reloading the core manually frees OldGen;
> * custom URP's don't cause the problem, disabling them doesn't solve it;
> * the collection uses custom extensions for QueryComponent and 
> QueryElevationComponent, ExtendedDismaxQParser and MoreLikeThisQParser, a 
> whole bunch of TokenFilters, and several DocTransformers and due it being 
> only reproducible on production, i really cannot switch these back to 
> Solr/Lucene versions;
> * useFilterForSortedQuery is/was not defined in schema so it was default 
> (true?), SOLR-11769 could be the culprit, i disabled it just now only for the 
> node running 7.4.0, rest of collection runs 7.2.1;
> {noformat}
> {noformat}
> You were right, it was leaking exactly one SolrIndexSearcher instance on each 
> commit. 
> {noformat}
> And from Björn Häuser ([3]):
> {noformat}
> Problem Suspect 1
> 91 instances of "org.apache.solr.search.SolrIndexSearcher", loaded by 
> "org.eclipse.jetty.webapp.WebAppClassLoader @ 0x6807d1048" occupy 
> 1.981.148.336 (38,26%) bytes. 
> Biggest instances:
>         • org.apache.solr.search.SolrIndexSearcher @ 0x6ffd47ea8 - 70.087.272 
> (1,35%) bytes. 
>         • org.apache.solr.search.SolrIndexSearcher @ 0x79ea9c040 - 65.678.264 
> (1,27%) bytes. 
>         • org.apache.solr.search.SolrIndexSearcher @ 0x6855ad680 - 63.050.600 
> (1,22%) bytes. 
> Problem Suspect 2
> 223 instances of "org.apache.solr.util.ConcurrentLRUCache", loaded by 
> "org.eclipse.jetty.webapp.WebAppClassLoader @ 0x6807d1048" occupy 
> 1.373.110.208 (26,52%) bytes. 
> {noformat}
> More details in the email threads.
> [1] 
> [http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201804.mbox/%3Czarafa.5ae201c6.2f85.218a781d795b07b1%40mail1.ams.nl.openindex.io%3E]
>  [2] 
> [http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201806.mbox/%3Czarafa.5b351537.7b8c.647ddc93059f68eb%40mail1.ams.nl.openindex.io%3E]
>  [3] 
> [http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201809.mbox/%3c7b5e78c6-8cf6-42ee-8d28-872230ded...@gmail.com%3E]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
