[jira] [Comment Edited] (SOLR-12743) Memory leak introduced in Solr 7.3.0

2019-02-04 Thread Markus Jelsma (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16759790#comment-16759790
 ] 

Markus Jelsma edited comment on SOLR-12743 at 2/4/19 11:37 AM:
---

Hello all,

Because I can only reproduce it on production, I have a limited number of 
tries per day: it takes over an hour to test a minor change, and more when I 
need to revert. Here are some new notes:

* it doesn't "appear" to be caused by the metrics part: I took out everything 
inside initializeMetrics(), but the leak persisted;
* I swapped FastLRUCache for LFUCache with otherwise identical settings, and 
the node ran OOM within minutes, even before the commit got issued;
* no idea what happened, but because Solr can run OOM for no clear reason, I 
restarted and tried again, and *this time the otherwise leaking reference is 
collected as it should*!

So I finally see a "stable" 7.6 with LFUCache instead of FastLRUCache. To be 
clear, FastLRUCache does work without leaking, but only with autowarmCount 
set to zero.

I have no idea what is going on with the warming. The warming code is almost 
identical, and I can't see how a SolrIndexSearcher instance would leak with 
FastLRUCache but not with LFUCache. The CacheRegenerator is not leaking the 
reference, nor does the calling code in SolrCore seem to be the problem.
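
To make the "reference is collected as it should" check concrete, here is a minimal sketch of how a lingering strong reference keeps an object reachable. It is not Solr code: FakeSearcher, the cache map and stillReachable() are hypothetical stand-ins, and System.gc() is only a hint to the JVM, so the second result is typical rather than guaranteed.

```java
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;

public class SearcherLeakSketch {

    /** Hypothetical stand-in for a searcher; NOT Solr's SolrIndexSearcher. */
    static final class FakeSearcher {
        final String name;
        FakeSearcher(String name) { this.name = name; }
    }

    /** Hypothetical stand-in for a cache whose entries keep a searcher alive. */
    static final Map<String, FakeSearcher> cache = new HashMap<>();

    /** Requests a few GC cycles and reports whether the referent is still reachable. */
    static boolean stillReachable(WeakReference<?> ref) {
        for (int i = 0; i < 10 && ref.get() != null; i++) {
            System.gc(); // only a hint, so ask several times
        }
        return ref.get() != null;
    }

    public static void main(String[] args) {
        FakeSearcher old = new FakeSearcher("searcher-1");
        WeakReference<FakeSearcher> probe = new WeakReference<>(old);

        cache.put("fq:type:a", old); // the cache entry holds a strong reference
        old = null;                  // the owner drops its own reference

        // While the entry exists, the object cannot be collected.
        System.out.println("reachable while cached: " + stillReachable(probe));

        cache.clear();               // drop the last strong reference
        // Usually unreachable now, but GC timing is up to the JVM.
        System.out.println("reachable after clear:  " + stillReachable(probe));
    }
}
```

In a heap dump the same question is answered by following the path from the leaked instance back to its GC root.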

I'll keep this single node on 7.6 for now and keep an eye on it.

Thanks!




> Memory leak introduced in Solr 7.3.0
> 
>
> Key: SOLR-12743
> URL: https://issues.apache.org/jira/browse/SOLR-12743
> Project: Solr
> Issue Type: Bug
> Security Level: Public (Default Security Level. Issues are Public)
> Affects Versions: 7.3, 7.3.1, 7.4
> Reporter: Tomás Fernández Löbbe
> Priority: Critical
> Attachments: SOLR-12743.patch
>
>
> Reported initially by [~markus17]([1], [2]), but other users have had the 
> same issue [3]. Some of the key parts:
> {noformat}
> Some facts:
> * problem started after upgrading from 7.2.1 to 7.3.0;
> * it occurs only in our main text search collection, all other collections 
> are unaffected;
> * despite what I said earlier, it is so far unreproducible outside 
> production, even when mimicking production as well as we can;
> * SortedIntDocSet instances and ConcurrentLRUCache$CacheEntry instances are 
> both leaked on commit;
> * filterCache is enabled using FastLRUCache;
> * filter queries are simple field:value using strings, plus three filter 
> queries for time ranges using [NOW/DAY TO NOW+1DAY/DAY] syntax for 'today', 
> 'last week' and 'last month', but these are rarely used;
> * reloading the core manually frees OldGen;
> * custom URPs don't cause the problem, disabling them doesn't solve it;
> * the collection uses custom extensions for QueryComponent and 
> QueryElevationComponent, ExtendedDismaxQParser and MoreLikeThisQParser, a 
> whole bunch of TokenFilters, and several DocTransformers, and due to it being 
> only reproducible on production, I really cannot switch these back to 
> Solr/Lucene versions;
> * useFilterForSortedQuery is/was not defined in schema so it was default 
> (true?), SOLR-11769 could be the culprit, I disabled it just now only for the 
> node running 7.4.0, rest of collection runs 7.2.1;
> {noformat}
> {noformat}
> You were right, it was leaking exactly one SolrIndexSearcher instance on each 
> commit. 
> {noformat}
> And from Björn Häuser ([3]):
> {noformat}
> Problem Suspect 1
> 91 instances of "org.apache.solr.search.SolrIndexSearcher", loaded by 
> "org.eclipse.jetty.webapp.WebAppClassLoader @ 0x6807d1048" occupy 
> 1.981.148.336 (38,26%) bytes. 
> Biggest instances:
>         • 

[jira] [Comment Edited] (SOLR-12743) Memory leak introduced in Solr 7.3.0

2019-01-31 Thread Michael Gibney (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-12743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16757467#comment-16757467
 ] 

Michael Gibney edited comment on SOLR-12743 at 1/31/19 4:39 PM:


Ah, ok; so I guess looking for "overlapping onDeckSearcher" in logs is not 
productive.

[~markus17], thanks for the extra information! A few more questions/thoughts:
 # Does a thread dump provide any useful information? e.g., if an autowarm (or 
other) thread is blocked somewhere?
 # When the problem manifests, is the service running under load heavy enough 
that inserts/cleanup _could_ potentially monopolize a lock?
 # What are your {{autoCommit}} (and {{autoSoftCommit}}, {{commitWithin}}, 
etc.) settings? Are you also running manual commits?
 # Looking only at the code in {{SolrCore}}, it looks like the only way to get 
"PERFORMANCE WARNING: Overlapping onDeckSearchers" errors in your log is to 
have {{maxWarmingSearchers}} set to > 1. You could try setting this to "2" ... 
it's unlikely to hurt (in fact, unlikely to make a difference, per [~dsmiley]) 
– but there's a remote chance it could provide useful feedback.
 # I see you earlier noted that it's normal that two {{SolrIndexSearchers}} 
should coexist immediately after a commit; so just to clarify, when you say it 
"immediately" leaks a {{SolrIndexSearcher}} instance, you mean it's hanging 
around longer than it should ...
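
For the thread-dump question in the list above, the following is a minimal sketch of listing blocked threads with the standard java.lang.management API. It only inspects the JVM it runs in; the class and method names (BlockedThreadReport, blockedThreads) are made up for illustration. For a running Solr node, an external dump such as {{jstack <pid>}} gives the same kind of information.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

public class BlockedThreadReport {

    /** Returns one summary line per thread of this JVM that is currently BLOCKED. */
    static List<String> blockedThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<String> lines = new ArrayList<>();
        // Only request monitor/synchronizer details when the JVM supports them.
        ThreadInfo[] infos = mx.dumpAllThreads(
                mx.isObjectMonitorUsageSupported(),
                mx.isSynchronizerUsageSupported());
        for (ThreadInfo info : infos) {
            if (info.getThreadState() == Thread.State.BLOCKED) {
                lines.add(info.getThreadName()
                        + " blocked on " + info.getLockName()
                        + " held by " + info.getLockOwnerName());
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String> blocked = blockedThreads();
        if (blocked.isEmpty()) {
            System.out.println("no blocked threads");
        } else {
            blocked.forEach(System.out::println);
        }
    }
}
```

A thread that stays blocked on the same lock across several dumps, with the same owner, is the pattern worth looking for.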


