[jira] [Updated] (SOLR-5986) Don't allow runaway queries from harming Solr cluster health or search performance

Steve Davids (JIRA) Mon, 16 Jun 2014 15:10:26 -0700

     [ https://issues.apache.org/jira/browse/SOLR-5986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Steve Davids updated SOLR-5986:
-------------------------------

    Description: 
The intent of this ticket is to have all distributed search requests stop 
wasting CPU cycles on requests that have already timed out or are so 
complicated that they won't be able to execute. We have come across a case 
where a nasty wildcard query within a proximity clause was causing the cluster 
to enumerate terms for hours even though the query timeout was set to minutes. 
This caused a noticeable slowdown within the system, which made us restart the 
replicas that happened to service that one request. In the worst case, users 
with a relatively low zk timeout value will have nodes start dropping from the 
cluster due to long GC pauses.

[~amccurry] built a mechanism into Apache Blur to help with this issue in 
BLUR-142 (see the commit comment for code, though look at the latest code on 
trunk for newer bug fixes).

Solr should be able to either prevent these problematic queries from running by 
some heuristic (possibly estimated size of heap usage) or be able to execute a 
thread interrupt on all query threads once the time threshold is met. This 
issue mirrors what others have discussed on the mailing list: 
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200903.mbox/%[email protected]%3E

  was:
The intent of this ticket is to have all distributed search requests stop 
wasting CPU cycles on requests that have already timed out. We have come across 
a case where a nasty wildcard query within a proximity clause was causing the 
cluster to enumerate terms for hours even though the query timeout was set to 
minutes. This caused a noticeable slowdown within the system which made us 
restart the replicas that happened to service that one request.

[~amccurry] Built a mechanism into Apache Blur to help with the issue in 
BLUR-142 (see commit comment for code, though look at the latest code on the 
trunk for newer bug fixes).

Ideally Solr will distribute the timeout request parameter and automatically 
interrupt all query threads once the threshold is met. This issue mirrors what 
others have discussed on the mailing list: 
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200903.mbox/%[email protected]%3E

        Summary: Don't allow runaway queries from harming Solr cluster health 
or search performance  (was: When a query times out all distributed searches 
shouldn't continue on until completion)

As a follow-up, we are still running into this problem, and it is becoming 
more and more frequent. Upon further research, it looks like this is a somewhat 
common problem that afflicts various Lucene community members. As noted in the 
description, Apache Blur has implemented a mechanism for coping, and more 
recently Elasticsearch has implemented its own solution, which performs an 
up-front query heap estimation and trips a "circuit breaker" if the estimate 
exceeds a threshold, thus not allowing the query to crash the cluster.

Documentation: 
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-fielddata.html#fielddata-circuit-breaker
Tickets: https://github.com/elasticsearch/elasticsearch/issues/2929 and 
https://github.com/elasticsearch/elasticsearch/pull/4261
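
To make the circuit-breaker idea concrete, here is a minimal sketch in plain 
Java. It is not Solr or Elasticsearch code: the class name HeapCircuitBreaker 
and its methods are hypothetical, and it assumes the caller can produce some 
heap estimate for a query before running it (how to compute that estimate is 
the hard part and is not shown).

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical sketch of an up-front "circuit breaker": a query reserves its
 * estimated heap cost before running and is rejected if the node-wide budget
 * would be exceeded. Not Solr or Elasticsearch code.
 */
final class HeapCircuitBreaker {
    private final long limitBytes;
    private final AtomicLong reservedBytes = new AtomicLong();

    HeapCircuitBreaker(long limitBytes) {
        if (limitBytes <= 0) {
            throw new IllegalArgumentException("limitBytes must be positive");
        }
        this.limitBytes = limitBytes;
    }

    /** Reserves the estimate, or throws without reserving if it does not fit. */
    void reserve(long estimatedBytes) {
        if (estimatedBytes < 0) {
            throw new IllegalArgumentException("estimatedBytes must be non-negative");
        }
        while (true) {
            long current = reservedBytes.get();
            // Compare against the remaining budget to avoid long overflow.
            if (estimatedBytes > limitBytes - current) {
                throw new IllegalStateException("circuit breaker tripped: estimated "
                        + estimatedBytes + " bytes, " + (limitBytes - current) + " available");
            }
            // Retry if another query thread changed the counter in the meantime.
            if (reservedBytes.compareAndSet(current, current + estimatedBytes)) {
                return;
            }
        }
    }

    /** Returns a previous reservation once the query finishes or fails. */
    void release(long estimatedBytes) {
        reservedBytes.addAndGet(-estimatedBytes);
    }

    long reserved() {
        return reservedBytes.get();
    }
}
```

A caller would reserve before executing the query and release in a finally 
block, so a rejected query costs almost nothing and a failed one does not leak 
its reservation.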

If anyone has any suggestions on how we can limp by for the time being, that 
would also be greatly appreciated (unfortunately our user base needs to keep 
using nested proximity wildcards, but we are willing to put mechanisms in place 
to kill a subset of problematic queries).
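
For reference, the thread-interrupt approach from the description could look 
roughly like the sketch below. It uses only java.util.concurrent; the class 
TimedQueryRunner and its methods are hypothetical and are not part of Solr. 
The important assumption is that the query code cooperates: an interrupt only 
stops work that checks the interrupt flag, so a long loop (such as term 
enumeration) would need to call something like checkInterrupted() periodically.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch: run a query with a deadline and interrupt it on timeout. */
final class TimedQueryRunner {
    private TimedQueryRunner() {}

    /**
     * Submits the query to the pool and waits up to timeoutMillis for a result.
     * On timeout the worker thread is interrupted and TimeoutException is rethrown.
     */
    static <T> T runWithTimeout(ExecutorService pool, Callable<T> query, long timeoutMillis)
            throws InterruptedException, ExecutionException, TimeoutException {
        Future<T> future = pool.submit(query);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // cancel(true) interrupts the thread that is running the query.
            future.cancel(true);
            throw e;
        }
    }

    /**
     * Cooperative cancellation point. Long-running loops must call this
     * regularly, otherwise the interrupt has no effect on them.
     */
    static void checkInterrupted() throws InterruptedException {
        if (Thread.interrupted()) {
            throw new InterruptedException("query cancelled after timeout");
        }
    }
}
```

This only frees the CPU if the expensive code path actually reaches a 
cancellation point, which is why a check inside the enumeration loop itself 
matters more than the wrapper around it.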

> Don't allow runaway queries from harming Solr cluster health or search 
> performance
> ----------------------------------------------------------------------------------
>
>                 Key: SOLR-5986
>                 URL: https://issues.apache.org/jira/browse/SOLR-5986
>             Project: Solr
>          Issue Type: Improvement
>          Components: search
>            Reporter: Steve Davids
>            Priority: Critical
>             Fix For: 4.9