[
https://issues.apache.org/jira/browse/SOLR-5986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Steve Davids updated SOLR-5986:
-------------------------------
Description:
The intent of this ticket is to have all distributed search requests stop
wasting CPU cycles on requests that have already timed out or are so
complicated that they won't be able to execute. We have come across a case
where a nasty wildcard query within a proximity clause was causing the cluster
to enumerate terms for hours even though the query timeout was set to minutes.
This caused a noticeable slowdown within the system, which made us restart the
replicas that happened to service that one request. In the worst case, users
with a relatively low ZooKeeper (zk) timeout value will have nodes start
dropping from the cluster due to long GC pauses.
[~amccurry] built a mechanism into Apache Blur to help with the issue in
BLUR-142 (see commit comment for code, though look at the latest code on the
trunk for newer bug fixes).
Solr should be able to either prevent these problematic queries from running by
some heuristic (possibly estimated size of heap usage) or be able to execute a
thread interrupt on all query threads once the time threshold is met. This
issue mirrors what others have discussed on the mailing list:
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200903.mbox/%[email protected]%3E
was:
The intent of this ticket is to have all distributed search requests stop
wasting CPU cycles on requests that have already timed out. We have come across
a case where a nasty wildcard query within a proximity clause was causing the
cluster to enumerate terms for hours even though the query timeout was set to
minutes. This caused a noticeable slowdown within the system which made us
restart the replicas that happened to service that one request.
[~amccurry] Built a mechanism into Apache Blur to help with the issue in
BLUR-142 (see commit comment for code, though look at the latest code on the
trunk for newer bug fixes).
Ideally Solr will distribute the timeout request parameter and automatically
interrupt all query threads once the threshold is met. This issue mirrors what
others have discussed on the mailing list:
http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200903.mbox/%[email protected]%3E
Summary: Don't allow runaway queries from harming Solr cluster health
or search performance (was: When a query times out all distributed searches
shouldn't continue on until completion)
As a follow-up, we are still experiencing this specific issue, and it is
becoming more and more frequent. Upon further research, it looks like this is a
somewhat common problem that afflicts various Lucene community members. As
noted in the description, Apache Blur has implemented a mechanism for coping,
but more recently Elasticsearch has also implemented its own solution, which
performs an up-front query heap estimation and will pull the "circuit breaker"
if the estimate exceeds a threshold, thus not allowing the query to crash the
cluster.
Documentation:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-fielddata.html#fielddata-circuit-breaker
Tickets: https://github.com/elasticsearch/elasticsearch/issues/2929 and
https://github.com/elasticsearch/elasticsearch/pull/4261
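To make the "circuit breaker" idea concrete, here is a minimal sketch of an up-front heap reservation check. It is only an illustration under stated assumptions: the class HeapCircuitBreaker, its methods, and the byte estimates are hypothetical names invented for this example, and none of it is Solr or Elasticsearch API. How a query's heap usage would actually be estimated is left open.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical sketch of a heap "circuit breaker" (not Solr or Elasticsearch API).
 * A query reserves its estimated heap usage before running and is rejected
 * up front if the running total would exceed a configured limit.
 */
public class HeapCircuitBreaker {

    /** Thrown when a reservation would push usage past the limit. */
    public static class CircuitBreakingException extends RuntimeException {
        public CircuitBreakingException(String message) {
            super(message);
        }
    }

    private final long limitBytes;
    private final AtomicLong reservedBytes = new AtomicLong();

    public HeapCircuitBreaker(long limitBytes) {
        this.limitBytes = limitBytes;
    }

    /** Reserve the estimate before executing; undo and throw if over the limit. */
    public void reserve(long estimatedBytes) {
        long updated = reservedBytes.addAndGet(estimatedBytes);
        if (updated > limitBytes) {
            // Roll back so a rejected query does not count against later ones.
            reservedBytes.addAndGet(-estimatedBytes);
            throw new CircuitBreakingException(
                "estimated usage " + updated + " bytes exceeds limit " + limitBytes);
        }
    }

    /** Release a reservation once the query finishes (success or failure). */
    public void release(long estimatedBytes) {
        reservedBytes.addAndGet(-estimatedBytes);
    }

    /** Bytes currently reserved by in-flight queries. */
    public long reserved() {
        return reservedBytes.get();
    }
}
```

A caller would wrap query execution in reserve(...) and a finally block that calls release(...), so a rejected query fails fast instead of consuming heap.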
If anyone has any suggestions on how we can limp by for the time being, that
would also be greatly appreciated. Unfortunately, our user base needs to keep
using nested proximity wildcards, but we are willing to have mechanisms in
place to kill a subset of problematic queries.
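As a rough illustration of the thread-interrupt approach mentioned in the description, the sketch below runs a task on an executor and interrupts it once a time budget is exceeded. It uses only java.util.concurrent; the class TimeLimitedQuery and the method enumerateTerms are hypothetical stand-ins for real query execution, not Solr or Lucene code.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch: cancel a long-running query once its time budget is spent. */
public class TimeLimitedQuery {

    /** Runs the task and interrupts its worker thread if the timeout elapses. */
    public static <T> T runWithTimeout(ExecutorService pool, Callable<T> task, long timeoutMillis)
            throws TimeoutException, ExecutionException, InterruptedException {
        Future<T> future = pool.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // cancel(true) calls Thread.interrupt() on the thread running the task.
            future.cancel(true);
            throw e;
        }
    }

    /**
     * Stand-in for a term-enumeration loop. The interrupt only takes effect
     * because the loop polls the interrupt flag (here, every 1024 iterations).
     */
    public static long enumerateTerms(long maxTerms) throws InterruptedException {
        long visited = 0;
        while (visited < maxTerms) {
            if ((visited & 0x3FF) == 0 && Thread.currentThread().isInterrupted()) {
                throw new InterruptedException("query cancelled after " + visited + " terms");
            }
            visited++;
        }
        return visited;
    }
}
```

Interruption in Java is cooperative: a loop that never checks the interrupt flag (or blocks in an interruptible call) would keep running after cancel(true), so the expensive code paths themselves would need such checks for this to help.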
> Don't allow runaway queries from harming Solr cluster health or search
> performance
> ----------------------------------------------------------------------------------
>
> Key: SOLR-5986
> URL: https://issues.apache.org/jira/browse/SOLR-5986
> Project: Solr
> Issue Type: Improvement
> Components: search
> Reporter: Steve Davids
> Priority: Critical
> Fix For: 4.9
>