[
https://issues.apache.org/jira/browse/SOLR-5986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062250#comment-14062250
]
Steve Davids commented on SOLR-5986:
------------------------------------
bq. I wonder why you would have to restart the replica? I presume this is
because that is your only recourse to stop a query that might take days to
complete?
Yes, that is correct; restarting the replica is the easiest way to kill a
runaway thread.
bq. If a query takes that long and is ignoring a specified timeout, that seems
like it's own issue that needs resolution.
The Solr instance that distributes the requests to the other shards honors the
timeout value and stops the collection process once the threshold is met (and
returns to the client with partial results, if any are available), but the
queries remain running on all of the shards that were searched as part of the
overall distributed request. If the timeout value were honored on each shard
used in the distributed request, that would probably take care of the problem.
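To make the failure mode concrete, here is a small sketch (illustrative only,
not Solr's actual implementation) of a coordinator that honors the timeout at
the collection point and returns partial results, while the slow shard threads
keep running after it stops waiting for them:

```python
import concurrent.futures
import time

def search_shard(shard_id, delay):
    # Simulated per-shard search; a runaway shard simply keeps working.
    time.sleep(delay)
    return "results-from-shard-%d" % shard_id

def distributed_search(shard_delays, time_allowed):
    # The coordinator enforces the budget only while collecting responses.
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(shard_delays))
    futures = [pool.submit(search_shard, i, d)
               for i, d in enumerate(shard_delays)]
    deadline = time.monotonic() + time_allowed
    partial = []
    for f in futures:
        try:
            partial.append(f.result(timeout=max(deadline - time.monotonic(), 0)))
        except concurrent.futures.TimeoutError:
            pass  # this shard blew the budget; its thread is still running
    pool.shutdown(wait=False)  # the runaway work is NOT stopped -- the problem
    return partial
```

The client gets partial results on time, but nothing ever tells the slow
worker to stop, which is exactly the behavior described above.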
bq. IMHO, the primary goal should be to make SolrCloud clusters more resilient
to performance degradations caused by such nasty queries described above.
+1 resiliency to performance degradations is always a good thing :)
bq. The circuit-breaker approach in the linked ES tickets is clever, but it
does not seem to be as generally applicable as the ability to view all running
queries with an option to stop them.
+1 I actually prefer the Blur route: being able to see the currently running
queries across the cluster, plus the ability to kill them, would be great. It
is also crucial to have queries killed off automatically once they pass a
certain threshold (ideally the timeout value); I don't want to be monitoring
the Solr admin page at all hours of the day (I could script that work if an
API call is available, but that is not my preferred approach).
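For the automatic case: Lucene ships a TimeLimitingCollector that aborts hit
collection once a time budget is spent, and if each shard enforced timeAllowed
that way the runaway thread would unwind on its own. A minimal sketch of that
cooperative pattern (class and function names are hypothetical, not the actual
Solr/Lucene APIs):

```python
import time

class TimeExceededError(Exception):
    pass

class TimeLimitedCollector:
    # Cooperative check in the spirit of Lucene's TimeLimitingCollector:
    # the search loop calls collect() per hit, and once the budget is
    # spent the collector raises, so the shard thread unwinds instead of
    # enumerating terms for hours.
    def __init__(self, time_allowed_ms):
        self.deadline = time.monotonic() + time_allowed_ms / 1000.0
        self.hits = []

    def collect(self, doc_id):
        if time.monotonic() > self.deadline:
            raise TimeExceededError("timeAllowed exceeded after %d hits"
                                    % len(self.hits))
        self.hits.append(doc_id)

def search(collector, doc_ids, cost_per_doc=0.0):
    # Returns (hits, partial_flag); partial results on timeout.
    try:
        for d in doc_ids:
            time.sleep(cost_per_doc)  # simulate per-document work
            collector.collect(d)
        return collector.hits, False
    except TimeExceededError:
        return collector.hits, True
```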
bq. My preference would be to have a response mechanism that 1) applies broadly
and 2) a dev-ops guy can execute in a UI like Solr Admin, or even by API.
+1 if "applies broadly" means the ability to specify a threshold at which
queries start getting killed off.
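A Blur-style registry could look something like the following (purely
illustrative; all names are hypothetical, and cancellation here is cooperative
rather than a hard thread kill): list running queries, kill one by id, or
automatically kill everything past a threshold.

```python
import threading
import time
import uuid

class QueryRegistry:
    # Hypothetical registry: every running query is listed and can be
    # cancelled by id. The search loop must poll its cancel event and
    # unwind when it is set.
    def __init__(self):
        self._lock = threading.Lock()
        self._running = {}  # query_id -> (query_string, start, cancel_event)

    def register(self, query_string):
        qid = str(uuid.uuid4())
        ev = threading.Event()
        with self._lock:
            self._running[qid] = (query_string, time.monotonic(), ev)
        return qid, ev

    def unregister(self, qid):
        with self._lock:
            self._running.pop(qid, None)

    def list_running(self):
        # What an admin UI or API endpoint would show: query + age.
        now = time.monotonic()
        with self._lock:
            return {qid: (q, now - t)
                    for qid, (q, t, _) in self._running.items()}

    def kill(self, qid):
        # The manual dev-ops action: flag one query for cancellation.
        with self._lock:
            entry = self._running.get(qid)
        if entry is not None:
            entry[2].set()
        return entry is not None

    def kill_older_than(self, threshold_s):
        # The automatic variant: cancel anything past the threshold.
        killed = []
        now = time.monotonic()
        with self._lock:
            for qid, (_, t, ev) in self._running.items():
                if now - t > threshold_s:
                    ev.set()
                    killed.append(qid)
        return killed
```

A periodic watchdog calling kill_older_than(timeout) would cover the
"don't make me watch the admin page" requirement above.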
> Don't allow runaway queries from harming Solr cluster health or search
> performance
> ----------------------------------------------------------------------------------
>
> Key: SOLR-5986
> URL: https://issues.apache.org/jira/browse/SOLR-5986
> Project: Solr
> Issue Type: Improvement
> Components: search
> Reporter: Steve Davids
> Priority: Critical
> Fix For: 4.9
>
>
> The intent of this ticket is to have all distributed search requests stop
> wasting CPU cycles on requests that have already timed out or are so
> complicated that they won't be able to execute. We have come across a case
> where a nasty wildcard query within a proximity clause was causing the
> cluster to enumerate terms for hours even though the query timeout was set to
> minutes. This caused a noticeable slowdown within the system and forced us to
> restart the replicas that happened to service that one request; in the worst
> case scenario, users with a relatively low zk timeout value will have nodes
> start dropping from the cluster due to long GC pauses.
> [~amccurry] built a mechanism into Apache Blur to help with this issue in
> BLUR-142 (see the commit comment for the code, though look at the latest
> code on trunk for newer bug fixes).
> Solr should be able to either prevent these problematic queries from running
> by some heuristic (possibly estimated size of heap usage) or be able to
> execute a thread interrupt on all query threads once the time threshold is
> met. This issue mirrors what others have discussed on the mailing list:
> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200903.mbox/%[email protected]%3E
--
This message was sent by Atlassian JIRA
(v6.2#6252)