Pierre Salagnac created SOLR-17421:
--------------------------------------
Summary: With overseer node role enabled, overseer may be stopped
without giving-up leadership
Key: SOLR-17421
URL: https://issues.apache.org/jira/browse/SOLR-17421
Project: Solr
Issue Type: Bug
Security Level: Public (Default Security Level. Issues are Public)
Affects Versions: 9.6, 8.11
Reporter: Pierre Salagnac
The overseer may retain leadership even though the thread pool that is
supposed to consume the collection state mutator queue has already been shut down.
This bug is probably infrequent, but when it happens the impact is severe:
the overseer cluster state updater is stuck and all collection admin requests
are very likely to fail. Because the overseer is stuck, all the enqueued
operations (collection creation, deletion...) fail and remain in the
collection API queue.
h2. Root cause
The root cause is that the {{QUIT}} command does not cancel the overseer election
if any error occurs while shutting down the state updater thread pool.
{code:java}
level: ERROR
logger: org.apache.solr.cloud.Overseer
message: Overseer could not process the current clusterstate state update message, skipping the message: {
"operation":"quit",
"id":"72073405485023239-<host>_solr-n_0000000948"}
node_name: <host>:8983_solr
threadId: 281272
threadName: OverseerStateUpdate-72073405485023239-<host>_solr-n_0000000948
thrown: java.lang.RuntimeException: Timeout waiting for pool org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor@2c1da18d[Shutting down, pool size = 1, active threads = 1, queued tasks = 0, completed tasks = 0] to shutdown.
at org.apache.solr.common.util.ExecutorUtil.awaitTermination(ExecutorUtil.java:142)
at org.apache.solr.common.util.ExecutorUtil.awaitTermination(ExecutorUtil.java:129)
at org.apache.solr.common.util.ExecutorUtil.shutdownAndAwaitTermination(ExecutorUtil.java:112)
at org.apache.solr.cloud.OverseerTaskProcessor.close(OverseerTaskProcessor.java:431)
at org.apache.solr.cloud.Overseer$ClusterStateUpdater.processMessage(Overseer.java:601)
at org.apache.solr.cloud.Overseer$ClusterStateUpdater.processQueueItem(Overseer.java:450)
at org.apache.solr.cloud.Overseer$ClusterStateUpdater.run(Overseer.java:377)
at java.base/java.lang.Thread.run(Thread.java:1583)
{code}
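The failure mode described above can be illustrated with a minimal, self-contained sketch. It is not the Solr code: {{QuitHandlingSketch}}, {{quitUnsafe}}, {{quitSafe}} and {{failingClose}} are hypothetical names, and the {{try/finally}} variant is only one possible way to make sure the election is cancelled, not necessarily the fix that will be applied.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class QuitHandlingSketch {

    // Hypothetical stand-in for a pool shutdown that times out and throws.
    static Runnable failingClose() {
        return () -> {
            throw new RuntimeException("Timeout waiting for pool to shutdown.");
        };
    }

    // Failure mode: if closePool throws, cancelElection is never reached.
    static void quitUnsafe(Runnable closePool, Runnable cancelElection) {
        closePool.run();
        cancelElection.run();
    }

    // One possible remedy (an assumption): cancel the election in a finally block.
    static void quitSafe(Runnable closePool, Runnable cancelElection) {
        try {
            closePool.run();
        } finally {
            cancelElection.run();
        }
    }

    // Returns whether the node still believes it is leader after a failed QUIT.
    static boolean stillLeaderAfterQuit(boolean safe) {
        AtomicBoolean leader = new AtomicBoolean(true);
        Runnable cancelElection = () -> leader.set(false);
        try {
            if (safe) {
                quitSafe(failingClose(), cancelElection);
            } else {
                quitUnsafe(failingClose(), cancelElection);
            }
        } catch (RuntimeException e) {
            // Mirrors the log above: the error is reported and the message is skipped.
        }
        return leader.get();
    }

    public static void main(String[] args) {
        System.out.println("unsafe quit, still leader: " + stillLeaderAfterQuit(false));
        System.out.println("safe quit, still leader:   " + stillLeaderAfterQuit(true));
    }
}
```

In the unsafe variant the exception escapes before the election is cancelled, so the node keeps leadership while its state updater is gone, which matches the symptom reported here.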
h2. Proximate cause
It seems to me that long-running operations in the collection API could trigger
the bug more frequently. Shutting down the thread pool has a 60-second timeout;
if a long-running operation is still active when that timeout expires, we get
the exception shown above.
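The suspected trigger can be reproduced with plain JDK executors. The sketch below is an assumption-laden illustration, not Solr's {{ExecutorUtil}}: {{ShutdownTimeoutSketch}} and {{shutdownAndAwait}} are hypothetical names, and the timeout is shortened to 200 ms so the example finishes quickly.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ShutdownTimeoutSketch {

    // Hypothetical helper: orderly shutdown that throws if tasks outlive the timeout.
    static void shutdownAndAwait(ExecutorService pool, long timeoutMs) {
        pool.shutdown(); // rejects new tasks; already submitted tasks keep running
        boolean terminated;
        try {
            terminated = pool.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException("Interrupted waiting for pool " + pool, e);
        }
        if (!terminated) {
            throw new RuntimeException("Timeout waiting for pool " + pool + " to shutdown.");
        }
    }

    // Returns true if shutting down a pool with a long-running task times out.
    static boolean timesOutWithLongTask() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        CountDownLatch release = new CountDownLatch(1);
        // Stand-in for a long-running collection API operation.
        pool.submit(() -> {
            try {
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        try {
            shutdownAndAwait(pool, 200);
            return false;
        } catch (RuntimeException e) {
            return true;
        } finally {
            release.countDown(); // let the task finish so the JVM can exit
        }
    }

    public static void main(String[] args) {
        System.out.println("shutdown timed out: " + timesOutWithLongTask());
    }
}
```

Because {{shutdown()}} does not interrupt running tasks, any operation longer than the timeout makes the wait fail, which is consistent with the {{active threads = 1}} state in the log.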