[
https://issues.apache.org/jira/browse/SOLR-17348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17860019#comment-17860019
]
Michael Gibney commented on SOLR-17348:
---------------------------------------
In order to naively check for possible deadlocks if the parallelism of
zkCallback executor were modified, I changed it to
{{ExecutorUtil.newMDCAwareSingleThreadExecutor()}}. Most of the tests were fine
(I did not run nightlies); there were only two tests that failed, one of which
({{ZkSolrClientTest.testMultipleWatchesAsync}}) is not really worth mentioning
because iiuc it explicitly tests for parallel callback execution, so it should
be expected to fail.
The concerning failure can reproduced via:
{code:sh}
./gradlew :solr:core:test --tests
"org.apache.solr.cloud.TestCloudConsistency.testOutOfSyncReplicasCannotBecomeLeader"
-Ptests.jvms=5 "-Ptests.jvmargs=-XX:TieredStopAtLevel=1 -XX:+UseParallelGC
-XX:ActiveProcessorCount=1 -XX:ReservedCodeCacheSize=120m"
-Ptests.seed=56FF65ADA5A59077 -Ptests.file.encoding=ISO-8859-1
{code}
I'm not sure what the deadlock here is, but I'd like to know whether it means:
# full parallelism of zk callbacks is really required, and this just won't work
# with a few changes, zk callbacks would _not_ require parallelism
Note also that I wouldn't assume that a passing test suite would mean
everything is fine. I don't plan to pursue this further atm, but I wanted to
drop my thoughts and experiments into this issue in case someone feels inclined
to follow up on it. It's possible that a deep dive here might even improve
tangential things (e.g., perhaps leader elections?) in unanticipated ways ...
> Mitigate extreme parallelism of zkCallback executor
> ---------------------------------------------------
>
> Key: SOLR-17348
> URL: https://issues.apache.org/jira/browse/SOLR-17348
> Project: Solr
> Issue Type: Improvement
> Reporter: Michael Gibney
> Priority: Minor
>
> zkCallback executor is [currently an unbounded thread pool of core size
> 0|https://github.com/apache/solr/blob/709a1ee27df23b419d09fe8f67c3276409131a4a/solr/solrj-zookeeper/src/java/org/apache/solr/common/cloud/SolrZkClient.java#L91-L92],
> using a SynchronousQueue. Thus, a flood of zkCallback events (as might be
> triggered by a cluster restart, e.g.) can result in spinning up a very large
> number of threads. In practice we have encountered as many as 35k threads
> created in some such cases, even after the impact of this situation was
> reduced by the fix for SOLR-11535.
> Inspired by [~cpoerschke]'s recent [closer look at thread pool
> behavior|https://issues.apache.org/jira/browse/SOLR-13350?focusedCommentId=17853178#comment-17853178],
> I wondered if we might be able to employ a bounded queue to alleviate some
> of the pressure from bursty zk callbacks.
> The new config might look something like: {{corePoolSize=1024,
> maximumPoolSize=Integer.MAX_VALUE, allowCoreThreadTimeout=true, workQueue=new
> LinkedBlockingQueue<>(1024)}}. This would allow the pool to grow up to (and
> shrink from) corePoolSize in the same manner it currently does, but once
> exceeding corePoolSize (e.g. during a cluster restart or other callback flood
> event), tasks would be queued (up to some fixed limit). If the queue limit is
> exceeded, new threads would still be created, but we would have avoided the
> current “always create a thread” behavior, and by so doing hopefully reduce
> task execution time and improve overall throughput.
> From the ThreadPoolExecutor javadocs:
> {quote}Direct handoffs. A good default choice for a work queue is a
> SynchronousQueue that hands off tasks to threads without otherwise holding
> them. Here, an attempt to queue a task will fail if no threads are
> immediately available to run it, so a new thread will be constructed. This
> policy avoids lockups when handling sets of requests that might have internal
> dependencies. Direct handoffs generally require unbounded maximumPoolSizes to
> avoid rejection of new submitted tasks. This in turn admits the possibility
> of unbounded thread growth when commands continue to arrive on average faster
> than they can be processed.{quote}
> So afaict SynchronousQueue mainly makes sense if there exists the possibility
> of deadlock due to dependencies among tasks, and I think this should ideally
> _not_ be the case with zk callbacks (though in practice I'm not sure this is
> the case).
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]