[jira] [Created] (SOLR-17348) Mitigate extreme parallelism of zkCallback executor

Michael Gibney (Jira) Tue, 25 Jun 2024 12:24:10 -0700

Michael Gibney created SOLR-17348:
-------------------------------------

             Summary: Mitigate extreme parallelism of zkCallback executor
                 Key: SOLR-17348
                 URL: https://issues.apache.org/jira/browse/SOLR-17348
             Project: Solr
          Issue Type: Improvement
            Reporter: Michael Gibney



zkCallback executor is [currently an unbounded thread pool of core size 
0|https://github.com/apache/solr/blob/709a1ee27df23b419d09fe8f67c3276409131a4a/solr/solrj-zookeeper/src/java/org/apache/solr/common/cloud/SolrZkClient.java#L91-L92],
 using a SynchronousQueue. Thus, a flood of zkCallback events (as might be 
triggered, e.g., by a cluster restart) can result in spinning up a very large 
number of threads. In practice we have encountered as many as 35k threads 
created in some such cases, even after the impact of this situation was reduced 
by the fix for SOLR-11535.
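
As a rough illustration of the thread growth described above, here is a minimal sketch using only JDK classes. It assumes the zkCallback pool behaves like a plain ThreadPoolExecutor with core size 0, an unbounded maximum, and a SynchronousQueue. The class and method names are hypothetical, and the sketch does not reproduce how Solr actually constructs its executor.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; a plain-JDK stand-in, not Solr code.
public class DirectHandoffBurst {

    /** Submits {@code burst} blocking tasks and returns the largest pool size reached. */
    public static int peakThreads(int burst) {
        // Assumed shape of the current pool: core size 0, unbounded maximum,
        // direct handoff through a SynchronousQueue.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                0, Integer.MAX_VALUE, 60L, TimeUnit.SECONDS, new SynchronousQueue<>());
        CountDownLatch release = new CountDownLatch(1);
        try {
            for (int i = 0; i < burst; i++) {
                // Each task blocks until released, so no worker is ever idle
                // and every handoff attempt fails, forcing a new thread.
                pool.execute(() -> {
                    try {
                        release.await();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            return pool.getLargestPoolSize();
        } finally {
            release.countDown();
            pool.shutdown();
            try {
                pool.awaitTermination(10, TimeUnit.SECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("peak threads for a burst of 50: " + peakThreads(50));
    }
}
```

Because every task is still running when the next one arrives, the pool ends up with one thread per submitted task, which is the same pattern that a callback flood would produce at much larger scale.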

Inspired by [~cpoerschke]'s recent [closer look at thread pool 
behavior|https://issues.apache.org/jira/browse/SOLR-13350?focusedCommentId=17853178#comment-17853178],
 I wondered if we might be able to employ a bounded queue to alleviate some of 
the pressure from bursty zk callbacks.

The new config might look something like: {{corePoolSize=1024, 
maximumPoolSize=Integer.MAX_VALUE, allowCoreThreadTimeOut=true, workQueue=new 
LinkedBlockingQueue<>(1024)}}. This would allow the pool to grow up to (and 
shrink from) corePoolSize in the same manner it currently does, but once the 
pool reaches corePoolSize (e.g. during a cluster restart or other callback 
flood event), further tasks would be queued, up to the fixed queue capacity. 
If the queue fills, new threads would still be created, but we would have 
avoided the current “always create a thread” behavior, which should hopefully 
reduce task execution time and improve overall throughput.
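
To make the proposed behavior concrete, here is a minimal sketch of such a pool using only JDK classes, scaled down to small numbers so the effect is easy to see. The class and method names are hypothetical, and the sizes (4 core threads, queue capacity 4) are illustrative assumptions rather than the 1024 values suggested above.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; a plain-JDK sketch of the proposed configuration.
public class BoundedQueueBurst {

    /** Submits {@code burst} blocking tasks and returns the largest pool size reached. */
    public static int peakThreads(int core, int queueCapacity, int burst) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                core, Integer.MAX_VALUE, 60L, TimeUnit.SECONDS,
                new LinkedBlockingQueue<>(queueCapacity));
        // Let idle core threads time out so the pool can shrink back toward 0.
        pool.allowCoreThreadTimeOut(true);
        CountDownLatch release = new CountDownLatch(1);
        try {
            for (int i = 0; i < burst; i++) {
                // Tasks block until released, so queued work stays queued
                // and the thread count depends only on the submission order.
                pool.execute(() -> {
                    try {
                        release.await();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            return pool.getLargestPoolSize();
        } finally {
            release.countDown();
            pool.shutdown();
            try {
                pool.awaitTermination(10, TimeUnit.SECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    public static void main(String[] args) {
        // 4 core threads + 4 queued tasks absorb the first 8 submissions;
        // only the remaining 2 force extra threads.
        System.out.println("peak threads for a burst of 10: " + peakThreads(4, 4, 10));
    }
}
```

With these numbers a burst of 10 blocking tasks peaks at 6 threads instead of 10: four core threads, four queued tasks, and two overflow threads. One caveat worth keeping in mind: per the ThreadPoolExecutor javadocs, while fewer than corePoolSize threads are running, a new thread is started for each submission even if other workers are idle, so growth below corePoolSize is not identical to the direct-handoff case.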

From the ThreadPoolExecutor javadocs:

{quote}Direct handoffs. A good default choice for a work queue is a 
SynchronousQueue that hands off tasks to threads without otherwise holding 
them. Here, an attempt to queue a task will fail if no threads are immediately 
available to run it, so a new thread will be constructed. This policy avoids 
lockups when handling sets of requests that might have internal dependencies. 
Direct handoffs generally require unbounded maximumPoolSizes to avoid rejection 
of new submitted tasks. This in turn admits the possibility of unbounded thread 
growth when commands continue to arrive on average faster than they can be 
processed.{quote}

So afaict SynchronousQueue mainly makes sense if there exists the possibility 
of deadlock due to dependencies among tasks, and I think this should ideally 
_not_ be the case with zk callbacks (though I'm not sure that actually holds 
in practice).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
