[ https://issues.apache.org/jira/browse/SOLR-17754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17951216#comment-17951216 ]
Pierre Salagnac commented on SOLR-17754:
----------------------------------------

Thanks [~dsmiley] :) I'm opening a pull request to fix this bug. The change also fixes a minor logging bug in the task queue, so I'm adding details here.

For each (Collection API) overseer task, we try to set data in the result node multiple times. That's why we see the log message {{"Response ZK path: _<node>_ doesn't exist. Requestor may have disconnected from ZooKeeper"}} so often.
* For each synchronous task ({{asyncId == null}}), we set the result twice (we call {{workQueue.remove()}} twice), once in the main loop thread and once in the task runner thread. Depending on thread scheduling, either the main loop thread or the runner thread gets there first. There is a chance the node was already removed by the client (Jetty) thread by the time of the second attempt to write the result.
* For each asynchronous task ({{asyncId != null}}), we don't create the result node at all when enqueuing the task in the queue. The result is supposed to be written to the {{/overseer/collection-map-completed}} ZK distributed map instead (or the failure map on error). But the current code still tries to write to the nonexistent response node. Consequently, the message is logged twice for each asynchronous task.

This fix removes the erroneous messages. Now only the runner sub-thread removes the item on task completion, and we make sure we try to store the result in the response node only for synchronous tasks.

> Race condition in overseer main loop leaves collection/shard/replica locked
> ----------------------------------------------------------------------------
>
>                  Key: SOLR-17754
>                  URL: https://issues.apache.org/jira/browse/SOLR-17754
>              Project: Solr
>           Issue Type: Task
>     Affects Versions: 9.8
>             Reporter: Pierre Salagnac
>             Priority: Major
>
> We hit a rare race condition in the overseer main loop, and it led to a fully stuck overseer. All operations submitted to the collection API were then never dequeued and never processed until a restart of the overseer.
>
> For the race condition to reproduce, we need at least 100 concurrent tasks running in the collection API, so this bug is reproducible only when the overseer is under high load.
>
> *Note on overseer threading*
>
> The overseer main loop in class {{OverseerTaskProcessor}} processes tasks that were previously enqueued to the overseer distributed queue (stored in Zookeeper). This class has a producer/consumer pattern:
> - a single thread dequeues tasks from the distributed queue (referred to below as the _main loop_ thread)
> - before submitting a task, the _main loop_ thread tries to lock the collection/shard/replica (depending on the task type). If the lock is already taken because another task is already running on the same resource, the task is not submitted and will be reconsidered in a later iteration.
> - a pool with up to 100 threads then executes the tasks (referred to below as the _runner_ threads).
> - at the end of the task, the _runner_ thread unlocks the collection/shard/replica.
>
> Since the pool uses a {{SynchronousQueue}}, it is not possible to submit more than 100 concurrent tasks to the pool; otherwise the task is rejected by the pool and an exception is raised (see the exception below, and the first sketch after the quoted description). To avoid this exception, the _main loop_ thread makes sure it does not submit more than 100 concurrent tasks by maintaining the IDs of the currently running tasks in an in-memory collection.
> Each time a task is submitted, its task ID is added to the {{runningTasks}} collection. Then, once the _runner_ thread has completed the task (the same applies in case of a failure), it removes the task ID from this collection. Before submitting a task to the _runner_ pool, the _main loop_ thread checks the size of the {{runningTasks}} collection. If there are already 100 items in the collection, the task is not submitted but instead added to another collection ({{blockedTasks}}), so it will be submitted to the pool in a later iteration, once the size of {{runningTasks}} is below 100 (the full mechanism is not detailed here).
>
> *Root cause race condition*
>
> The bug is that the set of IDs in the {{runningTasks}} collection is not fully aligned with the set of _runner_ threads currently executing a task. The instant right after a _runner_ thread removes its ID, the thread is still running and hasn't returned to the pool yet. But since the task ID is no longer in the collection, the _main loop_ thread thinks there is room available to submit a new task to the pool. There is a time window where the _main loop_ thread may submit a 101st task to the pool (the second sketch after the quoted description models this window). The exception below is raised:
> {code:java}
> java.util.concurrent.RejectedExecutionException: Task org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$$Lambda/0x00000070013f2a30@53c6a1bb rejected from org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor@1f9c1653[Running, pool size = 100, active threads = 100, queued tasks = 0, completed tasks = 100]
>     at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2081) ~[?:?]
>     at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:841) ~[?:?]
>     at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1376) ~[?:?]
>     at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.execute(ExecutorUtil.java:361) ~[main/:?]
>     at org.apache.solr.cloud.OverseerTaskProcessor.run(OverseerTaskProcessor.java:376) ~[main/:?]
>     at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
> {code}
>
> *Impact*
>
> Once we hit the above exception, the lock on the collection/shard/replica is not released (it was acquired before submitting the task to the _runner_ thread pool). This means that all future tasks that require a lock on the same collection/shard/replica won't be processed and will remain in the Zookeeper queue forever, until the overseer is restarted.
>
> Another impact is that the overseer does not read more than 1000 tasks from the queue. Consequently, if more than 1000 tasks are blocked by the unreleased lock, later tasks will not be read (and not processed) even if they don't require the same lock. So once 1000 tasks are waiting for the lock, the overseer is fully blocked. It may take a long time between the initial race condition and a fully blocked overseer (days or weeks, depending on the number of submitted API calls).
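To make the {{SynchronousQueue}} constraint from the threading note above concrete, here is a minimal, self-contained sketch using only plain JDK classes (two worker threads instead of 100; the class and variable names are illustrative, not Solr code). A pool backed by a {{SynchronousQueue}} has no task buffer, so a submission is rejected as soon as every worker thread is busy.
{code:java}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class SynchronousQueueRejectionDemo {
  public static void main(String[] args) {
    // Two worker threads instead of 100; a SynchronousQueue cannot buffer pending tasks.
    ThreadPoolExecutor pool = new ThreadPoolExecutor(
        2, 2, 0L, TimeUnit.MILLISECONDS, new SynchronousQueue<>());

    CountDownLatch release = new CountDownLatch(1);
    Runnable busyTask = () -> {
      try {
        release.await(); // keep the worker thread occupied
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    };

    pool.execute(busyTask); // occupies worker 1
    pool.execute(busyTask); // occupies worker 2
    try {
      pool.execute(busyTask); // no idle worker and no queue capacity, so it is rejected
    } catch (RejectedExecutionException e) {
      System.out.println("Third task rejected: " + e.getMessage());
    } finally {
      release.countDown();
      pool.shutdown();
    }
  }
}
{code}
The stack trace quoted in the description shows the overseer's runner pool in exactly this state (pool size = 100, active threads = 100, queued tasks = 0) when the extra submission arrives.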
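And here is a simplified, non-Solr sketch of the window described under *Root cause race condition*: the runner removes its task ID before its thread has actually returned to the pool, so the main loop's size check can pass at a moment when no pool thread is idle. All names here ({{trySubmit}}, {{MAX_PARALLEL_TASKS}}, etc.) are illustrative stand-ins, not the actual {{OverseerTaskProcessor}} code.
{code:java}
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class RunningTasksRaceSketch {
  static final int MAX_PARALLEL_TASKS = 100;
  static final Set<String> runningTasks = ConcurrentHashMap.newKeySet();
  static final ExecutorService runnerPool = new ThreadPoolExecutor(
      MAX_PARALLEL_TASKS, MAX_PARALLEL_TASKS, 0L, TimeUnit.MILLISECONDS, new SynchronousQueue<>());

  /** Called by the single main-loop thread for each task dequeued from Zookeeper. */
  static boolean trySubmit(String taskId, Runnable work) {
    if (runningTasks.size() >= MAX_PARALLEL_TASKS) {
      return false; // deferred; the real code parks the task in blockedTasks
    }
    runningTasks.add(taskId);
    // Race window: a runner may have just removed its own ID (so the size check above
    // passed) while its thread has not yet returned to the pool. With a SynchronousQueue
    // and all worker threads still busy, this execute() throws RejectedExecutionException.
    runnerPool.execute(() -> {
      try {
        work.run();
      } finally {
        runningTasks.remove(taskId); // removed before the thread is back in the pool
      }
    });
    return true;
  }
}
{code}
In the real {{OverseerTaskProcessor}}, the unhandled rejection leaves the collection/shard/replica lock (acquired before the submission) unreleased, which is the stuck state described under *Impact*.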