[ https://issues.apache.org/jira/browse/SOLR-17754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated SOLR-17754:
----------------------------------
    Labels: pull-request-available  (was: )

> Race condition in overseer main loop leaves collection/shard/replica locked
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-17754
>                 URL: https://issues.apache.org/jira/browse/SOLR-17754
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 9.8
>            Reporter: Pierre Salagnac
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> We hit a rare race condition in the overseer main loop that left the overseer fully stuck. All operations submitted to the collection API were then never dequeued and never processed until the overseer was restarted.
> Reproducing the race condition requires at least 100 concurrent tasks running in the collection API; the bug only shows up when the overseer is under high load.
>
> *Note on overseer threading*
> The overseer main loop in class {{OverseerTaskProcessor}} processes tasks that were previously enqueued to the overseer distributed queue (stored in Zookeeper). This class follows a producer/consumer pattern:
> - A single thread dequeues tasks from the distributed queue (called the _main loop_ thread below).
> - Before submitting a task, the _main loop_ thread tries to lock the collection/shard/replica (depending on the task type). If the lock is already held because another task is running on the same resource, the task is not submitted and is reconsidered in a later iteration.
> - A pool of up to 100 threads executes the tasks (called the _runner_ threads below).
> - At the end of a task, the _runner_ thread unlocks the collection/shard/replica.
>
> Since the pool uses a {{SynchronousQueue}}, it is not possible to submit more than 100 concurrent tasks to the pool; otherwise the task is rejected by the pool and an exception is raised (see the exception below). To avoid this exception, the _main loop_ thread makes sure it does not submit more than 100 concurrent tasks by keeping the IDs of the currently running tasks in an in-memory collection. Each time a task is submitted, its ID is added to the {{runningTasks}} collection. Once the _runner_ thread has completed the task (the same applies on failure), it removes the task ID from this collection. Before submitting a task to the _runner_ pool, the _main loop_ thread checks the size of {{runningTasks}}. If there are already 100 entries, the task is not submitted; instead it is added to another collection ({{blockedTasks}}) so that it can be submitted to the pool in a later iteration, once the size of {{runningTasks}} drops below 100 (the full mechanism is not detailed here).
>
> *Root cause race condition*
> The bug is that the set of IDs in {{runningTasks}} is not fully aligned with the set of _runner_ threads currently executing a task. In the instant right after a _runner_ thread removes its task ID, the thread is still running and has not yet returned to the pool. But since the task ID is no longer in the collection, the _main loop_ thread believes there is room available to submit a new task to the pool. There is a time window during which the _main loop_ thread may submit a 101st task to the pool.
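> For illustration only, below is a minimal, self-contained sketch of this window. It is not the actual {{OverseerTaskProcessor}} code: the class name, the tiny pool size and the bookkeeping fields are made up for the example. It mimics the pattern described above, where the worker removes its task ID before it has actually returned to the {{SynchronousQueue}}-backed pool, so a submitter that only consults the ID set may occasionally get a {{RejectedExecutionException}}:
> {code:java}
> // Sketch of the race window (illustrative names, not Solr code).
> import java.util.Set;
> import java.util.concurrent.ConcurrentHashMap;
> import java.util.concurrent.RejectedExecutionException;
> import java.util.concurrent.SynchronousQueue;
> import java.util.concurrent.ThreadPoolExecutor;
> import java.util.concurrent.TimeUnit;
>
> public class OverseerRaceSketch {
>
>   // Stand-in for the 100-thread runner pool, shrunk to 2 for the example.
>   private static final int MAX_PARALLEL_TASKS = 2;
>
>   // Stand-in for the in-memory bookkeeping of running task IDs.
>   private static final Set<String> runningTasks = ConcurrentHashMap.newKeySet();
>
>   // SynchronousQueue means the pool never queues: a submit is rejected unless
>   // a worker thread is already parked waiting for work.
>   private static final ThreadPoolExecutor runnerPool =
>       new ThreadPoolExecutor(MAX_PARALLEL_TASKS, MAX_PARALLEL_TASKS,
>           0L, TimeUnit.MILLISECONDS, new SynchronousQueue<>());
>
>   public static void main(String[] args) throws InterruptedException {
>     for (int i = 0; i < 10_000; i++) {
>       String taskId = "task-" + i;
>
>       // "Main loop" side: only the in-memory set is consulted, not the pool state.
>       while (runningTasks.size() >= MAX_PARALLEL_TASKS) {
>         Thread.sleep(1);
>       }
>       runningTasks.add(taskId);
>
>       try {
>         runnerPool.execute(() -> {
>           // ... the collection API task would run here ...
>           runningTasks.remove(taskId); // ID removed while the thread is still occupied:
>           // between this remove and the worker going back to poll the
>           // SynchronousQueue, the main loop sees free room and may submit a
>           // task the pool cannot accept.
>         });
>       } catch (RejectedExecutionException e) {
>         // In the real overseer this is the point where the
>         // collection/shard/replica lock taken before submission leaks.
>         System.out.println("Hit the race window on " + taskId + ": " + e.getMessage());
>         runningTasks.remove(taskId);
>         break;
>       }
>     }
>     runnerPool.shutdownNow();
>   }
> }
> {code}
> Because the window is narrow, a given run of this sketch may need many iterations (or may never) hit it; the point is only to show that checking the ID set is not the same as checking that a pool thread is free.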
> The exception below is raised:
> {code:java}
> java.util.concurrent.RejectedExecutionException: Task org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$$Lambda/0x00000070013f2a30@53c6a1bb rejected from org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor@1f9c1653[Running, pool size = 100, active threads = 100, queued tasks = 0, completed tasks = 100]
>     at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2081) ~[?:?]
>     at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:841) ~[?:?]
>     at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1376) ~[?:?]
>     at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.execute(ExecutorUtil.java:361) ~[main/:?]
>     at org.apache.solr.cloud.OverseerTaskProcessor.run(OverseerTaskProcessor.java:376) ~[main/:?]
>     at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
> {code}
>
> *Impact*
> Once we hit the above exception, the lock on the collection/shard/replica is not released (it was acquired before submitting the task to the _runner_ thread pool). This means that all future tasks that need the lock on the same collection/shard/replica will never be processed and will remain in the Zookeeper queue forever, until the overseer is restarted.
>
> Another impact is that the overseer never reads more than 1000 tasks from the queue. Consequently, if more than 1000 tasks are blocked by the unreleased lock, later tasks are not read (and not processed) even if they do not require the same lock. Once 1000 tasks are waiting for the lock, the overseer is fully blocked. It may take a long time between the initial race condition and a fully blocked overseer (days or weeks, depending on the number of submitted API calls).

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org