[ https://issues.apache.org/jira/browse/SOLR-17754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pierre Salagnac updated SOLR-17754:
-----------------------------------
    Issue Type: Bug  (was: Task)

> Race condition in overseer main loop leaves collection/shard/replica locked
> ---------------------------------------------------------------------------
>
>                 Key: SOLR-17754
>                 URL: https://issues.apache.org/jira/browse/SOLR-17754
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 9.8
>            Reporter: Pierre Salagnac
>            Priority: Major
>
> We hit a rare race condition in the overseer main loop that left the overseer 
> fully stuck. Operations submitted to the collection API were then never 
> dequeued and never processed until the overseer was restarted.
> For the race condition to reproduce, we need at least 100 concurrent 
> collection API tasks; the bug only occurs when the overseer is under high 
> load.
>  
>  *Note on overseer threading*
> The overseer main loop in class {{OverseerTaskProcessor}} processes tasks 
> that were previously enqueued to the overseer distributed queue (stored in 
> ZooKeeper). This class follows a producer/consumer pattern (a simplified 
> sketch follows the list):
>  - a single thread dequeues tasks from the distributed queue (referred to 
> below as the _main loop_ thread)
>  - before submitting a task, the _main loop_ thread tries to lock the 
> collection/shard/replica (depending on the task type). If the lock is already 
> held because another task is running on the same resource, the task is not 
> submitted and is reconsidered in a later iteration.
>  - a pool of up to 100 threads executes the tasks (the _runner_ threads).
>  - at the end of the task, the _runner_ thread unlocks the 
> collection/shard/replica.
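> A minimal sketch of one iteration of this flow, with hypothetical names 
> ({{workQueue}}, {{lockTree}}, {{runnerPool}} and {{processTask}} are 
> illustrative, not the actual Solr identifiers):
> {code:java}
> // Simplified, hypothetical sketch of the lock/submit/unlock flow described above.
> Task task = workQueue.peek();            // task previously enqueued in ZooKeeper
> Lock lock = lockTree.tryLock(task);      // lock collection/shard/replica for this task type
> if (lock == null) {
>   return;                                // already locked by another task, retry in a later iteration
> }
> runnerPool.execute(() -> {               // hand off to one of up to 100 runner threads
>   try {
>     processTask(task);                   // execute the collection API task
>   } finally {
>     lock.unlock();                       // the runner thread releases the lock at the end
>   }
> });
> {code}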
>  
> Since the pool uses a {{SynchronousQueue}}, it is not possible to submit more 
> than 100 concurrent tasks to the pool; otherwise the task is rejected by the 
> pool and an exception is raised (see the exception below). To avoid this 
> exception, the _main loop_ thread makes sure it does not submit more than 100 
> concurrent tasks by maintaining the IDs of the currently running tasks in an 
> in-memory collection. Each time a task is submitted, its ID is added to the 
> {{runningTasks}} collection. Once the _runner_ thread has completed the task 
> (or it has failed), it removes the task ID from this collection. Before 
> submitting a task to the _runner_ pool, the _main loop_ thread checks the 
> size of {{runningTasks}}. If there are already 100 entries, the task is not 
> submitted; instead it is added to another collection ({{blockedTasks}}) so it 
> can be submitted to the pool in a later iteration, once the size of 
> {{runningTasks}} drops below 100 (the full mechanism is not detailed here).
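> In pseudo-Java, the capacity check might look like the following (a sketch 
> with assumed names; only {{runningTasks}} and {{blockedTasks}} come from the 
> description above, everything else is illustrative):
> {code:java}
> // Sketch of the main-loop capacity check (assumed names, simplified).
> Set<String> runningTasks = ConcurrentHashMap.newKeySet();  // IDs of tasks handed to the pool
> Map<String, Task> blockedTasks = new LinkedHashMap<>();    // tasks deferred to a later iteration
> 
> if (runningTasks.size() >= 100) {        // pool is considered full
>   blockedTasks.put(taskId, task);        // retry once a slot frees up
> } else {
>   runningTasks.add(taskId);              // reserve a slot before submitting
>   runnerPool.execute(() -> {
>     try {
>       runTask(task);
>     } finally {
>       runningTasks.remove(taskId);       // slot is freed here, while the thread
>                                          // has not yet returned to the pool
>     }
>   });
> }
> {code}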
>  
> *Root cause: race condition*
> The bug is that the set of IDs in the {{runningTasks}} collection is not 
> fully aligned with the set of _runner_ threads currently executing a task. 
> Right after a _runner_ thread removes its task ID, the thread is still 
> running and has not yet returned to the pool. But since the task ID is no 
> longer in the collection, the _main loop_ thread believes a slot is available 
> to submit a new task to the pool. There is a time window in which the _main 
> loop_ thread may submit a 101st task to the pool, and the exception below is 
> raised:
> {code:java}
> java.util.concurrent.RejectedExecutionException: Task org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$$Lambda/0x00000070013f2a30@53c6a1bb rejected from org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor@1f9c1653[Running, pool size = 100, active threads = 100, queued tasks = 0, completed tasks = 100]
>       at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2081) ~[?:?]
>       at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:841) ~[?:?]
>       at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1376) ~[?:?]
>       at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.execute(ExecutorUtil.java:361) ~[main/:?]
>       at org.apache.solr.cloud.OverseerTaskProcessor.run(OverseerTaskProcessor.java:376) ~[main/:?]
>       at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]
> {code}
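> The pattern can be reproduced outside of Solr with a standalone sketch that 
> mimics it (an illustration only, not Solr code): a pool backed by a 
> {{SynchronousQueue}} with at most 100 threads, a shared set of running IDs 
> that each worker removes in a {{finally}} block, and a "main loop" that 
> submits whenever the set has fewer than 100 entries. Because the window 
> between removing the ID and the worker thread returning to the pool is 
> narrow, it can take many submissions before {{execute()}} is rejected:
> {code:java}
> import java.util.Set;
> import java.util.concurrent.ConcurrentHashMap;
> import java.util.concurrent.RejectedExecutionException;
> import java.util.concurrent.SynchronousQueue;
> import java.util.concurrent.ThreadPoolExecutor;
> import java.util.concurrent.TimeUnit;
> 
> /** Standalone illustration of the race described above; not Solr code. */
> public class RunningTasksRace {
> 
>   private static final int MAX_PARALLEL_TASKS = 100;
> 
>   public static void main(String[] args) throws Exception {
>     // Same shape as the runner pool: at most 100 threads, SynchronousQueue, abort policy.
>     ThreadPoolExecutor runnerPool = new ThreadPoolExecutor(
>         0, MAX_PARALLEL_TASKS, 60, TimeUnit.SECONDS, new SynchronousQueue<>());
>     Set<Integer> runningTasks = ConcurrentHashMap.newKeySet();
> 
>     for (int id = 0; id < 1_000_000; id++) {
>       // "Main loop": the in-memory set is the only guard against over-submission.
>       while (runningTasks.size() >= MAX_PARALLEL_TASKS) {
>         Thread.onSpinWait();
>       }
>       final int taskId = id;
>       runningTasks.add(taskId);
>       try {
>         runnerPool.execute(() -> {
>           try {
>             Thread.sleep(5);               // simulate a short collection API task
>           } catch (InterruptedException e) {
>             Thread.currentThread().interrupt();
>           } finally {
>             // The ID is removed while this thread is still running and has not
>             // yet returned to the pool: the main loop may now submit a 101st
>             // task, which the pool rejects.
>             runningTasks.remove(taskId);
>           }
>         });
>       } catch (RejectedExecutionException e) {
>         System.out.println("Race reproduced after " + id + " submissions: " + e);
>         break;
>       }
>     }
>     runnerPool.shutdownNow();
>   }
> }
> {code}
> The rejection is probabilistic, which mirrors why the bug only shows up under 
> sustained high load on the overseer.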
>  
> *Impact*
> Once we hit the above exception, the lock on the collection/shard/replica is 
> not released: it was acquired before submitting the task to the _runner_ 
> thread pool, and the _runner_ thread that would have released it never runs. 
> This means all future tasks that need to lock the same 
> collection/shard/replica are never processed and remain in the ZooKeeper 
> queue forever, until the overseer is restarted.
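> In terms of the earlier sketch, the failure path is (hypothetical names 
> again):
> {code:java}
> // Sketch of the failure path (hypothetical names, simplified).
> Lock lock = lockTree.tryLock(task);          // acquired by the main loop thread
> runningTasks.add(taskId);
> runnerPool.execute(runnerFor(task, lock));   // throws RejectedExecutionException
> // The runner never starts, so the lock.unlock() it would have executed in its
> // finally block never runs; the collection/shard/replica stays locked until
> // the overseer is restarted.
> {code}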
>  
> Another impact is that the overseer never reads more than 1000 tasks from the 
> queue. Consequently, if more than 1000 tasks are blocked behind the 
> unreleased lock, later tasks are not read (and therefore not processed) even 
> if they do not require the same lock. Once 1000 tasks are waiting for the 
> lock, the overseer is fully blocked. It may take a long time between the 
> initial race condition and a fully blocked overseer (days or weeks, depending 
> on the number of submitted API calls).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
