tboeghk opened a new pull request, #1155:
URL: https://github.com/apache/solr/pull/1155

   https://issues.apache.org/jira/browse/SOLR-16497
   
   Access to loaded `SolrCore` instances is a synchronized read and write 
operation in `SolrCores#getCoreFromAnyList`. This method is on the hot path: 
every HTTP request is assigned the `SolrCore` it operates on.
   
   ### Background
   
   Under heavy load we discovered that application halts inside Solr become a 
serious problem in high-traffic environments. Using Java Flight Recorder 
recordings we found high accumulated application halts on the `modifyLock` in 
`SolrCores`. In our case this meant we could only utilize our machines up to 
25% CPU usage. With the fix applied, a utilization of up to 80% is perfectly 
doable.
   
   > In our case this specific locking problem was masked by another locking 
problem in the `SlowCompositeReaderWrapper`.
   > We'll submit our fix for the locking problem in the 
`SlowCompositeReaderWrapper` in a following issue.
   
   ### Description
   
   Our Solr instances utilize the `collapse` component heavily. The instances 
run with 32 CPU cores and a 32 GB Java heap on a rather small index (4 GB). The 
instances scale out at 50% CPU load. We take 60-second Java Flight Recorder 
snapshots as soon as the CPU usage exceeds 50%.
   
   <img width="993" alt="solr-issues-solrcores-locking" 
src="https://user-images.githubusercontent.com/557264/196221505-c0d819ea-574a-41f5-80ae-1b0218225b32.png">
   
   During our 60-second Java Flight Recorder snapshot, the ~2k Jetty acceptor 
threads accumulated more than 12 hours of locking time inside `SolrCores` on 
the `modifyLock` instance, which is used as a synchronized lock (see 
screenshot). With this fix applied, lock contention is reduced to write 
accesses only. We validated this with another JFR snapshot:
   
   <img width="990" alt="solr-issues-solrcores-after" 
src="https://user-images.githubusercontent.com/557264/196221528-362e5d7f-022a-4aa8-9cd7-844f59a61102.png">
   
   We ran this code for a couple of weeks in our live environment as a 
backport to Solr 8.11.2. The fix in this PR is built against the `main` branch.
   
   ### Solution
   
   The synchronized `modifyLock` is replaced by a 
[`ReentrantReadWriteLock`](https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/locks/ReentrantReadWriteLock.html).
 This allows concurrent reads of the internal `SolrCore` instance list 
(`cores`) while granting exclusive access to write operations on that list.
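
   As an illustration of the pattern, here is a minimal sketch with 
hypothetical names (`CoreRegistry`, `getCore`, `putCore` are illustrative, not 
the actual `SolrCores` code):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch: reads of the core list run concurrently,
// writes remain exclusive.
class CoreRegistry {
  private final Map<String, Object> cores = new HashMap<>();
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

  Object getCore(String name) {
    lock.readLock().lock(); // many readers may hold this at once
    try {
      return cores.get(name);
    } finally {
      lock.readLock().unlock();
    }
  }

  void putCore(String name, Object core) {
    lock.writeLock().lock(); // writers get exclusive access
    try {
      cores.put(name, core);
    } finally {
      lock.writeLock().unlock();
    }
  }
}
```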
   
   In Solr 9.x the cache inside `TransientSolrCoreCacheFactoryDefault` adds 
overflow handling to the size-based internal cache 
([SOLR-15964](https://issues.apache.org/jira/browse/SOLR-15964)). As soon as 
`SolrCore`s are evicted from the internal cache, the cache's eviction handler 
changes the behaviour from a size-based cache to a reference-based one: 
`SolrCore` instances that are still referenced are inserted back into the 
cache. This means that write operations on the cache (inserting a `SolrCore`) 
can be issued during read operations in `SolrCores`. These operations hold 
only a read lock, which cannot be upgraded to a write lock (doing so 
deadlocks).
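
   For illustration, a minimal standalone demo of why upgrading fails with 
`ReentrantReadWriteLock` (a hypothetical demo class, not part of the patch):

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// The write lock is only granted when no readers hold the lock, including
// the current thread, so a read lock can never be upgraded in place.
public class LockUpgradeDemo {
  public static void main(String[] args) {
    ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    lock.readLock().lock();
    // tryLock() fails instead of deadlocking, showing the upgrade is refused:
    System.out.println(lock.writeLock().tryLock()); // prints "false"
    // A plain writeLock().lock() here would block this thread forever.
    lock.readLock().unlock();
  }
}
```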
   
   To overcome this, we moved the cache maintenance (including the eviction 
handler) in `TransientSolrCoreCacheFactoryDefault` to a separate thread. That 
thread can acquire a write lock, but on its own it would produce ping-pong 
behaviour in the eviction handler whenever the cache is full and the evicted 
`SolrCore`s are still referenced: they would be re-inserted and immediately 
evicted again. To avoid this, we made the overflow behaviour explicit by 
adding an additional `overflowCores` instance that holds cores evicted from 
the `transientCores` cache while they are still referenced, as sketched below.
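
   A rough sketch of the idea, assuming a Caffeine cache as introduced by 
SOLR-15964; `isReferenced()` is a hypothetical stand-in for the real reference 
check, and the signatures are simplified:

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.RemovalCause;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class TransientCacheSketch {
  // Hypothetical stand-in for the real SolrCore reference check.
  static class SolrCore {
    boolean isReferenced() { return true; }
  }

  // Evicted-but-still-referenced cores are parked here instead of being
  // re-inserted into the size-based cache (which would ping-pong).
  final Map<String, SolrCore> overflowCores = new ConcurrentHashMap<>();

  final Cache<String, SolrCore> transientCores = Caffeine.newBuilder()
      .maximumSize(32)
      .evictionListener((String name, SolrCore core, RemovalCause cause) -> {
        if (cause == RemovalCause.SIZE && core != null && core.isReferenced()) {
          overflowCores.put(name, core); // keep it alive outside the cache
        }
      })
      .build();
}
```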
   
   Furthermore, we need to ensure that only a single `transientSolrCoreCache` 
is created inside `TransientSolrCoreCacheFactoryDefault`. As we now allow 
multiple concurrent readers, we call `getTransientCacheHandler()` while 
holding a write lock inside the `load()` method: a write lock is only needed 
for this initial call (for cache creation), and for all other calls a read 
lock is sufficient. By default, `getTransientCacheHandler()` acquires a read 
lock. If a write lock is already held (e.g. for core creation), 
`getTransientCacheHandlerInternal()` is called instead; it deliberately takes 
no lock itself, leaving the caller free to choose between a read lock and a 
write lock. This ensures that a single instance of `transientSolrCoreCache` is 
created.
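
   Sketched, the lock-choice pattern looks roughly like this (method bodies 
are simplified and `Object` stands in for the real cache type):

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

class TransientCacheAccess {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private Object transientSolrCoreCache; // the single cache instance

  // load() calls the internal method while holding the write lock,
  // so the one-time creation happens under exclusive access.
  void load() {
    lock.writeLock().lock();
    try {
      getTransientCacheHandlerInternal();
    } finally {
      lock.writeLock().unlock();
    }
  }

  // Default entry point: after load(), a shared read lock is sufficient.
  Object getTransientCacheHandler() {
    lock.readLock().lock();
    try {
      return getTransientCacheHandlerInternal();
    } finally {
      lock.readLock().unlock();
    }
  }

  // Takes no lock itself; the caller chooses read or write lock. Creation
  // only happens via load(), which holds the write lock.
  private Object getTransientCacheHandlerInternal() {
    if (transientSolrCoreCache == null) {
      transientSolrCoreCache = new Object(); // stands in for the real cache
    }
    return transientSolrCoreCache;
  }
}
```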
   
   The lock signaling between `SolrCore` and `CoreContainer` is replaced by a 
[`Condition`](https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/locks/Condition.html)
 that is tied to the write lock.
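
   A minimal sketch of this signaling pattern (names are illustrative, not the 
actual patch); note that only the write lock of a `ReentrantReadWriteLock` 
supports conditions:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Replaces Object.wait()/notifyAll() on modifyLock with a Condition
// bound to the write lock.
class CoreCloseSignal {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final Condition coreClosed = lock.writeLock().newCondition();
  private int openCores = 1;

  void awaitAllClosed() throws InterruptedException {
    lock.writeLock().lock(); // a Condition requires its owning lock
    try {
      while (openCores > 0) {
        coreClosed.await(1, TimeUnit.SECONDS);
      }
    } finally {
      lock.writeLock().unlock();
    }
  }

  void onCoreClosed() {
    lock.writeLock().lock();
    try {
      openCores--;
      coreClosed.signalAll(); // wake any waiters
    } finally {
      lock.writeLock().unlock();
    }
  }
}
```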
   
   This change allows finer-grained access to the list of open `SolrCores`. 
The reduced blocking on read access is visible in the decreased blocking times 
of the Solr application (see screenshot).
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [x] I have reviewed the guidelines for [How to 
Contribute](https://wiki.apache.org/solr/HowToContribute) and my code conforms 
to the standards described there to the best of my ability.
   - [x] I have created a Jira issue and added the issue ID to my pull request 
title.
   - [x] I have given Solr maintainers 
[access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork)
 to contribute to my PR branch. (optional but recommended)
   - [x] I have developed this patch against the `main` branch.
   - [x] I have run `./gradlew check`.
   - [x] I have added tests for my changes.
   - [ ] I have added documentation for the [Reference 
Guide](https://github.com/apache/solr/tree/main/solr/solr-ref-guide)
   

