[ 
https://issues.apache.org/jira/browse/SOLR-16497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17627057#comment-17627057
 ] 

Torsten Bøgh Köster commented on SOLR-16497:
--------------------------------------------

GitHub PR is finally done. Adapting it to Solr 9.x took some time, but the 
tests are finally green :-)

> Allow finer grained locking in SolrCores
> ----------------------------------------
>
>                 Key: SOLR-16497
>                 URL: https://issues.apache.org/jira/browse/SOLR-16497
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: search
>    Affects Versions: 9.0, 8.11.2
>            Reporter: Torsten Bøgh Köster
>            Priority: Major
>         Attachments: solrcores_locking.png, solrcores_locking_fixed.png
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Access to loaded SolrCore instances is a synchronized read and write 
> operation in SolrCores#getCoreFromAnyList. This method is touched by every 
> request as every HTTP request is assigned the SolrCore it operates on.
> h3. Background
> Under heavy load we discovered that application halts inside of Solr become 
> a serious problem in high-traffic environments. Using Java Flight Recordings 
> we discovered high accumulated application halts on the modifyLock in 
> SolrCores. In our case this means that we can only utilize our machines up 
> to 25% CPU usage. With the fix applied, a utilization of up to 80% is 
> perfectly doable.
> In our case this specific locking problem was masked by another locking 
> problem in the SlowCompositeReaderWrapper. We'll submit our fix to the 
> locking problem in the SlowCompositeReaderWrapper in a following issue.
> h3. Problem
> Our Solr instances utilize the collapse component heavily. The instances run 
> with 32 cores and 32 GB Java heap on a rather small index (4 GB). The 
> instances scale out at 50% CPU load. We take Java Flight Recorder snapshots 
> of 60 seconds as soon as the CPU usage exceeds 50%.
>  !solrcores_locking.png|height=1024px! 
> During our 60s Java Flight Recorder snapshot, the ~2k Jetty acceptor threads 
> accumulated more than 12h locking time inside SolrCores on the modifyLock 
> instance used as synchronized lock (see screenshot). With this fix the 
> locking access is reduced to write accesses only. We validated this using 
> another JFR snapshot:
>  !solrcores_locking_fixed.png|height=1024px! 
> We ran this code for a couple of weeks in our live environment.
> h3. Solution
> The synchronized modifyLock is replaced by a ReentrantReadWriteLock. This 
> allows concurrent reads from the internal SolrCore instance list (cores) but 
> grants exclusive access to write operations on the instance list (cores).
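
The locking pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code from the pull request: the class CoreRegistrySketch, its methods, and the use of plain Object as a stand-in for SolrCore are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for SolrCores; names are illustrative, not Solr's API.
class CoreRegistrySketch {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final Map<String, Object> cores = new HashMap<>();

  // Lookups take the shared read lock, so request threads do not block each other.
  Object getCore(String name) {
    lock.readLock().lock();
    try {
      return cores.get(name);
    } finally {
      lock.readLock().unlock();
    }
  }

  // Mutations of the core list take the exclusive write lock.
  void putCore(String name, Object core) {
    lock.writeLock().lock();
    try {
      cores.put(name, core);
    } finally {
      lock.writeLock().unlock();
    }
  }

  // Removal is a mutation as well and therefore also needs the write lock.
  Object removeCore(String name) {
    lock.writeLock().lock();
    try {
      return cores.remove(name);
    } finally {
      lock.writeLock().unlock();
    }
  }
}
```

With a single synchronized monitor, every lookup would serialize on the same lock. With the read/write split, only putCore and removeCore exclude other threads.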
> In Solr 9.x the cache inside the TransientSolrCoreCacheFactoryDefault adds a 
> cache overflow handling of the size based internal cache (SOLR-15964). As 
> soon as SolrCores are evicted from the internal cache, the cache behaviour 
> changes from a size based cache to a reference based cache via the cache's 
> eviction handler. SolrCore instances that are still referenced are inserted 
> back into the cache. This means that write operations to the cache (insert 
> SolrCore) can be issued during read operations in SolrCores. These read 
> operations hold only a read lock, which cannot be upgraded to a write lock, 
> so such a write would end in a deadlock.
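
The upgrade restriction is a documented property of java.util.concurrent.locks.ReentrantReadWriteLock and can be observed in isolation. The sketch below uses a hypothetical class name (LockUpgradeSketch) and uses tryLock() instead of lock(), so that the failed upgrade returns false rather than blocking forever.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical demo class; only ReentrantReadWriteLock itself is real API.
class LockUpgradeSketch {

  // Tries to take the write lock while the current thread holds the read lock.
  static boolean tryUpgrade() {
    ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    lock.readLock().lock();
    try {
      // Fails while any read lock is held, even one held by this same thread.
      boolean upgraded = lock.writeLock().tryLock();
      if (upgraded) {
        lock.writeLock().unlock();
      }
      return upgraded;
    } finally {
      lock.readLock().unlock();
    }
  }

  // The opposite direction (write lock held, then read lock) is allowed.
  static boolean tryDowngrade() {
    ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    lock.writeLock().lock();
    try {
      boolean downgraded = lock.readLock().tryLock();
      if (downgraded) {
        lock.readLock().unlock();
      }
      return downgraded;
    } finally {
      lock.writeLock().unlock();
    }
  }
}
```

A blocking writeLock().lock() call in tryUpgrade() would wait for the read lock to be released, and that read lock is held by the waiting thread itself.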
> To overcome this, we moved the cache maintenance (including the eviction 
> handler) in TransientSolrCoreCacheFactoryDefault to a separate thread, which 
> can acquire a write lock. On the other hand, with a separate thread the 
> eviction handler shows a ping-pong behaviour on a full cache whose SolrCores 
> are still referenced. To overcome this we made the overflow behaviour 
> transparent by adding an additional overflowCores instance. Here we add 
> cores that were evicted from the transientCores cache but are still 
> referenced.
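
The overflow idea can be illustrated with a deliberately simplified, single-threaded sketch. It does not use Solr's actual cache implementation or a maintenance thread: OverflowCacheSketch, the nested Core class with its refCount field, and the LinkedHashMap-based size limit are all assumptions chosen to keep the example self-contained. Only the names transientCores and overflowCores are taken from the description.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, not thread-safe: callers would have to hold the write lock.
class OverflowCacheSketch {

  // Hypothetical stand-in for a SolrCore with a reference count.
  static final class Core {
    final String name;
    int refCount;

    Core(String name) {
      this.name = name;
    }
  }

  private final Map<String, Core> overflowCores = new HashMap<>();
  private final LinkedHashMap<String, Core> transientCores;

  OverflowCacheSketch(final int maxSize) {
    // Access-ordered map that evicts its eldest entry once maxSize is exceeded.
    this.transientCores = new LinkedHashMap<String, Core>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, Core> eldest) {
        if (size() <= maxSize) {
          return false;
        }
        // Still referenced: park the core in the overflow map instead of
        // re-inserting it into the size-bounded cache.
        if (eldest.getValue().refCount > 0) {
          overflowCores.put(eldest.getKey(), eldest.getValue());
        }
        return true;
      }
    };
  }

  void put(Core core) {
    transientCores.put(core.name, core);
  }

  // Looks in the bounded cache first, then in the overflow map.
  Core get(String name) {
    Core core = transientCores.get(name);
    return core != null ? core : overflowCores.get(name);
  }

  int overflowSize() {
    return overflowCores.size();
  }
}
```

Because evicted cores go to a separate map rather than back into the bounded cache, one eviction cannot trigger another.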
> Furthermore we need to ensure that only a single transientSolrCoreCache 
> inside TransientSolrCoreCacheFactoryDefault is created. As we now allow 
> multiple read threads, we call the getTransientCacheHandler() method 
> initially holding a write lock inside the load() method. Calling the method 
> only needs a write lock initially (for cache creation). For all other calls, 
> a read lock is sufficient. By default, the getTransientCacheHandler() 
> acquires a read lock. If a write is needed (e.g. for core creation), the 
> getTransientCacheHandlerInternal() is called. This method explicitly does not 
> use a lock in order to provide the flexibility to choose between a read-lock 
> and a write-lock. This ensures that a single instance of 
> transientSolrCoreCache is created.
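
A rough sketch of this split between a locking accessor and a lock-free internal accessor might look as follows. The method names load(), getTransientCacheHandler() and getTransientCacheHandlerInternal() come from the description above. The surrounding class TransientCacheHolderSketch and the plain Object used as the cache are hypothetical simplifications.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical holder class; the cache is modelled as a plain Object.
class TransientCacheHolderSketch {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private Object transientSolrCoreCache;

  // Initial call: creates the cache while holding the write lock.
  void load() {
    lock.writeLock().lock();
    try {
      getTransientCacheHandlerInternal();
    } finally {
      lock.writeLock().unlock();
    }
  }

  // Default accessor: a read lock is enough once the cache exists.
  Object getTransientCacheHandler() {
    lock.readLock().lock();
    try {
      return getTransientCacheHandlerInternal();
    } finally {
      lock.readLock().unlock();
    }
  }

  // Takes no lock itself; the caller chooses the read lock or the write lock.
  Object getTransientCacheHandlerInternal() {
    if (transientSolrCoreCache == null) {
      // Safe only because load() reaches this branch under the write lock.
      transientSolrCoreCache = new Object();
    }
    return transientSolrCoreCache;
  }
}
```

The sketch relies on load() being called before any reader. Without that, two readers could both enter the creation branch under the shared read lock.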
> The lock signaling between SolrCore and CoreContainer gets replaced by a 
> Condition that is tied to the write lock.
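
As an illustration of signaling through a Condition bound to the write lock, consider the sketch below. It is an assumption-laden example, not the actual SolrCore/CoreContainer code: LoadSignalSketch, its loading set, and its three methods are hypothetical. In ReentrantReadWriteLock, only the write lock supports newCondition(). The read lock throws UnsupportedOperationException.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical example of wait/notify replaced by a write-lock Condition.
class LoadSignalSketch {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final Condition changed = lock.writeLock().newCondition();
  private final Set<String> loading = new HashSet<>();

  void startLoading(String name) {
    lock.writeLock().lock();
    try {
      loading.add(name);
    } finally {
      lock.writeLock().unlock();
    }
  }

  void finishLoading(String name) {
    lock.writeLock().lock();
    try {
      loading.remove(name);
      changed.signalAll(); // wake up all threads waiting in awaitLoaded()
    } finally {
      lock.writeLock().unlock();
    }
  }

  // Returns true once the core is no longer loading, false on timeout or interrupt.
  boolean awaitLoaded(String name, long timeoutMillis) {
    long nanos = TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
    lock.writeLock().lock();
    try {
      while (loading.contains(name)) {
        if (nanos <= 0L) {
          return false;
        }
        // Releases the write lock while waiting and re-acquires it afterwards.
        nanos = changed.awaitNanos(nanos);
      }
      return true;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    } finally {
      lock.writeLock().unlock();
    }
  }
}
```

The while loop re-checks the state after every wake-up, which guards against spurious wake-ups and against signals meant for other cores.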
> h3. Summary
> This change allows for a finer grained access to the list of open SolrCores. 
> The decreased blocking read access is noticeable in decreased blocking times 
> of the Solr application (see screenshot).
> This change has been put together by Dennis Berger, Torsten Bøgh Köster 
> and Marco Petris.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
