[
https://issues.apache.org/jira/browse/SOLR-16497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17627057#comment-17627057
]
Torsten Bøgh Köster commented on SOLR-16497:
--------------------------------------------
GitHub PR is finally done. Adapting it to Solr 9.x took some time, but the
tests are finally green :-)
> Allow finer grained locking in SolrCores
> ----------------------------------------
>
> Key: SOLR-16497
> URL: https://issues.apache.org/jira/browse/SOLR-16497
> Project: Solr
> Issue Type: Improvement
> Security Level: Public(Default Security Level. Issues are Public)
> Components: search
> Affects Versions: 9.0, 8.11.2
> Reporter: Torsten Bøgh Köster
> Priority: Major
> Attachments: solrcores_locking.png, solrcores_locking_fixed.png
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Access to loaded SolrCore instances is a synchronized read and write
> operation in SolrCores#getCoreFromAnyList. This method is touched by every
> request as every HTTP request is assigned the SolrCore it operates on.
> h3. Background
> Under heavy load we discovered that application halts inside Solr become a
> serious problem in high-traffic environments. Using Java Flight Recordings
> we found high accumulated application halt times on the modifyLock in
> SolrCores. In our case this means that we can only utilize our machines up
> to 25% CPU usage. With the fix applied, a utilization of up to 80% is
> perfectly doable.
> In our case this specific locking problem was masked by another locking
> problem in the SlowCompositeReaderWrapper. We'll submit our fix for the
> locking problem in the SlowCompositeReaderWrapper in a follow-up issue.
> h3. Problem
> Our Solr instances utilize the collapse component heavily. The instances run
> with 32 cores and 32 GB Java heap on a rather small index (4 GB). The
> instances scale out at 50% CPU load. We take Java Flight Recorder snapshots
> of 60 seconds as soon as the CPU usage exceeds 50%.
> !solrcores_locking.png|height=1024px!
> During our 60s Java Flight Recorder snapshot, the ~2k Jetty acceptor threads
> accumulated more than 12h locking time inside SolrCores on the modifyLock
> instance used as synchronized lock (see screenshot). With this fix the
> locking access is reduced to write accesses only. We validated this using
> another JFR snapshot:
> !solrcores_locking_fixed.png|height=1024px!
> We ran this code for a couple of weeks in our live environment.
> h3. Solution
> The synchronized modifyLock is replaced by a ReentrantReadWriteLock. This
> allows concurrent reads from the internal SolrCore instance list (cores) but
> grants exclusive access to write operations on the instance list (cores).
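A minimal sketch of the replacement described above, assuming a plain map
guarded by a ReentrantReadWriteLock. The class CoreRegistry and its methods
are hypothetical stand-ins for illustration, not the actual SolrCores code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for SolrCores; cores are plain Objects here.
class CoreRegistry {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final Map<String, Object> cores = new HashMap<>();

    // Readers share the read lock, so lookups do not block each other.
    Object getCore(String name) {
        lock.readLock().lock();
        try {
            return cores.get(name);
        } finally {
            lock.readLock().unlock();
        }
    }

    // Writers take the write lock and get exclusive access to the map.
    void putCore(String name, Object core) {
        lock.writeLock().lock();
        try {
            cores.put(name, core);
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

With a synchronized block, every lookup would serialize on the same monitor;
here only putCore() excludes other threads.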
> In Solr 9.x the cache inside the TransientSolrCoreCacheFactoryDefault adds a
> cache overflow handling of the size based internal cache (SOLR-15964). As
> soon as SolrCores are evicted from the internal cache, the cache behaviour
> changes from a size based cache to a reference based cache via the cache's
> eviction handler. SolrCore instances that are still referenced are inserted
> back into the cache. This means that write operations to the cache (insert
> SolrCore) can be issued during read operations in SolrCores. These
> operations hold only a read lock, which cannot be upgraded to a write lock
> (the attempt would deadlock).
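The upgrade restriction can be observed directly on a
ReentrantReadWriteLock. The sketch below uses tryLock() so that it returns
instead of blocking forever; the class LockUpgradeDemo is a hypothetical
helper, not part of Solr.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical demo class; shows upgrade vs. downgrade on one thread.
class LockUpgradeDemo {
    // Tries to take the write lock while already holding the read lock.
    static boolean canUpgrade() {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        lock.readLock().lock();
        try {
            boolean gotWrite = lock.writeLock().tryLock();
            if (gotWrite) {
                lock.writeLock().unlock();
            }
            return gotWrite;
        } finally {
            lock.readLock().unlock();
        }
    }

    // Tries to take the read lock while already holding the write lock.
    static boolean canDowngrade() {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        lock.writeLock().lock();
        try {
            boolean gotRead = lock.readLock().tryLock();
            if (gotRead) {
                lock.readLock().unlock();
            }
            return gotRead;
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

canUpgrade() returns false and canDowngrade() returns true. A blocking
writeLock().lock() in place of tryLock() in canUpgrade() would never return,
which is the deadlock mentioned above.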
> To overcome this, we moved the cache maintenance (including the eviction
> handler) in TransientSolrCoreCacheFactoryDefault to a separate thread. This
> thread can acquire a write lock. On the other hand, a separate thread
> causes ping-pong behaviour in the eviction handler when the cache is full
> of SolrCores that are still referenced. To overcome this we made the
> overflow behaviour transparent by adding an additional overflowCores
> instance, to which we add cores that were evicted from the transientCores
> cache but are still referenced.
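To illustrate the overflow idea only: the sketch below parks evicted but
still referenced entries in a second map instead of re-inserting them into
the bounded cache. It is built on assumptions: a LinkedHashMap stands in for
the real size-based cache, eviction runs inline rather than on a separate
thread, and OverflowingCache and the stillReferenced predicate are
hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;

// Hypothetical sketch; not TransientSolrCoreCacheFactoryDefault itself.
class OverflowingCache<V> {
    // Evicted entries that are still in use end up here.
    private final Map<String, V> overflow = new ConcurrentHashMap<>();
    private final Map<String, V> bounded;

    OverflowingCache(int maxSize, Predicate<V> stillReferenced) {
        // Access-ordered LinkedHashMap acting as a small LRU cache.
        this.bounded = new LinkedHashMap<String, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
                if (size() <= maxSize) {
                    return false;
                }
                // Park referenced entries instead of re-inserting them,
                // which would immediately trigger another eviction.
                if (stillReferenced.test(eldest.getValue())) {
                    overflow.put(eldest.getKey(), eldest.getValue());
                }
                return true;
            }
        };
    }

    synchronized void put(String name, V value) {
        overflow.remove(name);
        bounded.put(name, value);
    }

    // Lookups consult the bounded cache first, then the overflow map.
    synchronized V get(String name) {
        V value = bounded.get(name);
        return value != null ? value : overflow.get(name);
    }

    synchronized int overflowSize() {
        return overflow.size();
    }
}
```

Because evicted entries never go back into the bounded cache, a full cache of
referenced entries no longer evicts and re-inserts the same items in a loop.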
> Furthermore we need to ensure that only a single transientSolrCoreCache
> inside TransientSolrCoreCacheFactoryDefault is created. As we now allow
> multiple read threads, we call the getTransientCacheHandler() method
> initially holding a write lock inside the load() method. Calling the method
> only needs a write lock initially (for cache creation). For all other calls,
> a read lock is sufficient. By default, getTransientCacheHandler() acquires
> a read lock. If a write is needed (e.g. for core creation),
> getTransientCacheHandlerInternal() is called. This method explicitly does not
> use a lock in order to provide the flexibility to choose between a read-lock
> and a write-lock. This ensures that a single instance of
> transientSolrCoreCache is created.
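The locking split between the public getter and the lock-free internal
method could look roughly like this. It is a sketch under assumptions:
HandlerHolder, getHandler() and getHandlerInternal() are hypothetical names
mirroring the methods described above, and the handler is a plain Object.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch of single-instance creation under a write lock.
class HandlerHolder {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private Object handler;
    private int creations = 0;

    // Initial call: holds the write lock, so creation is exclusive.
    void load() {
        lock.writeLock().lock();
        try {
            getHandlerInternal();
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Default accessor: a read lock is enough once the handler exists.
    Object getHandler() {
        lock.readLock().lock();
        try {
            return getHandlerInternal();
        } finally {
            lock.readLock().unlock();
        }
    }

    // Takes no lock itself; the caller chooses read or write lock.
    private Object getHandlerInternal() {
        if (handler == null) {
            if (!lock.isWriteLockedByCurrentThread()) {
                throw new IllegalStateException("load() must run first");
            }
            handler = new Object();
            creations++;
        }
        return handler;
    }

    int creationCount() {
        return creations;
    }
}
```

The guard in getHandlerInternal() is an addition for this sketch; it makes
the assumption explicit that creation only ever happens under the write lock.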
> The lock signaling between SolrCore and CoreContainer gets replaced by a
> Condition that is tied to the write lock.
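A Condition can only be obtained from the write lock of a
ReentrantReadWriteLock (the read lock does not support conditions). The
sketch below shows the wait/signal pattern under that constraint; the class
PendingCores and its methods are hypothetical and do not reproduce the
actual SolrCore/CoreContainer signaling.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch: waiting for a core to leave a "pending" state.
class PendingCores {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    // Replaces wait()/notifyAll() on a synchronized monitor object.
    private final Condition changed = lock.writeLock().newCondition();
    private final Set<String> pending = new HashSet<>();

    void markPending(String name) {
        lock.writeLock().lock();
        try {
            pending.add(name);
        } finally {
            lock.writeLock().unlock();
        }
    }

    void markDone(String name) {
        lock.writeLock().lock();
        try {
            pending.remove(name);
            changed.signalAll(); // wake every waiter so it can re-check
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Returns true once the core is no longer pending, false on timeout
    // or interruption.
    boolean awaitDone(String name, long timeoutMillis) {
        long nanos = TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        lock.writeLock().lock();
        try {
            while (pending.contains(name)) {
                if (nanos <= 0L) {
                    return false;
                }
                // Releases the write lock while waiting, reacquires after.
                nanos = changed.awaitNanos(nanos);
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

Waiting in a loop that re-checks the state guards against spurious wakeups,
the same discipline that wait()/notifyAll() requires.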
> h3. Summary
> This change allows finer grained access to the list of open SolrCores.
> The reduced blocking on read access shows up as decreased blocking times
> of the Solr application (see screenshot).
> This change has been put together by Dennis Berger, Torsten Bøgh Köster
> and Marco Petris.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)