[
https://issues.apache.org/jira/browse/SOLR-16497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Torsten Bøgh Köster updated SOLR-16497:
---------------------------------------
Issue Type: Improvement (was: Bug)
> Allow finer grained locking in SolrCores
> ----------------------------------------
>
> Key: SOLR-16497
> URL: https://issues.apache.org/jira/browse/SOLR-16497
> Project: Solr
> Issue Type: Improvement
> Security Level: Public(Default Security Level. Issues are Public)
> Components: search
> Affects Versions: 9.0, 8.11.2
> Reporter: Torsten Bøgh Köster
> Priority: Major
> Attachments: solrcores_locking.png, solrcores_locking_fixed.png
>
>
> Access to loaded SolrCore instances is a synchronized read and write
> operation in SolrCores#getCoreFromAnyList. This method is touched by every
> request as every HTTP request is assigned the SolrCore it operates on.
> h3. Background
> Under heavy load we discovered that application halts inside Solr are
> becoming a serious problem in high-traffic environments. Using Java Flight
> Recorder recordings we found a high accumulated halt time on the
> modifyLock in SolrCores. In our case this means that we can only utilize our
> machines up to 25% CPU usage. With the fix applied, utilization of up to 80%
> is perfectly doable.
> In our case this specific locking problem was masked by another locking
> problem in the SlowCompositeReaderWrapper. We'll submit our fix for that
> locking problem in a follow-up issue.
> h3. Problem
> Our Solr instances utilize the collapse component heavily. The instances run
> with 32 cores and 32 GB Java heap on a rather small index (4 GB). The instances
> scale out at 50% CPU load. We take Java Flight Recorder snapshots of 60
> seconds as soon as the CPU usage exceeds 50%.
> !solrcores_locking.png|height=1024px!
> During our 60s Java Flight Recorder snapshot, the ~2k Jetty acceptor threads
> accumulated more than 12h of locking time inside SolrCores on the modifyLock
> instance, which is used as a synchronized lock (see screenshot). With this
> fix, exclusive locking is limited to write accesses. We validated this using
> another JFR snapshot:
> !solrcores_locking_fixed.png|height=1024px!
> We ran this code for a couple of weeks in our live environment.
> h3. Implementation
> The synchronized modifyLock is replaced by a ReentrantReadWriteLock. This
> allows concurrent reads from the internal SolrCore instance list (cores) but
> grants exclusive access to write operations.
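A minimal sketch of the read/write locking pattern described above. It is not the actual patch: the class `CoreRegistry`, its methods, and the use of a plain `String` in place of a `SolrCore` are assumptions made for illustration only.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for SolrCores; a String value stands in for a SolrCore.
class CoreRegistry {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final Map<String, String> cores = new HashMap<>();

    // Lookups take the shared read lock, so many request threads can
    // resolve their core at the same time.
    String getCore(String name) {
        lock.readLock().lock();
        try {
            return cores.get(name);
        } finally {
            lock.readLock().unlock();
        }
    }

    // Modifications take the exclusive write lock, which blocks readers
    // and other writers for the duration of the update.
    void putCore(String name, String core) {
        lock.writeLock().lock();
        try {
            cores.put(name, core);
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```

With a single synchronized monitor, every lookup would serialize behind every other lookup; with the read lock, only writes force exclusive access.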
> We need to ensure that only a single transientSolrCoreCache is created
> inside TransientSolrCoreCacheFactoryDefault. As we now allow multiple
> concurrent read threads, we make the first call to
> getTransientCacheHandler() while holding the write lock in the load()
> method. This ensures that a single instance of transientSolrCoreCache is
> created.
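The idea can be sketched as follows, assuming the cache is created lazily on first access. The class `TransientCacheHolder`, the `Object` placeholder for the cache, and the `creations` counter are hypothetical and exist only to make the sketch checkable; only the method names `load()` and `getTransientCacheHandler()` come from the description above.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch: the Object field stands in for transientSolrCoreCache.
class TransientCacheHolder {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private Object transientCache;
    private int creations = 0;

    // Called once at startup. The first (creating) call happens under the
    // write lock, so no reader can race on the lazy initialization.
    void load() {
        lock.writeLock().lock();
        try {
            getTransientCacheHandler();
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Lazy creation; on its own this is not safe for concurrent first calls.
    private Object getTransientCacheHandler() {
        if (transientCache == null) {
            transientCache = new Object();
            creations++;
        }
        return transientCache;
    }

    // Readers only ever see the instance that load() already created.
    Object cache() {
        lock.readLock().lock();
        try {
            return getTransientCacheHandler();
        } finally {
            lock.readLock().unlock();
        }
    }

    int creations() {
        return creations;
    }
}
```

The sketch relies on `load()` running before any reader calls `cache()`; that ordering is an assumption here.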
> The lock signaling between SolrCore and CoreContainer gets replaced by a
> Condition that is tied to the write lock.
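A rough sketch of signaling through a Condition bound to the write lock. In `ReentrantReadWriteLock` only the write lock supports `newCondition()`; the read lock throws `UnsupportedOperationException`. The class `PendingCoreOps` and its methods are hypothetical and do not mirror the real SolrCore/CoreContainer code.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch: tracks core names that have an operation in flight.
class PendingCoreOps {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    // The Condition is created from the write lock and must be awaited and
    // signaled while holding that lock.
    private final Condition opFinished = lock.writeLock().newCondition();
    private final Set<String> pending = new HashSet<>();

    // Waits until no operation is pending for the name, then claims it.
    void begin(String name) {
        lock.writeLock().lock();
        try {
            while (pending.contains(name)) {
                opFinished.awaitUninterruptibly(); // releases the write lock while waiting
            }
            pending.add(name);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Marks the operation as done and wakes up all waiting threads.
    void finish(String name) {
        lock.writeLock().lock();
        try {
            pending.remove(name);
            opFinished.signalAll();
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Read-only check under the shared read lock.
    boolean isPending(String name) {
        lock.readLock().lock();
        try {
            return pending.contains(name);
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

This takes the place of `wait()`/`notifyAll()` on a synchronized monitor: waiters re-check their predicate in a loop after every wake-up.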
> h3. Summary
> This change allows finer-grained access to the list of open SolrCores.
> Because read access no longer blocks, the blocking times of the Solr
> application decrease noticeably (see screenshot).
> This change was developed jointly by Dennis Berger, Torsten Bøgh Köster
> and Marco Petris.