[ https://issues.apache.org/jira/browse/SOLR-16497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Torsten Bøgh Köster updated SOLR-16497:
---------------------------------------
    Description: 
Access to loaded SolrCore instances is a synchronized read and write operation 
in SolrCores#getCoreFromAnyList. This method is hit by every request, as each 
HTTP request is assigned the SolrCore it operates on.

h3. Background

Under heavy load we discovered that application halts inside Solr become a 
serious problem in high-traffic environments. Using Java Flight Recordings we 
found high accumulated application halts on the modifyLock in SolrCores. In 
our case this meant that we could only utilize our machines up to 25% CPU 
usage. With the fix applied, a utilization of up to 80% is perfectly doable.

In our case this specific locking problem was masked by another locking problem 
in the SlowCompositeReaderWrapper. We'll submit our fix for that one in a 
follow-up issue.

h3. Problem

Our Solr instances utilize the collapse component heavily. The instances run 
with 32 cores and 32gb of Java heap on a rather small index (4gb). The 
instances scale out at 50% CPU load. We take Java Flight Recorder snapshots of 
60 seconds as soon as the CPU usage exceeds 50%.

 !solrcores_locking.png|width=1024! 

During our 60s Java Flight Recorder snapshot, the ~2k Jetty acceptor threads 
accumulated more than 12h of locking time inside SolrCores on the modifyLock 
instance used as a synchronized lock (see screenshot). With this fix, locking 
access is reduced to write accesses only. We validated this using another JFR 
snapshot:

 !solrcores_locking_fixed.png|width=1024! 

We ran this code for a couple of weeks in our live environment.

h3. Solution

The synchronized modifyLock is replaced by a ReentrantReadWriteLock. This 
allows concurrent reads from the internal SolrCore instance list (cores) but 
grants exclusive access to write operations on the instance list (cores).
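
The core of the change can be sketched like this (a minimal illustration, not 
the actual SolrCores code; {{CoreRegistry}}, {{getCore}} and {{putCore}} are 
hypothetical names). Lookups on the hot request path take the shared read 
lock, while list mutations take the exclusive write lock:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch: the synchronized modifyLock is replaced by a
// ReentrantReadWriteLock so request threads no longer serialize on reads.
class CoreRegistry {
    private final Map<String, Object> cores = new HashMap<>(); // stands in for the SolrCore list
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    // Read path: many request threads may hold the read lock concurrently.
    Object getCore(String name) {
        lock.readLock().lock();
        try {
            return cores.get(name);
        } finally {
            lock.readLock().unlock();
        }
    }

    // Write path: exclusive access while the core list is mutated.
    void putCore(String name, Object core) {
        lock.writeLock().lock();
        try {
            cores.put(name, core);
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```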

In Solr 9.x the cache inside TransientSolrCoreCacheFactoryDefault adds 
overflow handling to the size-based internal cache (SOLR-15964). As soon as 
SolrCores are evicted from the internal cache, the cache behaviour changes 
from a size-based cache to a reference-based cache via the cache's eviction 
handler: SolrCore instances that are still referenced are inserted back into 
the cache. This means that write operations on the cache (inserting a 
SolrCore) can be issued during read operations in SolrCores. These operations 
hold only a read lock, which cannot be upgraded to a write lock (deadlock).
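
A tiny self-contained sketch of that upgrade restriction ({{LockUpgradeDemo}} 
is a hypothetical name): a thread holding the read lock of a 
ReentrantReadWriteLock is never granted the write lock, so a blocking 
{{writeLock().lock()}} would deadlock; {{tryLock()}} makes the refusal visible:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of why the eviction-handler write cannot run under a read lock:
// ReentrantReadWriteLock does not support upgrading a held read lock.
class LockUpgradeDemo {
    static boolean canUpgrade() {
        ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        lock.readLock().lock();
        try {
            // Returns false: the write lock is refused while this thread
            // still holds the read lock. A blocking lock() would never return.
            return lock.writeLock().tryLock();
        } finally {
            lock.readLock().unlock();
        }
    }
}
```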

To overcome this, we moved the cache maintenance (including the eviction 
handler) in TransientSolrCoreCacheFactoryDefault to a separate thread. Such a 
thread can acquire a write lock, but on a full cache with SolrCores still 
referenced it would produce a ping-pong behaviour in the eviction handler 
(evict, re-insert, evict again). To avoid this, we made the overflow behaviour 
explicit by adding an additional overflowCores instance that holds cores 
evicted from the transientCores cache but still referenced.
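
As a rough, hypothetical illustration of this overflow handling (not Solr's 
actual cache implementation; a size-bounded LinkedHashMap stands in for the 
transient cache, and an Integer value stands in for an open-reference count): 
the eviction handler parks still-referenced entries in a separate 
overflowCores map instead of re-inserting them into the full cache:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: evicted-but-still-referenced cores move to overflowCores
// instead of ping-ponging back into the size-bounded transientCores cache.
class TransientCacheSketch {
    final Map<String, Integer> overflowCores = new LinkedHashMap<>();
    final Map<String, Integer> transientCores =
        new LinkedHashMap<String, Integer>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Integer> eldest) {
                if (size() > 2) { // tiny cache bound, for the sketch only
                    if (eldest.getValue() > 0) { // still referenced: park it
                        overflowCores.put(eldest.getKey(), eldest.getValue());
                    }
                    return true; // evict from the size-based cache either way
                }
                return false;
            }
        };
}
```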

Furthermore, we need to ensure that only a single transientSolrCoreCache is 
created inside TransientSolrCoreCacheFactoryDefault. As we now allow multiple 
reader threads, we initially call the getTransientCacheHandler() method while 
holding a write lock inside the load() method. Only this initial call needs a 
write lock (for cache creation); for all other calls, a read lock is 
sufficient. By default, getTransientCacheHandler() acquires a read lock. If a 
write lock is needed (e.g. for core creation), getTransientCacheHandlerInternal() 
is called instead. That method deliberately acquires no lock itself, giving the 
caller the flexibility to choose between a read lock and a write lock. This 
ensures that a single instance of transientSolrCoreCache is created.
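
The locking split can be sketched as follows (a hypothetical simplification; 
only the method names follow the description above, and a plain Object stands 
in for the cache). Creation happens exactly once, under the write lock held by 
load(); afterwards the public accessor only needs the read lock:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch: the public accessor takes a read lock, while callers that
// already hold the write lock use the deliberately unlocked internal
// variant, so the single cache instance is created under the write lock.
class CacheFactorySketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private Object transientSolrCoreCache; // created once, under the write lock

    // load() holds the write lock; the cache is created here.
    void load() {
        lock.writeLock().lock();
        try {
            getTransientCacheHandlerInternal();
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Default accessor: a read lock is sufficient once the cache exists.
    Object getTransientCacheHandler() {
        lock.readLock().lock();
        try {
            return getTransientCacheHandlerInternal();
        } finally {
            lock.readLock().unlock();
        }
    }

    // Deliberately unlocked: the caller chooses read or write lock.
    private Object getTransientCacheHandlerInternal() {
        if (transientSolrCoreCache == null) {
            transientSolrCoreCache = new Object();
        }
        return transientSolrCoreCache;
    }
}
```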

The lock signaling between SolrCore and CoreContainer is replaced by a 
Condition tied to the write lock.
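
A minimal sketch of this signaling pattern ({{CoreSignaling}}, {{awaitClose}} 
and {{markClosed}} are hypothetical names, not the actual Solr methods): 
waiters await on a Condition created from the write lock, and the thread that 
changes state signals them while holding that lock, replacing the old 
wait()/notifyAll() on the synchronized modifyLock:

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch: a Condition tied to the write lock replaces wait()/notifyAll().
class CoreSignaling {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final Condition coreClosed = lock.writeLock().newCondition();
    private boolean closed = false;

    // Blocks until markClosed() has been called.
    void awaitClose() {
        lock.writeLock().lock();
        try {
            while (!closed) {
                coreClosed.awaitUninterruptibly(); // releases the write lock while waiting
            }
        } finally {
            lock.writeLock().unlock();
        }
    }

    // State change and signal both happen under the write lock.
    void markClosed() {
        lock.writeLock().lock();
        try {
            closed = true;
            coreClosed.signalAll();
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```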

h3. Summary

This change allows finer-grained access to the list of open SolrCores. The 
reduced blocking on read access is noticeable in decreased blocking times of 
the Solr application (see screenshot).

This change was put together by Dennis Berger, Torsten Bøgh Köster and 
Marco Petris.


> Allow finer grained locking in SolrCores
> ----------------------------------------
>
>                 Key: SOLR-16497
>                 URL: https://issues.apache.org/jira/browse/SOLR-16497
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: search
>    Affects Versions: 9.0, 8.11.2
>            Reporter: Torsten Bøgh Köster
>            Priority: Major
>         Attachments: solrcores_locking.png, solrcores_locking_fixed.png
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
