Looks like there have been some nice improvements in that feature since I was last in the code. I had recommended splitting out update and query; glad that got done.
CPU load is not especially useful, because it gets high after there is already a problem. Load average includes processes/threads waiting to run (in most OSs), so it goes high before there is a problem. A good setting for load average is somewhere between the number of CPUs and 2X the number of CPUs (one running, one waiting). Some OSs include processes in IO wait, I think. I think I explained all that when I updated the docs, but I don't see it in a quick scan of the current docs. Of course, the number of CPUs isn't a hard and fast number in containers.

There is some broken formatting at the end of this sentence: "For more information, see the Wikipedia page for Load."
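For illustration, here is a rough sketch of that rule of thumb using the standard getSystemLoadAverage() API (the class name and thresholds are made up for illustration, not anything that exists in Solr):

import java.lang.management.ManagementFactory;

public class LoadAverageRuleOfThumb {
  public static void main(String[] args) {
    // CPUs the JVM can see; on recent JDKs with container support this
    // generally reflects the cgroup CPU quota rather than the host count.
    int cpus = Runtime.getRuntime().availableProcessors();

    // 1-minute load average, or a negative value if the platform doesn't report one.
    double loadAvg = ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage();

    double lowWater = cpus;        // roughly one runnable task per CPU
    double highWater = 2.0 * cpus; // one running plus one waiting per CPU

    System.out.printf("cpus=%d loadAvg=%.2f%n", cpus, loadAvg);
    if (loadAvg < 0) {
      System.out.println("Load average not available on this platform");
    } else if (loadAvg >= highWater) {
      System.out.println("Above 2X the CPU count: already in trouble");
    } else if (loadAvg >= lowWater) {
      System.out.println("Between 1X and 2X the CPU count: getting busy");
    } else {
      System.out.println("Below the CPU count: fine");
    }
  }
}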
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

> On Jul 29, 2025, at 4:49 AM, Jason Gerlowski <gerlowsk...@gmail.com> wrote:
>
> Hi Puneet,
>
> It certainly looks like there are a lot of bugs in load-average reporting - I never realized it was so shaky in those containerized environments! Thanks for the thorough writeup.
>
> The question is what to do about it. On the one hand, "load average" is only one of several circuit breakers that Solr offers, and it's likely still providing value for folks who happen to run in non-containerized environments. Maybe the best thing to do is to update our docs to highlight these limitations, and suggest folks running in Kubernetes, etc. steer clear of the load-avg circuit breaker?
>
> Would you be willing to file a JIRA ticket to summarize the problem and propose how it might be addressed?
>
>> OperatingSystemMXBean.getSystemCpuLoad() consistently reports values *close to 1.0 (100%)*.
>
> You may know this already, but to highlight it for others: a CPU load of 1.0 doesn't imply utilization of 100%.
>
> CPU load, or load average, is a measure of how many processes are currently using or waiting for a CPU. It's a distinct metric from CPU utilization, which measures what percentage of time your CPUs are utilized.
>
> So having a CPU load of 1.0 and utilization of 20-30% isn't necessarily wrong or contradictory. It may be correct. (I would say "Those values are correct", if not for all of the issue-tracker links you shared above, which make a compelling theoretical case.)
>
> Best,
>
> Jason
>
> On Mon, Jul 28, 2025 at 2:47 PM PUNEET SHARMA <puneetsharmaps...@gmail.com> wrote:
>>
>> Hi Team,
>>
>> Currently Solr's CPU circuit breaker mechanism relies on CPU load metrics obtained from the Java OperatingSystemMXBean. However, in some environments (notably when running on cloud platforms like Google Cloud Platform - GCP), this metric inaccurately reports CPU usage, causing the circuit breaker to trip unnecessarily. Here is the observed issue, the root cause, supporting references, and a diagnostic utility used to investigate the problem.
>>
>> Solr's CPU circuit breaker uses com.sun.management.OperatingSystemMXBean.getSystemCpuLoad() to monitor CPU usage. These metrics have been observed to return misleading values:
>>
>> - GCP monitoring shows average Solr CPU usage around *25-30%*.
>> - OperatingSystemMXBean.getSystemCpuLoad() consistently reports values *close to 1.0 (100%)*.
>> - As a result, Solr's CPU circuit breaker falsely assumes high load and prematurely *trips*, potentially impacting service availability or throttling requests unnecessarily.
>>
>> This discrepancy arises from a change in how CPU metrics are calculated in the JDK.
>>
>> *cgroup configs*
>>
>> CPUUsageNSec=378033177304000
>> CPUAccounting=yes
>> CPUWeight=[not set]
>> StartupCPUWeight=[not set]
>> CPUShares=[not set]
>> StartupCPUShares=[not set]
>> CPUQuotaPerSecUSec=infinity
>> CPUQuotaPeriodUSec=infinity
>> LimitCPU=infinity
>> LimitCPUSoft=infinity
>> CPUSchedulingPolicy=0
>> CPUSchedulingPriority=0
>> CPUAffinityFromNUMA=no
>> CPUSchedulingResetOnFork=no
>>
>> *Relevant JDK Bugs and Fixes*
>>
>> *JDK-8248215*
>> - *Title*: Improve OperatingSystemMXBean API to report CPU load information for containers
>> - *Link*: JDK-8248215 <https://bugs.openjdk.org/browse/JDK-8248215>
>> - *Summary*: Introduced enhancements to better support reporting of CPU metrics inside containerized environments.
>>
>> *JDK-8269851*
>> - *Title*: OperatingSystemMXBean getSystemCpuLoad reports incorrect value inside a container
>> - *Link*: JDK-8269851 <https://bugs.openjdk.org/browse/JDK-8269851>
>> - *Commit*: GitHub PR <https://github.com/openjdk/jdk/commit/25f00d787cf56f6cdca6949115d04e7d8e675554#diff-2bc4c3408fc6fae6e133b8ffd644b933dcbe372cf249547d4c49ed94444c9735R45-R282>
>> - *Impact*: Introduced changes that affect the internal behavior of getSystemCpuLoad() and getProcessCpuLoad(). After this change, the reported CPU usage may not correctly reflect real CPU usage inside containers.
>>
>> To verify the discrepancy, I added a class within Solr to print out real-time CPU load metrics as seen by the JVM.
>>
>> *MonitorCpu.java*
>>
>> // To compile:
>> // javac /path/to/solr/core/src/java/org/apache/solr/util/circuitbreaker/MonitorCpu.java
>> // To run:
>> // java -cp /path/to/solr/core/src/java org.apache.solr.util.circuitbreaker.MonitorCpu
>>
>> package org.apache.solr.util.circuitbreaker;
>>
>> import com.sun.management.OperatingSystemMXBean;
>> import java.lang.management.ManagementFactory;
>>
>> public class MonitorCpu {
>>   public static void main(String[] args) {
>>     OperatingSystemMXBean osBean =
>>         (OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
>>
>>     while (true) {
>>       double cpuLoad = osBean.getSystemCpuLoad(); // or getProcessCpuLoad()
>>       System.out.printf("Current CPU load: %.2f%n", cpuLoad);
>>
>>       try {
>>         Thread.sleep(1000); // Pause to reduce output rate
>>       } catch (InterruptedException e) {
>>         Thread.currentThread().interrupt();
>>       }
>>     }
>>   }
>> }
>>
>> *Observations from Execution*
>>
>> - The printed cpuLoad value often fluctuates near *1.0*, despite actual CPU load being far lower.
>> - Confirms the mismatch between Java-reported CPU metrics and actual usage observed via system tools or GCP monitoring.
>>
>> *Implications for Solr*
>>
>> - Solr's CPU circuit breaker, relying on these metrics, is *misled into believing the node is under high load*.
>> - Can cause *premature degradation* or *request throttling*, even when system resources are sufficient.
>> - Especially critical in *containerized* or *cloud-native* deployments (e.g., Kubernetes, GKE), where resource quotas and visibility differ from traditional environments.
>>
>> Is anyone facing this issue with the Solr CPU circuit breaker?
>>
>> Should we change the metric used in the Solr circuit breakers?
>>
>> Can we divide the current metric by the number of available processors (Runtime.getRuntime().availableProcessors()) to get the correct value?
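On the last question: getSystemCpuLoad() is documented to return a fraction in the 0.0-1.0 range already, so dividing it by the processor count would presumably only help if the container-related bugs inflate it by roughly that factor; load average is the metric that is more commonly normalized per CPU. A minimal diagnostic sketch of that side-by-side comparison (not Solr code; the class name and output format are made up) might look like:

import com.sun.management.OperatingSystemMXBean;
import java.lang.management.ManagementFactory;

public class NormalizedLoadProbe {
  public static void main(String[] args) throws InterruptedException {
    // Cast to the com.sun.management variant to reach getSystemCpuLoad().
    OperatingSystemMXBean os =
        (OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();

    while (true) {
      int cpus = Runtime.getRuntime().availableProcessors(); // cgroup-aware on recent JDKs
      double cpuLoad = os.getSystemCpuLoad();     // 0.0-1.0 fraction; deprecated in newer JDKs in favor of getCpuLoad()
      double loadAvg = os.getSystemLoadAverage(); // 1-minute load average, or negative if unsupported
      double perCpuLoad = loadAvg >= 0 ? loadAvg / cpus : -1;

      System.out.printf("cpus=%d systemCpuLoad=%.2f loadAvg=%.2f loadAvg/cpus=%.2f%n",
          cpus, cpuLoad, loadAvg, perCpuLoad);
      Thread.sleep(1000);
    }
  }
}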