Hi Puneet, It certainly looks like there are a lot of bugs in load-average reporting - I never realized it was so shaky in those containerized environments! Thanks for the thorough writeup.
The question is what to do about it. On the one hand "load average" is only one of several circuit breakers that Solr offers, and it's likely still providing value for folks who happen to run in non-containerized environments. Maybe the best thing to do is to update our docs to highlight these limitations, and suggest folks running in Kubernetes, etc. steer clear of the load-avg circuit breaker?

Would you be willing to file a JIRA ticket to summarize the problem and propose how it might be addressed?

> OperatingSystemMXBean.getSystemCpuLoad() consistently reports values *close
> to 1.0 (100%)*.

You may know this already, but to highlight it for others: a CPU Load of 1.0 doesn't imply utilization of 100%. CPU Load, or load-average, is a measure of how many processes are currently using or waiting for a CPU. It's a distinct metric from CPU utilization, which measures what percentage of time your CPUs are utilized.

So having a CPU load of 1.0 and utilization of 20-30% isn't necessarily wrong or contradictory. It may be correct. (I would say "Those values are correct", if not for all of the issue-tracker links you shared above, which make a compelling theoretical case.)

Best,

Jason

On Mon, Jul 28, 2025 at 2:47 PM PUNEET SHARMA <puneetsharmaps...@gmail.com> wrote:
>
> Hi Team,
>
> Currently Solr's CPU circuit breaker mechanism relies on CPU load
> metrics obtained from the Java OperatingSystemMXBean. However, in some
> environments (notably when running in cloud platforms like Google Cloud
> Platform - GCP), this metric inaccurately reports CPU usage, causing the
> circuit breaker to trip unnecessarily. Here is the observed issue, root
> cause, supporting references, and a diagnostic utility used to investigate
> the problem.
>
> Solr’s CPU circuit breaker is using
> com.sun.management.OperatingSystemMXBean.getSystemCpuLoad() to monitor CPU
> usage. These metrics have been observed to return misleading values:
>
> - GCP monitoring shows average Solr CPU usage around *25-30%*.
> - OperatingSystemMXBean.getSystemCpuLoad() consistently reports values
>   *close to 1.0 (100%)*.
> - As a result, Solr’s CPU circuit breaker falsely assumes high load and
>   prematurely *trips*, potentially impacting service availability or
>   throttling requests unnecessarily.
>
> This discrepancy arises from a change in how CPU metrics are calculated in
> the JDK.
>
> cgroup configs
>
> CPUUsageNSec=378033177304000
> CPUAccounting=yes
> CPUWeight=[not set]
> StartupCPUWeight=[not set]
> CPUShares=[not set]
> StartupCPUShares=[not set]
> CPUQuotaPerSecUSec=infinity
> CPUQuotaPeriodUSec=infinity
> LimitCPU=infinity
> LimitCPUSoft=infinity
> CPUSchedulingPolicy=0
> CPUSchedulingPriority=0
> CPUAffinityFromNUMA=no
> CPUSchedulingResetOnFork=no
>
> *Relevant JDK Bugs and Fixes*
>
> *JDK-8248215*
>
> - *Title*: Improve OperatingSystemMXBean API to report CPU load
>   information for containers
> - *Link*: JDK-8248215 <https://bugs.openjdk.org/browse/JDK-8248215>
> - *Summary*: Introduced enhancements to better support reporting of CPU
>   metrics inside containerized environments.
>
> *JDK-8269851*
>
> - *Title*: OperatingSystemMXBean getSystemCpuLoad reports incorrect value
>   inside a container
> - *Link*: JDK-8269851 <https://bugs.openjdk.org/browse/JDK-8269851>
> - *Commit*: Github PR
>   <https://github.com/openjdk/jdk/commit/25f00d787cf56f6cdca6949115d04e7d8e675554#diff-2bc4c3408fc6fae6e133b8ffd644b933dcbe372cf249547d4c49ed94444c9735R45-R282>
> - *Impact*: Introduced changes that affect the internal behavior of
>   getSystemCpuLoad() and getProcessCpuLoad(). Post this change, the
>   reported CPU usage may not correctly reflect real CPU usage inside
>   containers.
>
> To verify the discrepancy, I added a class within Solr to print out
> real-time CPU load metrics as seen by the JVM.
>
> *MonitorCpu.java*
>
> // To compile:
> //   javac /path/to/solr/core/src/java/org/apache/solr/util/circuitbreaker/MonitorCpu.java
> // To run:
> //   java -cp /path/to/solr/core/src/java org.apache.solr.util.circuitbreaker.MonitorCpu
>
> package org.apache.solr.util.circuitbreaker;
>
> import com.sun.management.OperatingSystemMXBean;
> import java.lang.management.ManagementFactory;
>
> public class MonitorCpu {
>   public static void main(String[] args) {
>     OperatingSystemMXBean osBean =
>         (OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
>
>     while (true) {
>       // Value in [0.0, 1.0], or negative if unavailable.
>       double cpuLoad = osBean.getSystemCpuLoad(); // or getProcessCpuLoad()
>       System.out.printf("Current CPU load: %.2f%n", cpuLoad);
>
>       try {
>         Thread.sleep(1000); // Pause to reduce output rate
>       } catch (InterruptedException e) {
>         // Restore the interrupt flag and stop; otherwise sleep() would
>         // throw again immediately and the loop would spin.
>         Thread.currentThread().interrupt();
>         return;
>       }
>     }
>   }
> }
>
> *Observations from Execution*
>
> - The printed cpuLoad value often fluctuates near *1.0*, despite actual
>   CPU load being far lower.
> - This confirms the mismatch between Java-reported CPU metrics and actual
>   usage observed via system tools or GCP monitoring.
>
> *Implications for Solr*
>
> - Solr's CPU circuit breaker, relying on these metrics, is *misled into
>   believing the node is under high load*.
> - This can cause *premature degradation* or *request throttling*, even
>   when system resources are sufficient.
> - This is especially critical in *containerized* or *cloud-native*
>   deployments (e.g., Kubernetes, GKE), where resource quotas and
>   visibility differ from traditional environments.
>
> Is anyone else facing this issue with the Solr CPU circuit breaker?
>
> Should we change the metric used in Solr's circuit breakers?
>
> Can we divide the current metric by the number of available processors
> (Runtime.getRuntime().availableProcessors()) to get the correct value?
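
For anyone who wants to compare the numbers discussed in this thread on their own nodes, here is a minimal sketch that prints the one-minute load average, the load average divided by the processor count, and the two CPU-load fractions side by side. The class name CpuMetricsSnapshot and the helper normalizedLoad are hypothetical names made up for this illustration; they are not part of Solr. Per the JDK Javadoc, getSystemCpuLoad() and getProcessCpuLoad() already return a fraction in [0.0, 1.0], so dividing by availableProcessors() seems meaningful only for the load average. The sketch also assumes a JVM where the platform bean can be cast to com.sun.management.OperatingSystemMXBean.

```java
import com.sun.management.OperatingSystemMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical diagnostic class for illustration only; not part of Solr.
public class CpuMetricsSnapshot {

  /**
   * Divides a load average by the processor count. Returns -1.0 when the
   * load average is unavailable (negative) or the processor count is not
   * positive.
   */
  public static double normalizedLoad(double loadAverage, int processors) {
    if (loadAverage < 0 || processors <= 0) {
      return -1.0;
    }
    return loadAverage / processors;
  }

  @SuppressWarnings("deprecation") // getSystemCpuLoad() is deprecated since JDK 14
  public static void main(String[] args) {
    // Assumption: the cast succeeds, as it does on HotSpot-based JVMs.
    OperatingSystemMXBean osBean =
        (OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();

    int processors = Runtime.getRuntime().availableProcessors();
    double loadAverage = osBean.getSystemLoadAverage(); // negative if unavailable
    double systemCpuLoad = osBean.getSystemCpuLoad();   // [0.0, 1.0] or negative
    double processCpuLoad = osBean.getProcessCpuLoad(); // [0.0, 1.0] or negative

    System.out.printf("availableProcessors:     %d%n", processors);
    System.out.printf("1-min load average:      %.2f%n", loadAverage);
    System.out.printf("load average / procs:    %.2f%n",
        normalizedLoad(loadAverage, processors));
    System.out.printf("getSystemCpuLoad():      %.2f%n", systemCpuLoad);
    System.out.printf("getProcessCpuLoad():     %.2f%n", processCpuLoad);
  }
}
```

Running this inside the affected container and next to the platform's own monitoring should show which of the values disagrees with the 25-30% figure, which would be useful detail for a JIRA ticket.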