Looks like there have been some nice improvements in that feature since I was last in the code. I had recommended splitting out update and query; glad that got done.
CPU load is not especially useful, because it gets high after there is already a problem. Load average includes processes/threads waiting to run (in most OSs), so it goes high before there is a problem. A good setting for load average is somewhere between the number of CPUs and 2X the number of CPUs (one running, one waiting). Some OSs include processes in IO wait, I think. I think I explained all that when I updated the docs, but I don't see it in a quick scan of the current docs. Of course, the number of CPUs isn't a hard and fast number in containers.

There is some broken formatting at the end of this sentence: "For more information, see the Wikipedia page for Load."
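For illustration, here is a rough sketch of that rule of thumb using the standard getSystemLoadAverage() API (the class name and thresholds are made up for illustration, not anything that exists in Solr):

import java.lang.management.ManagementFactory;

public class LoadAverageRuleOfThumb {
  public static void main(String[] args) {
    // CPUs the JVM can see; on recent JDKs with container support this
    // generally reflects the cgroup CPU quota rather than the host count.
    int cpus = Runtime.getRuntime().availableProcessors();

    // 1-minute load average, or a negative value if the platform doesn't report one.
    double loadAvg = ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage();

    double lowWater = cpus;        // roughly one runnable task per CPU
    double highWater = 2.0 * cpus; // one running plus one waiting per CPU

    System.out.printf("cpus=%d loadAvg=%.2f%n", cpus, loadAvg);
    if (loadAvg < 0) {
      System.out.println("Load average not available on this platform");
    } else if (loadAvg >= highWater) {
      System.out.println("Above 2X the CPU count: already in trouble");
    } else if (loadAvg >= lowWater) {
      System.out.println("Between 1X and 2X the CPU count: getting busy");
    } else {
      System.out.println("Below the CPU count: fine");
    }
  }
}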
wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

> On Jul 29, 2025, at 4:49 AM, Jason Gerlowski <gerlowsk...@gmail.com> wrote:
>
> Hi Puneet,
>
> It certainly looks like there are a lot of bugs in load-average reporting - I never realized it was so shaky in those containerized environments! Thanks for the thorough writeup.
>
> The question is what to do about it. On the one hand, "load average" is only one of several circuit breakers that Solr offers, and it's likely still providing value for folks who happen to run in non-containerized environments. Maybe the best thing to do is to update our docs to highlight these limitations, and suggest folks running in Kubernetes, etc. steer clear of the load-avg circuit breaker?
>
> Would you be willing to file a JIRA ticket to summarize the problem and propose how it might be addressed?
>
>> OperatingSystemMXBean.getSystemCpuLoad() consistently reports values *close to 1.0 (100%)*.
>
> You may know this already, but to highlight it for others: a CPU load of 1.0 doesn't imply utilization of 100%.
>
> CPU load, or load average, is a measure of how many processes are currently using or waiting for a CPU. It's a distinct metric from CPU utilization, which measures what percentage of time your CPUs are utilized.
>
> So having a CPU load of 1.0 and utilization of 20-30% isn't necessarily wrong or contradictory. It may be correct. (I would say "Those values are correct", if not for all of the issue-tracker links you shared above, which make a compelling theoretical case.)
>
> Best,
>
> Jason
>
> On Mon, Jul 28, 2025 at 2:47 PM PUNEET SHARMA <puneetsharmaps...@gmail.com> wrote:
>>
>> Hi Team,
>>
>> Currently Solr's CPU circuit breaker mechanism relies on CPU load metrics obtained from the Java OperatingSystemMXBean. However, in some environments (notably when running on cloud platforms like Google Cloud Platform - GCP), this metric inaccurately reports CPU usage, causing the circuit breaker to trip unnecessarily. Here is the observed issue, the root cause, supporting references, and a diagnostic utility used to investigate the problem.
>>
>> Solr's CPU circuit breaker uses com.sun.management.OperatingSystemMXBean.getSystemCpuLoad() to monitor CPU usage. These metrics have been observed to return misleading values:
>>
>> - GCP monitoring shows average Solr CPU usage around *25-30%*.
>> - OperatingSystemMXBean.getSystemCpuLoad() consistently reports values *close to 1.0 (100%)*.
>> - As a result, Solr's CPU circuit breaker falsely assumes high load and prematurely *trips*, potentially impacting service availability or throttling requests unnecessarily.
>>
>> This discrepancy arises from a change in how CPU metrics are calculated in the JDK.
>>
>> *cgroup configs*
>>
>> CPUUsageNSec=378033177304000
>> CPUAccounting=yes
>> CPUWeight=[not set]
>> StartupCPUWeight=[not set]
>> CPUShares=[not set]
>> StartupCPUShares=[not set]
>> CPUQuotaPerSecUSec=infinity
>> CPUQuotaPeriodUSec=infinity
>> LimitCPU=infinity
>> LimitCPUSoft=infinity
>> CPUSchedulingPolicy=0
>> CPUSchedulingPriority=0
>> CPUAffinityFromNUMA=no
>> CPUSchedulingResetOnFork=no
>>
>> *Relevant JDK Bugs and Fixes*
>>
>> *JDK-8248215*
>> - *Title*: Improve OperatingSystemMXBean API to report CPU load information for containers
>> - *Link*: JDK-8248215 <https://bugs.openjdk.org/browse/JDK-8248215>
>> - *Summary*: Introduced enhancements to better support reporting of CPU metrics inside containerized environments.
>>
>> *JDK-8269851*
>> - *Title*: OperatingSystemMXBean getSystemCpuLoad reports incorrect value inside a container
>> - *Link*: JDK-8269851 <https://bugs.openjdk.org/browse/JDK-8269851>
>> - *Commit*: GitHub PR <https://github.com/openjdk/jdk/commit/25f00d787cf56f6cdca6949115d04e7d8e675554#diff-2bc4c3408fc6fae6e133b8ffd644b933dcbe372cf249547d4c49ed94444c9735R45-R282>
>> - *Impact*: Introduced changes that affect the internal behavior of getSystemCpuLoad() and getProcessCpuLoad(). After this change, the reported CPU usage may not correctly reflect real CPU usage inside containers.
>>
>> To verify the discrepancy, I added a class within Solr to print out real-time CPU load metrics as seen by the JVM.
>>
>> *MonitorCpu.java*
>>
>> // To compile:
>> // javac /path/to/solr/core/src/java/org/apache/solr/util/circuitbreaker/MonitorCpu.java
>> // To run:
>> // java -cp /path/to/solr/core/src/java org.apache.solr.util.circuitbreaker.MonitorCpu
>>
>> package org.apache.solr.util.circuitbreaker;
>>
>> import com.sun.management.OperatingSystemMXBean;
>> import java.lang.management.ManagementFactory;
>>
>> public class MonitorCpu {
>>   public static void main(String[] args) {
>>     OperatingSystemMXBean osBean =
>>         (OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
>>
>>     while (true) {
>>       double cpuLoad = osBean.getSystemCpuLoad(); // or getProcessCpuLoad()
>>       System.out.printf("Current CPU load: %.2f%n", cpuLoad);
>>
>>       try {
>>         Thread.sleep(1000); // Pause to reduce output rate
>>       } catch (InterruptedException e) {
>>         Thread.currentThread().interrupt();
>>       }
>>     }
>>   }
>> }
>>
>> *Observations from Execution*
>>
>> - The printed cpuLoad value often fluctuates near *1.0*, despite actual CPU load being far lower.
>> - Confirms the mismatch between Java-reported CPU metrics and actual usage observed via system tools or GCP monitoring.
>>
>> *Implications for Solr*
>>
>> - Solr's CPU circuit breaker, relying on these metrics, is *misled into believing the node is under high load*.
>> - Can cause *premature degradation* or *request throttling*, even when system resources are sufficient.
>> - Especially critical in *containerized* or *cloud-native* deployments (e.g., Kubernetes, GKE), where resource quotas and visibility differ from traditional environments.
>>
>> Is anyone facing this issue with the Solr CPU circuit breaker?
>>
>> Should we change the metric used in the Solr circuit breakers?
>>
>> Can we divide the current metric by the number of available processors (Runtime.getRuntime().availableProcessors()) to get the correct value?
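On the last question: getSystemCpuLoad() is documented to return a fraction in the 0.0-1.0 range already, so dividing it by the processor count would presumably only help if the container-related bugs inflate it by roughly that factor; load average is the metric that is more commonly normalized per CPU. A minimal diagnostic sketch of that side-by-side comparison (not Solr code; the class name and output format are made up) might look like:

import com.sun.management.OperatingSystemMXBean;
import java.lang.management.ManagementFactory;

public class NormalizedLoadProbe {
  public static void main(String[] args) throws InterruptedException {
    // Cast to the com.sun.management variant to reach getSystemCpuLoad().
    OperatingSystemMXBean os =
        (OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();

    while (true) {
      int cpus = Runtime.getRuntime().availableProcessors(); // cgroup-aware on recent JDKs
      double cpuLoad = os.getSystemCpuLoad();     // 0.0-1.0 fraction; deprecated in newer JDKs in favor of getCpuLoad()
      double loadAvg = os.getSystemLoadAverage(); // 1-minute load average, or negative if unsupported
      double perCpuLoad = loadAvg >= 0 ? loadAvg / cpus : -1;

      System.out.printf("cpus=%d systemCpuLoad=%.2f loadAvg=%.2f loadAvg/cpus=%.2f%n",
          cpus, cpuLoad, loadAvg, perCpuLoad);
      Thread.sleep(1000);
    }
  }
}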