Looks like most of those JDK reports are fixed in Java 18. What version was the OP on?
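For anyone who wants to check their own node, here is a minimal sketch (not from the thread below) that prints the JVM version next to the two metrics being discussed. The class name `CpuMetricsProbe` and the helper `perCpuLoad` are made up for illustration. It assumes a HotSpot-style JVM where the platform bean can be cast to `com.sun.management.OperatingSystemMXBean`. The per-CPU division only illustrates the normalization asked about at the end of the quoted message, and only for load average; it is not a claim about what Solr does.

```java
import com.sun.management.OperatingSystemMXBean;
import java.lang.management.ManagementFactory;

public class CpuMetricsProbe {

    // Hypothetical helper: load average divided by processor count.
    // A negative load average means "not available", so return -1.0 in that case.
    static double perCpuLoad(double loadAverage, int processors) {
        if (loadAverage < 0 || processors <= 0) {
            return -1.0;
        }
        return loadAverage / processors;
    }

    @SuppressWarnings("deprecation") // getSystemCpuLoad() is deprecated since JDK 14
    public static void main(String[] args) {
        // Assumption: the JVM exposes the com.sun.management extension bean.
        OperatingSystemMXBean os =
            (OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
        int cpus = Runtime.getRuntime().availableProcessors();

        System.out.println("java.version        = " + System.getProperty("java.version"));
        System.out.println("availableProcessors = " + cpus);

        // Run-queue based metric; may be negative if the platform cannot report it.
        double loadAvg = os.getSystemLoadAverage();
        System.out.println("systemLoadAverage   = " + loadAvg);
        System.out.println("loadAverage per CPU = " + perCpuLoad(loadAvg, cpus));

        // Utilization-style metrics in [0.0, 1.0]; negative if not available.
        // The very first sample can be unrepresentative, so treat it as a rough reading.
        System.out.println("systemCpuLoad       = " + os.getSystemCpuLoad());
        System.out.println("processCpuLoad      = " + os.getProcessCpuLoad());
    }
}
```

Running this inside the container and again on the host, on the same JDK build, should show whether the discrepancy follows the JVM version or the cgroup setup.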

On Tue, Jul 29, 2025 at 7:49 AM Jason Gerlowski <gerlowsk...@gmail.com> wrote:

> Hi Puneet,
>
> It certainly looks like there are a lot of bugs in load-average
> reporting - I never realized it was so shaky in those containerized
> environments! Thanks for the thorough writeup.
>
> The question is what to do about it. On the one hand, "load average"
> is only one of several circuit breakers that Solr offers, and it's
> likely still providing value for folks who happen to run in
> non-containerized environments. Maybe the best thing to do is to
> update our docs to highlight these limitations, and suggest folks
> running in Kubernetes, etc. steer clear of the load-avg circuit
> breaker?
>
> Would you be willing to file a JIRA ticket to summarize the problem
> and propose how it might be addressed?
>
> > OperatingSystemMXBean.getSystemCpuLoad() consistently reports values
> > *close to 1.0 (100%)*.
>
> You may know this already, but to highlight it for others: a CPU load
> of 1.0 doesn't imply utilization of 100%.
>
> CPU load, or load average, is a measure of how many processes are
> currently using or waiting for a CPU. It's a distinct metric from CPU
> utilization, which measures what percentage of time your CPUs are
> utilized.
>
> So having a CPU load of 1.0 and utilization of 20-30% isn't
> necessarily wrong or contradictory. It may be correct. (I would say
> "Those values are correct", if not for all of the issue-tracker links
> you shared above, which make a compelling theoretical case.)
>
> Best,
>
> Jason
>
> On Mon, Jul 28, 2025 at 2:47 PM PUNEET SHARMA
> <puneetsharmaps...@gmail.com> wrote:
> >
> > Hi Team,
> >
> > Currently Solr's CPU circuit breaker mechanism relies on CPU load
> > metrics obtained from the Java OperatingSystemMXBean. However, in
> > some environments (notably on cloud platforms such as Google Cloud
> > Platform, GCP), this metric reports CPU usage inaccurately, causing
> > the circuit breaker to trip unnecessarily.
> > Here is the observed issue, root cause, supporting references, and a
> > diagnostic utility used to investigate the problem.
> >
> > Solr’s CPU circuit breaker is using
> > com.sun.management.OperatingSystemMXBean.getSystemCpuLoad() to
> > monitor CPU usage. These metrics have been observed to return
> > misleading values:
> >
> > - GCP monitoring shows average Solr CPU usage around *25-30%*.
> > - OperatingSystemMXBean.getSystemCpuLoad() consistently reports
> >   values *close to 1.0 (100%)*.
> > - As a result, Solr’s CPU circuit breaker falsely assumes high load
> >   and prematurely *trips*, potentially impacting service
> >   availability or throttling requests unnecessarily.
> >
> > This discrepancy arises from a change in how CPU metrics are
> > calculated in the JDK.
> >
> > cgroup configs:
> >
> > CPUUsageNSec=378033177304000
> > CPUAccounting=yes
> > CPUWeight=[not set]
> > StartupCPUWeight=[not set]
> > CPUShares=[not set]
> > StartupCPUShares=[not set]
> > CPUQuotaPerSecUSec=infinity
> > CPUQuotaPeriodUSec=infinity
> > LimitCPU=infinity
> > LimitCPUSoft=infinity
> > CPUSchedulingPolicy=0
> > CPUSchedulingPriority=0
> > CPUAffinityFromNUMA=no
> > CPUSchedulingResetOnFork=no
> >
> > *Relevant JDK Bugs and Fixes*
> >
> > *JDK-8248215*
> >
> > - *Title*: Improve OperatingSystemMXBean API to report CPU load
> >   information for containers
> > - *Link*: JDK-8248215 <https://bugs.openjdk.org/browse/JDK-8248215>
> > - *Summary*: Introduced enhancements to better support reporting of
> >   CPU metrics inside containerized environments.
> >
> > *JDK-8269851*
> >
> > - *Title*: OperatingSystemMXBean getSystemCpuLoad reports incorrect
> >   value inside a container
> > - *Link*: JDK-8269851 <https://bugs.openjdk.org/browse/JDK-8269851>
> > - *Commit*: GitHub PR
> >   <https://github.com/openjdk/jdk/commit/25f00d787cf56f6cdca6949115d04e7d8e675554#diff-2bc4c3408fc6fae6e133b8ffd644b933dcbe372cf249547d4c49ed94444c9735R45-R282>
> > - *Impact*: Introduced changes that affect the internal behavior of
> >   getSystemCpuLoad() and getProcessCpuLoad(). After this change, the
> >   reported CPU usage may not correctly reflect real CPU usage inside
> >   containers.
> >
> > To verify the discrepancy, I added a class within Solr that prints
> > real-time CPU load metrics as seen by the JVM.
> >
> > *MonitorCpu.java*
> >
> > // To compile:
> > //   javac /path/to/solr/core/src/java/org/apache/solr/util/circuitbreaker/MonitorCpu.java
> > // To run:
> > //   java -cp /path/to/solr/core/src/java org.apache.solr.util.circuitbreaker.MonitorCpu
> >
> > package org.apache.solr.util.circuitbreaker;
> >
> > import com.sun.management.OperatingSystemMXBean;
> > import java.lang.management.ManagementFactory;
> >
> > public class MonitorCpu {
> >   public static void main(String[] args) {
> >     OperatingSystemMXBean osBean =
> >         (OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
> >
> >     while (true) {
> >       double cpuLoad = osBean.getSystemCpuLoad(); // or getProcessCpuLoad()
> >       System.out.printf("Current CPU load: %.2f%n", cpuLoad);
> >
> >       try {
> >         Thread.sleep(1000); // Pause to reduce output rate
> >       } catch (InterruptedException e) {
> >         Thread.currentThread().interrupt(); // Restore the interrupt flag
> >         return; // Exit rather than spin with the flag still set
> >       }
> >     }
> >   }
> > }
> >
> > *Observations from Execution*
> >
> > - The printed cpuLoad value often fluctuates near *1.0*, despite
> >   actual CPU load being far lower.
> > - This confirms the mismatch between Java-reported CPU metrics and
> >   actual usage observed via system tools or GCP monitoring.
> >
> > *Implications for Solr*
> >
> > - Solr's CPU circuit breaker, relying on these metrics, is *misled
> >   into believing the node is under high load*.
> > - This can cause *premature degradation* or *request throttling*,
> >   even when system resources are sufficient.
> > - It is especially critical in *containerized* or *cloud-native*
> >   deployments (e.g., Kubernetes, GKE), where resource quotas and
> >   visibility differ from traditional environments.
> >
> > Is anyone else facing this issue with Solr's CPU circuit breaker?
> >
> > Should we change the metric used in Solr's circuit breakers?
> >
> > Can we divide the current metric by the number of available
> > processors (Runtime.getRuntime().availableProcessors()) to get the
> > correct value?

-- 
http://www.needhamsoftware.com (work)
https://a.co/d/b2sZLD9 (my fantasy fiction book)