I ran this on:

openjdk version "21.0.5" 2024-10-15 LTS
OpenJDK Runtime Environment (Red_Hat-21.0.5.0.11-1) (build 21.0.5+11-LTS)
OpenJDK 64-Bit Server VM (Red_Hat-21.0.5.0.11-1) (build 21.0.5+11-LTS, mixed mode, sharing)

Agreed, the JDK reports are marked as fixed in Java 18, but the issue does
not seem to be resolved: the metric still reports the wrong value. I ran
tests on a 4-core machine [no restriction on cpuQuota] where GCP monitoring
shows average CPU usage around 25-30%, while the metric I pulled from
http://search-solr:8983/solr/admin/metrics?prefix=os.systemCpuLoad&wt=json
reports a value close to 1.0, and Solr rejects requests with 429.

On Tue, Jul 29, 2025 at 7:13 PM Gus Heck <gus.h...@gmail.com> wrote:

> Looks like most of those JDK reports are fixed in Java 18. What version was
> the OP on?
>
> On Tue, Jul 29, 2025 at 7:49 AM Jason Gerlowski <gerlowsk...@gmail.com>
> wrote:
>
> > Hi Puneet,
> >
> > It certainly looks like there are a lot of bugs in load-average
> > reporting - I never realized it was so shaky in those containerized
> > environments! Thanks for the thorough writeup.
> >
> > The question is what to do about it. On the one hand "load average"
> > is only one of several circuit breakers that Solr offers, and it's
> > likely still providing value for folks who happen to run in
> > non-containerized environments. Maybe the best thing to do is to
> > update our docs to highlight these limitations, and suggest folks
> > running in Kubernetes, etc. steer clear of the load-avg circuit
> > breaker?
> >
> > Would you be willing to file a JIRA ticket to summarize the problem
> > and propose how it might be addressed?
> >
> > > OperatingSystemMXBean.getSystemCpuLoad() consistently reports values
> > > *close to 1.0 (100%)*.
> >
> > You may know this already, but to highlight it for others: a CPU Load
> > of 1.0 doesn't imply utilization of 100%.
> >
> > CPU Load, or load-average, is a measure of how many processes are
> > currently using or waiting for a CPU.
> > It's a distinct metric from CPU utilization, which measures what
> > percentage of time your CPUs are utilized.
> >
> > So having a CPU load of 1.0 and utilization of 20-30% isn't necessarily
> > wrong or contradictory. It may be correct. (I would say "Those
> > values are correct", if not for all of the issue-tracker links you
> > shared above, which make a compelling theoretical case.)
> >
> > Best,
> >
> > Jason
> >
> > On Mon, Jul 28, 2025 at 2:47 PM PUNEET SHARMA
> > <puneetsharmaps...@gmail.com> wrote:
> > >
> > > Hi Team,
> > >
> > > Currently Solr's CPU circuit breaker mechanism relies on CPU load
> > > metrics obtained from the Java OperatingSystemMXBean. However, in some
> > > environments (notably when running in cloud platforms like Google Cloud
> > > Platform - GCP), this metric inaccurately reports CPU usage, causing the
> > > circuit breaker to trip unnecessarily. Here is the observed issue, root
> > > cause, supporting references, and a diagnostic utility used to
> > > investigate the problem.
> > >
> > > Solr’s CPU circuit breaker is using
> > > com.sun.management.OperatingSystemMXBean.getSystemCpuLoad() to monitor
> > > CPU usage. This metric has been observed to return misleading values:
> > >
> > > - GCP monitoring shows average Solr CPU usage around *25-30%*.
> > > - OperatingSystemMXBean.getSystemCpuLoad() consistently reports values
> > >   *close to 1.0 (100%)*.
> > > - As a result, Solr’s CPU circuit breaker falsely assumes high load and
> > >   prematurely *trips*, potentially impacting service availability or
> > >   throttling requests unnecessarily.
> > >
> > > This discrepancy arises from a change in how CPU metrics are calculated
> > > in the JDK.
> > >
> > > cgroup configs
> > >
> > > CPUUsageNSec=378033177304000
> > > CPUAccounting=yes
> > > CPUWeight=[not set]
> > > StartupCPUWeight=[not set]
> > > CPUShares=[not set]
> > > StartupCPUShares=[not set]
> > > CPUQuotaPerSecUSec=infinity
> > > CPUQuotaPeriodUSec=infinity
> > > LimitCPU=infinity
> > > LimitCPUSoft=infinity
> > > CPUSchedulingPolicy=0
> > > CPUSchedulingPriority=0
> > > CPUAffinityFromNUMA=no
> > > CPUSchedulingResetOnFork=no
> > >
> > > *Relevant JDK Bugs and Fixes*
> > >
> > > *JDK-8248215*
> > >
> > > - *Title*: Improve OperatingSystemMXBean API to report CPU load
> > >   information for containers
> > > - *Link*: JDK-8248215 <https://bugs.openjdk.org/browse/JDK-8248215>
> > > - *Summary*: Introduced enhancements to better support reporting of CPU
> > >   metrics inside containerized environments.
> > >
> > > *JDK-8269851*
> > >
> > > - *Title*: OperatingSystemMXBean getSystemCpuLoad reports incorrect
> > >   value inside a container
> > > - *Link*: JDK-8269851 <https://bugs.openjdk.org/browse/JDK-8269851>
> > > - *Commit*: Github PR
> > >   <https://github.com/openjdk/jdk/commit/25f00d787cf56f6cdca6949115d04e7d8e675554#diff-2bc4c3408fc6fae6e133b8ffd644b933dcbe372cf249547d4c49ed94444c9735R45-R282>
> > > - *Impact*: Introduced changes that affect the internal behavior of
> > >   getSystemCpuLoad() and getProcessCpuLoad(). Post this change, the
> > >   reported CPU usage may not correctly reflect real CPU usage inside
> > >   containers.
> > >
> > > To verify the discrepancy, I added a class within Solr to print out
> > > real-time CPU load metrics as seen by the JVM.
> > >
> > > *MonitorCpu.java*
> > >
> > > // To compile:
> > > // javac /path/to/solr/core/src/java/org/apache/solr/util/circuitbreaker/MonitorCpu.java
> > > // To run:
> > > // java -cp /path/to/solr/core/src/java org.apache.solr.util.circuitbreaker.MonitorCpu
> > >
> > > package org.apache.solr.util.circuitbreaker;
> > >
> > > import com.sun.management.OperatingSystemMXBean;
> > > import java.lang.management.ManagementFactory;
> > >
> > > public class MonitorCpu {
> > >     public static void main(String[] args) {
> > >         OperatingSystemMXBean osBean =
> > >             (OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
> > >
> > >         while (true) {
> > >             double cpuLoad = osBean.getSystemCpuLoad(); // or getProcessCpuLoad()
> > >             System.out.printf("Current CPU load: %.2f%n", cpuLoad);
> > >
> > >             try {
> > >                 Thread.sleep(1000); // Pause to reduce output rate
> > >             } catch (InterruptedException e) {
> > >                 Thread.currentThread().interrupt();
> > >                 return; // Stop sampling instead of spinning once interrupted
> > >             }
> > >         }
> > >     }
> > > }
> > >
> > > *Observations from Execution*
> > >
> > > - The printed cpuLoad value often fluctuates near *1.0*, despite actual
> > >   CPU load being far lower.
> > > - Confirms the mismatch between Java-reported CPU metrics and actual
> > >   usage observed via system tools or GCP monitoring.
> > >
> > > *Implications for Solr*
> > >
> > > - Solr's CPU circuit breaker, relying on these metrics, is *misled into
> > >   believing the node is under high load*.
> > > - Can cause *premature degradation* or *request throttling*, even when
> > >   system resources are sufficient.
> > > - Especially critical in *containerized* or *cloud-native* deployments
> > >   (e.g., Kubernetes, GKE), where resource quotas and visibility differ
> > >   from traditional environments.
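One way to cross-check the JVM-reported value against the kernel's own
counters, independently of GCP monitoring, is to sample the aggregate "cpu"
line of /proc/stat twice and compute utilization from the deltas. Below is a
rough sketch, assuming Linux with a readable /proc/stat. ProcStatCpu and its
methods are hypothetical names for illustration and are not part of Solr.
Inside a container, /proc/stat generally reflects the whole host or VM rather
than any cgroup limit, so this is a VM-wide number.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical diagnostic, not part of Solr: derives CPU utilization
// from two samples of the aggregate "cpu" line in /proc/stat (Linux only).
class ProcStatCpu {

    /** Parses a "cpu ..." line and returns {busy, total} in clock ticks. */
    public static long[] parse(String cpuLine) {
        String[] f = cpuLine.trim().split("\\s+");
        // f[0] is "cpu"; then: user nice system idle iowait irq softirq steal guest guest_nice.
        // Sum through "steal" only: guest time is already counted in user/nice.
        int last = Math.min(f.length - 1, 8);
        long total = 0;
        for (int i = 1; i <= last; i++) {
            total += Long.parseLong(f[i]);
        }
        // Treat both idle and iowait as "not busy".
        long idle = Long.parseLong(f[4]) + (f.length > 5 ? Long.parseLong(f[5]) : 0);
        return new long[] {total - idle, total};
    }

    /** Fraction of time spent busy between two samples, or -1.0 if no ticks elapsed. */
    public static double utilization(long[] before, long[] after) {
        long total = after[1] - before[1];
        if (total <= 0) {
            return -1.0;
        }
        return (double) (after[0] - before[0]) / total;
    }

    static String readCpuLine() throws IOException {
        try (var lines = Files.lines(Path.of("/proc/stat"))) {
            // "cpu " (with trailing space) matches the aggregate line, not cpu0, cpu1, ...
            return lines.filter(l -> l.startsWith("cpu ")).findFirst().orElseThrow();
        }
    }

    public static void main(String[] args) throws Exception {
        long[] before = parse(readCpuLine());
        Thread.sleep(1000); // one-second sampling window
        long[] after = parse(readCpuLine());
        System.out.printf("Utilization from /proc/stat: %.2f%n", utilization(before, after));
    }
}
```

Running this next to MonitorCpu on the same node should show whether the JVM
value and the kernel counters disagree over the same one-second window.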
> > >
> > > Is anyone else facing this issue with the Solr CPU circuit breaker?
> > >
> > > Should we change the metric used in Solr circuit breakers?
> > >
> > > Can we divide the current metric by the number of available processors
> > > (Runtime.getRuntime().availableProcessors()) to get the correct value?
>
>
> --
> http://www.needhamsoftware.com (work)
> https://a.co/d/b2sZLD9 (my fantasy fiction book)
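On the last question above (dividing by availableProcessors()), here is a
minimal sketch of what that normalization could look like. It is an
illustration only, not Solr's implementation. Per the Javadoc,
getSystemCpuLoad() is documented to return a value already in [0.0, 1.0], so
the sketch applies the division to getSystemLoadAverage() instead and prints
it next to getCpuLoad(), the JDK 14+ replacement for the deprecated
getSystemCpuLoad(). The class name CpuMetricComparison and the helper
perProcessor are made up for this example.

```java
import com.sun.management.OperatingSystemMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical comparison tool, not part of Solr.
class CpuMetricComparison {

    /**
     * Load average divided by processor count.
     * Returns -1.0 when the load average is unavailable (negative) or the count is invalid.
     */
    public static double perProcessor(double loadAverage, int processors) {
        if (loadAverage < 0 || processors <= 0) {
            return -1.0;
        }
        return loadAverage / processors;
    }

    public static void main(String[] args) {
        // The cast assumes a HotSpot-style JVM that exposes the com.sun.management bean.
        OperatingSystemMXBean os =
            (OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
        int processors = Runtime.getRuntime().availableProcessors();
        double loadAvg = os.getSystemLoadAverage(); // 1-minute load average, negative if unavailable

        System.out.printf("availableProcessors:        %d%n", processors);
        System.out.printf("getCpuLoad():               %.2f%n", os.getCpuLoad());
        System.out.printf("getProcessCpuLoad():        %.2f%n", os.getProcessCpuLoad());
        System.out.printf("getSystemLoadAverage():     %.2f%n", loadAvg);
        System.out.printf("load average per processor: %.2f%n", perProcessor(loadAvg, processors));
    }
}
```

If the per-processor load average tracks the 25-30% seen in GCP monitoring
while getCpuLoad() stays near 1.0, that would help narrow down which of the
two metrics is misbehaving on this JDK.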