reports the value close to 1.0, and Solr rejects the requests with 429 even though actual CPU utilization is lower.
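For anyone who wants to compare the numbers side by side on their own node, here is a minimal sketch that prints the JVM's view of system CPU load, process CPU load, and the OS load average next to the core count. The class name CpuMetricsSnapshot and the helper describe() are hypothetical and not part of Solr. It assumes a HotSpot-based JDK 14 or newer, where getCpuLoad() is available as the replacement for the deprecated getSystemCpuLoad().

```java
import com.sun.management.OperatingSystemMXBean;
import java.lang.management.ManagementFactory;
import java.util.Locale;

// Hypothetical diagnostic class, not part of Solr.
public class CpuMetricsSnapshot {

    // The JVM returns a negative number when a metric is unavailable.
    static String describe(double value) {
        return value < 0 ? "n/a" : String.format(Locale.ROOT, "%.2f", value);
    }

    public static void main(String[] args) throws InterruptedException {
        // The cast assumes a JVM that exposes the com.sun.management extension.
        OperatingSystemMXBean os =
            (OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
        int cores = Runtime.getRuntime().availableProcessors();

        // Take a few samples; the first reading may not be meaningful yet.
        for (int i = 0; i < 3; i++) {
            System.out.println(
                "cores=" + cores
                    + " cpuLoad=" + describe(os.getCpuLoad())
                    + " processCpuLoad=" + describe(os.getProcessCpuLoad())
                    + " loadAverage=" + describe(os.getSystemLoadAverage()));
            Thread.sleep(1000);
        }
    }
}
```

Running it on the affected node while the GCP dashboard is open should show whether cpuLoad alone is off, or whether the load average is also high relative to the core count.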
On Wed, Jul 30, 2025 at 12:04 AM PUNEET SHARMA <puneetsharmaps...@gmail.com> wrote:

> I ran this on:
>
> openjdk version "21.0.5" 2024-10-15 LTS
> OpenJDK Runtime Environment (Red_Hat-21.0.5.0.11-1) (build 21.0.5+11-LTS)
> OpenJDK 64-Bit Server VM (Red_Hat-21.0.5.0.11-1) (build 21.0.5+11-LTS,
> mixed mode, sharing)
>
> Agreed, the JDK reports are fixed in Java 18, but it seems the issue is
> not yet resolved: the metric is still reporting the wrong value.
>
> I ran tests on a 4 CPU core machine where GCP monitoring shows average
> CPU usage around 25-30% [no restriction on cpuQuota], and the metric I
> pulled using
>
> http://search-solr:8983/solr/admin/metrics?prefix=os.systemCpuLoad&wt=json
>
> reports the value close and Solr rejects the requests with 429.
>
> On Tue, Jul 29, 2025 at 7:13 PM Gus Heck <gus.h...@gmail.com> wrote:
>
>> Looks like most of those JDK reports are fixed in Java 18. What version
>> was the OP on?
>>
>> On Tue, Jul 29, 2025 at 7:49 AM Jason Gerlowski <gerlowsk...@gmail.com>
>> wrote:
>>
>> > Hi Puneet,
>> >
>> > It certainly looks like there are a lot of bugs in load-average
>> > reporting - I never realized it was so shaky in those containerized
>> > environments! Thanks for the thorough writeup.
>> >
>> > The question is what to do about it. On the one hand "load average"
>> > is only one of several circuit breakers that Solr offers, and it's
>> > likely still providing value for folks who happen to run in
>> > non-containerized environments. Maybe the best thing to do is to
>> > update our docs to highlight these limitations, and suggest folks
>> > running in Kubernetes, etc. steer clear of the load-avg circuit
>> > breaker?
>> >
>> > Would you be willing to file a JIRA ticket to summarize the problem
>> > and propose how it might be addressed?
>> >
>> > > OperatingSystemMXBean.getSystemCpuLoad() consistently reports values
>> > > *close to 1.0 (100%)*.
>> >
>> > You may know this already, but to highlight it for others: a CPU Load
>> > of 1.0 doesn't imply utilization of 100%.
>> >
>> > CPU Load, or load-average, is a measure of how many processes are
>> > currently using or waiting for a CPU. It's a distinct metric from CPU
>> > utilization, which measures what percentage of time your CPUs are
>> > utilized.
>> >
>> > So having a CPU load of 1.0 and utilization of 20-30% isn't necessarily
>> > wrong or contradictory. It may be correct. (I would say "Those
>> > values are correct", if not for all of the issue-tracker links you
>> > shared above, which make a compelling theoretical case.)
>> >
>> > Best,
>> >
>> > Jason
>> >
>> > On Mon, Jul 28, 2025 at 2:47 PM PUNEET SHARMA
>> > <puneetsharmaps...@gmail.com> wrote:
>> > >
>> > > Hi Team,
>> > >
>> > > Currently Solr's CPU circuit breaker mechanism relies on CPU load
>> > > metrics obtained from the Java OperatingSystemMXBean. However, in
>> > > some environments (notably when running in cloud platforms like
>> > > Google Cloud Platform - GCP), this metric inaccurately reports CPU
>> > > usage, causing the circuit breaker to trip unnecessarily. Here is the
>> > > observed issue, root cause, supporting references, and a diagnostic
>> > > utility used to investigate the problem.
>> > >
>> > > Solr’s CPU circuit breaker is using
>> > > com.sun.management.OperatingSystemMXBean.getSystemCpuLoad() to monitor
>> > > CPU usage. These metrics have been observed to return misleading
>> > > values:
>> > >
>> > > - GCP monitoring shows average Solr CPU usage around *25-30%*.
>> > > - OperatingSystemMXBean.getSystemCpuLoad() consistently reports
>> > >   values *close to 1.0 (100%)*.
>> > > - As a result, Solr’s CPU circuit breaker falsely assumes high load
>> > >   and prematurely *trips*, potentially impacting service availability
>> > >   or throttling requests unnecessarily.
>> > >
>> > > This discrepancy arises from a change in how CPU metrics are
>> > > calculated in the JDK.
>> > >
>> > > cgroup configs
>> > >
>> > > CPUUsageNSec=378033177304000
>> > > CPUAccounting=yes
>> > > CPUWeight=[not set]
>> > > StartupCPUWeight=[not set]
>> > > CPUShares=[not set]
>> > > StartupCPUShares=[not set]
>> > > CPUQuotaPerSecUSec=infinity
>> > > CPUQuotaPeriodUSec=infinity
>> > > LimitCPU=infinity
>> > > LimitCPUSoft=infinity
>> > > CPUSchedulingPolicy=0
>> > > CPUSchedulingPriority=0
>> > > CPUAffinityFromNUMA=no
>> > > CPUSchedulingResetOnFork=no
>> > >
>> > > *Relevant JDK Bugs and Fixes*
>> > >
>> > > *JDK-8248215*
>> > >
>> > > - *Title*: Improve OperatingSystemMXBean API to report CPU load
>> > >   information for containers
>> > > - *Link*: JDK-8248215 <https://bugs.openjdk.org/browse/JDK-8248215>
>> > > - *Summary*: Introduced enhancements to better support reporting of
>> > >   CPU metrics inside containerized environments.
>> > >
>> > > *JDK-8269851*
>> > >
>> > > - *Title*: OperatingSystemMXBean getSystemCpuLoad reports incorrect
>> > >   value inside a container
>> > > - *Link*: JDK-8269851 <https://bugs.openjdk.org/browse/JDK-8269851>
>> > > - *Commit*: Github PR
>> > >   <https://github.com/openjdk/jdk/commit/25f00d787cf56f6cdca6949115d04e7d8e675554#diff-2bc4c3408fc6fae6e133b8ffd644b933dcbe372cf249547d4c49ed94444c9735R45-R282>
>> > > - *Impact*: Introduced changes that affect the internal behavior of
>> > >   getSystemCpuLoad() and getProcessCpuLoad(). Post this change, the
>> > >   reported CPU usage may not correctly reflect real CPU usage inside
>> > >   containers.
>> > >
>> > > To verify the discrepancy, I added a class within Solr to print out
>> > > real-time CPU load metrics as seen by the JVM.
>> > >
>> > > *MonitorCpu.java*
>> > >
>> > > // To compile:
>> > > // javac /path/to/solr/core/src/java/org/apache/solr/util/circuitbreaker/MonitorCpu.java
>> > > // To run:
>> > > // java -cp /path/to/solr/core/src/java org.apache.solr.util.circuitbreaker.MonitorCpu
>> > >
>> > > package org.apache.solr.util.circuitbreaker;
>> > >
>> > > import com.sun.management.OperatingSystemMXBean;
>> > > import java.lang.management.ManagementFactory;
>> > >
>> > > public class MonitorCpu {
>> > >   public static void main(String[] args) {
>> > >     OperatingSystemMXBean osBean =
>> > >         (OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
>> > >
>> > >     while (true) {
>> > >       double cpuLoad = osBean.getSystemCpuLoad(); // or getProcessCpuLoad()
>> > >       System.out.printf("Current CPU load: %.2f%n", cpuLoad);
>> > >
>> > >       try {
>> > >         Thread.sleep(1000); // Pause to reduce output rate
>> > >       } catch (InterruptedException e) {
>> > >         Thread.currentThread().interrupt();
>> > >         return; // Stop sampling instead of spinning with the interrupt flag set
>> > >       }
>> > >     }
>> > >   }
>> > > }
>> > >
>> > > *Observations from Execution*
>> > >
>> > > - The printed cpuLoad value often fluctuates near *1.0*, despite
>> > >   actual CPU load being far lower.
>> > > - Confirms the mismatch between Java-reported CPU metrics and actual
>> > >   usage observed via system tools or GCP monitoring.
>> > >
>> > > *Implications for Solr*
>> > >
>> > > - Solr's CPU circuit breaker, relying on these metrics, is *misled
>> > >   into believing the node is under high load*.
>> > > - Can cause *premature degradation* or *request throttling*, even
>> > >   when system resources are sufficient.
>> > > - Especially critical in *containerized* or *cloud-native*
>> > >   deployments (e.g., Kubernetes, GKE), where resource quotas and
>> > >   visibility differ from traditional environments.
>> > >
>> > > Is anyone else facing this issue with the Solr CPU circuit breaker?
>> > >
>> > > Should we change the metric used in Solr circuit breakers?
>> > >
>> > > Can we divide the current metric by available processors
>> > > (Runtime.getRuntime().availableProcessors()) to get the correct
>> > > value?
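On the last question above: getSystemCpuLoad() is documented to return a value in the [0.0, 1.0] interval for the whole system, so dividing it by the processor count would understate usage rather than correct it. Dividing does make sense for the OS load average from getSystemLoadAverage(), which counts runnable entities and can legitimately exceed 1.0 on a multi-core machine. A minimal sketch of that normalization, assuming it is only used for comparison and not as Solr's actual circuit-breaker logic (the class LoadPerCore and the method perCore are hypothetical names):

```java
import java.lang.management.ManagementFactory;

// Hypothetical helper, not part of Solr.
public class LoadPerCore {

    // Divides a load average by the core count; returns -1.0 when the inputs are unusable
    // (getSystemLoadAverage() itself returns a negative value when unavailable).
    static double perCore(double loadAverage, int cores) {
        if (loadAverage < 0 || cores <= 0) {
            return -1.0;
        }
        return loadAverage / cores;
    }

    public static void main(String[] args) {
        double loadAverage =
            ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage();
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println("load average per core: " + perCore(loadAverage, cores));
    }
}
```

On the 4-core machine described in the thread, a load average of 1.0 would come out as 0.25 per core, which is in the same range as the 25-30% utilization that GCP reports.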