Looks like most of those JDK reports are fixed in Java 18. What version was
the OP on?
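
In case it helps answer that, here is a minimal sketch that prints the running
JDK's feature version next to the values the MXBean returns. The class name
JdkCpuReport is made up for illustration and is not part of Solr, and the cast
assumes a HotSpot-based JDK 10 or newer.

```java
import com.sun.management.OperatingSystemMXBean;
import java.lang.management.ManagementFactory;
import java.util.Locale;

// Hypothetical diagnostic for illustration; not part of Solr.
public class JdkCpuReport {

    // Formats one report line; kept separate from main so it is easy to reuse.
    static String report(int featureVersion, double systemCpuLoad,
                         double loadAverage, int processors) {
        return String.format(Locale.ROOT,
            "jdk=%d systemCpuLoad=%.2f loadAverage=%.2f processors=%d",
            featureVersion, systemCpuLoad, loadAverage, processors);
    }

    public static void main(String[] args) {
        OperatingSystemMXBean osBean =
            (OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();

        // getSystemCpuLoad() is deprecated since JDK 14 but is the call
        // discussed in this thread, so it is used here on purpose.
        System.out.println(report(
            Runtime.version().feature(),
            osBean.getSystemCpuLoad(),
            osBean.getSystemLoadAverage(),
            osBean.getAvailableProcessors()));
    }
}
```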

On Tue, Jul 29, 2025 at 7:49 AM Jason Gerlowski <gerlowsk...@gmail.com>
wrote:

> Hi Puneet,
>
> It certainly looks like there are a lot of bugs in load-average
> reporting - I never realized it was so shaky in those containerized
> environments!  Thanks for the thorough writeup.
>
> The question is what to do about it.  On the one hand "load average"
> is only one of several circuit breakers that Solr offers, and it's
> likely still providing value for folks who happen to run in
> non-containerized environments.  Maybe the best thing to do is to
> update our docs to highlight these limitations, and suggest folks
> running in Kubernetes, etc. steer clear of the load-avg circuit
> breaker?
>
> Would you be willing to file a JIRA ticket to summarize the problem
> and propose how it might be addressed?
>
> > OperatingSystemMXBean.getSystemCpuLoad() consistently reports values
> > *close to 1.0 (100%)*.
>
> You may know this already, but to highlight it for others: a CPU Load
> of 1.0 doesn't imply utilization of 100%.
>
> CPU Load, or load-average, is a measure of how many processes are
> currently using or waiting for a CPU.  It's a distinct metric from CPU
> utilization, which measures what percentage of time your CPUs are
> utilized.
>
> So having a CPU load of 1.0 and utilization of 20-30% isn't necessarily
> wrong or contradictory.  It may be correct.  (I would say "Those
> values are correct", if not for all of the issue-tracker links you
> shared above, which make a compelling theoretical case.)
>
> Best,
>
> Jason
>
> On Mon, Jul 28, 2025 at 2:47 PM PUNEET SHARMA
> <puneetsharmaps...@gmail.com> wrote:
> >
> > Hi Team,
> >
> > Currently Solr's CPU circuit breaker mechanism relies on CPU load
> > metrics obtained from the Java OperatingSystemMXBean. However, in some
> > environments (notably when running in cloud platforms like Google Cloud
> > Platform - GCP), this metric inaccurately reports CPU usage, causing the
> > circuit breaker to trip unnecessarily. Here is the observed issue, root
> > cause, supporting references, and a diagnostic utility used to
> > investigate the problem.
> >
> > Solr’s CPU circuit breaker is using
> > com.sun.management.OperatingSystemMXBean.getSystemCpuLoad() to monitor
> > CPU usage. This metric has been observed to return misleading values:
> >
> >    - GCP monitoring shows average Solr CPU usage around *25-30%*.
> >    - OperatingSystemMXBean.getSystemCpuLoad() consistently reports values
> >      *close to 1.0 (100%)*.
> >    - As a result, Solr’s CPU circuit breaker falsely assumes high load and
> >      prematurely *trips*, potentially impacting service availability or
> >      throttling requests unnecessarily.
> >
> > This discrepancy arises from a change in how CPU metrics are calculated
> > in the JDK.
> >
> > cgroup configs
> >
> > CPUUsageNSec=378033177304000
> > CPUAccounting=yes
> > CPUWeight=[not set]
> > StartupCPUWeight=[not set]
> > CPUShares=[not set]
> > StartupCPUShares=[not set]
> > CPUQuotaPerSecUSec=infinity
> > CPUQuotaPeriodUSec=infinity
> > LimitCPU=infinity
> > LimitCPUSoft=infinity
> > CPUSchedulingPolicy=0
> > CPUSchedulingPriority=0
> > CPUAffinityFromNUMA=no
> > CPUSchedulingResetOnFork=no
> >
> > *Relevant JDK Bugs and Fixes*
> >
> > *JDK-8248215*
> >
> >    - *Title*: Improve OperatingSystemMXBean API to report CPU load
> >      information for containers
> >    - *Link*: JDK-8248215 <https://bugs.openjdk.org/browse/JDK-8248215>
> >    - *Summary*: Introduced enhancements to better support reporting of CPU
> >      metrics inside containerized environments.
> >
> > *JDK-8269851*
> >
> >    - *Title*: OperatingSystemMXBean getSystemCpuLoad reports incorrect
> >      value inside a container
> >    - *Link*: JDK-8269851 <https://bugs.openjdk.org/browse/JDK-8269851>
> >    - *Commit*: GitHub PR
> >      <https://github.com/openjdk/jdk/commit/25f00d787cf56f6cdca6949115d04e7d8e675554#diff-2bc4c3408fc6fae6e133b8ffd644b933dcbe372cf249547d4c49ed94444c9735R45-R282>
> >    - *Impact*: Introduced changes that affect the internal behavior of
> >      getSystemCpuLoad() and getProcessCpuLoad(). After this change, the
> >      reported CPU usage may not correctly reflect real CPU usage inside
> >      containers.
> >
> >
> > To verify the discrepancy, I added a class within Solr to print out
> > real-time CPU load metrics as seen by the JVM.
> >
> > *MonitorCpu.java*
> >
> > // To compile:
> > //   javac /path/to/solr/core/src/java/org/apache/solr/util/circuitbreaker/MonitorCpu.java
> > // To run:
> > //   java -cp /path/to/solr/core/src/java org.apache.solr.util.circuitbreaker.MonitorCpu
> >
> > package org.apache.solr.util.circuitbreaker;
> >
> > import com.sun.management.OperatingSystemMXBean;
> > import java.lang.management.ManagementFactory;
> >
> > public class MonitorCpu {
> >     public static void main(String[] args) {
> >         OperatingSystemMXBean osBean =
> >             (OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
> >
> >         while (true) {
> >             // Or use getProcessCpuLoad() for the JVM process alone.
> >             double cpuLoad = osBean.getSystemCpuLoad();
> >             System.out.printf("Current CPU load: %.2f%n", cpuLoad);
> >
> >             try {
> >                 Thread.sleep(1000); // Pause to reduce output rate
> >             } catch (InterruptedException e) {
> >                 Thread.currentThread().interrupt();
> >                 return; // Stop instead of spinning once interrupted
> >             }
> >         }
> >     }
> > }
> >
> > *Observations from Execution*
> >
> >    - The printed cpuLoad value often fluctuates near *1.0*, despite actual
> >      CPU load being far lower.
> >    - This confirms the mismatch between Java-reported CPU metrics and actual
> >      usage observed via system tools or GCP monitoring.
> >
> > *Implications for Solr*
> >
> >    - Solr's CPU circuit breaker, relying on these metrics, is *misled into
> >      believing the node is under high load*.
> >    - This can cause *premature degradation* or *request throttling*, even when
> >      system resources are sufficient.
> >    - This is especially critical in *containerized* or *cloud-native* deployments
> >      (e.g., Kubernetes, GKE), where resource quotas and visibility differ from
> >      traditional environments.
> >
> >
> >
> > Is anyone else facing this issue with the Solr CPU circuit breaker?
> >
> > Should we change the metric used in Solr's circuit breakers?
> >
> > Can we divide the current metric by the number of available processors
> > (Runtime.getRuntime().availableProcessors()) to get the correct value?
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@solr.apache.org
> For additional commands, e-mail: dev-h...@solr.apache.org
>
>
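
On the last question in the quoted message, a rough sketch of what dividing by
the processor count could look like. The class and method names are made up
for illustration and are not Solr code. Note the assumption about which metric
is being divided: the division makes sense for the load average from
getSystemLoadAverage(), which counts runnable tasks and can exceed 1.0, whereas
getSystemCpuLoad() is documented as already returning a value in [0.0, 1.0].

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical helper for illustration; not part of Solr.
public class PerCoreLoad {

    // Returns the load average per processor, or -1.0 when the load average
    // is unavailable (the MXBean reports a negative value in that case).
    static double perCore(double loadAverage, int processors) {
        if (loadAverage < 0 || processors <= 0) {
            return -1.0;
        }
        return loadAverage / processors;
    }

    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        double value = perCore(os.getSystemLoadAverage(), os.getAvailableProcessors());
        System.out.printf("Load average per processor: %.2f%n", value);
    }
}
```

As far as I know, inside a container the processor count follows the
container's CPU limit on container-aware JDKs while the load average is
typically host-wide, so even this ratio may be off there.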

-- 
http://www.needhamsoftware.com (work)
https://a.co/d/b2sZLD9 (my fantasy fiction book)
