reports a value close to 1.0, and Solr rejects the requests with 429 even
though actual CPU utilization is lower.
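
For anyone who wants to cross-check the two numbers on the same node, here is a minimal sketch, not Solr code. The class name CpuCompare and its helper methods are made up for illustration, and it assumes a Linux host where /proc/stat is readable. It prints the JVM-reported system CPU load next to a utilization figure computed from /proc/stat deltas over the same one-second window.

```java
import com.sun.management.OperatingSystemMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical diagnostic helper; assumes Linux and a readable /proc/stat.
final class CpuCompare {

    // Parses the aggregate "cpu" line of /proc/stat into its jiffy counters
    // (user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice).
    static long[] parseCpuLine(String line) {
        String[] parts = line.trim().split("\\s+");
        long[] counters = new long[parts.length - 1];
        for (int i = 1; i < parts.length; i++) {
            counters[i - 1] = Long.parseLong(parts[i]);
        }
        return counters;
    }

    // Fraction of non-idle time between two samples. Only the first eight
    // counters are summed, because guest time is already included in user time.
    // Idle and iowait (indices 3 and 4) are both treated as idle.
    static double utilization(long[] prev, long[] curr) {
        int n = Math.min(8, Math.min(prev.length, curr.length));
        long totalDelta = 0;
        for (int i = 0; i < n; i++) {
            totalDelta += curr[i] - prev[i];
        }
        long idleDelta = (curr[3] - prev[3]) + (curr[4] - prev[4]);
        return totalDelta <= 0 ? 0.0 : (double) (totalDelta - idleDelta) / totalDelta;
    }

    static long[] sample() throws IOException {
        return parseCpuLine(Files.readAllLines(Path.of("/proc/stat")).get(0));
    }

    public static void main(String[] args) throws Exception {
        OperatingSystemMXBean os =
            (OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
        long[] prev = sample();
        for (int i = 0; i < 10; i++) {
            Thread.sleep(1000);
            long[] curr = sample();
            // getSystemCpuLoad() is the method discussed in this thread.
            System.out.printf("jvm=%.2f procstat=%.2f%n",
                os.getSystemCpuLoad(), utilization(prev, curr));
            prev = curr;
        }
    }
}
```

If the two columns diverge consistently on an otherwise quiet node, that would support the suspicion that the JVM-side value is the problem rather than the load itself. Note that /proc/stat reflects the whole host, so inside a container with a CPU quota the comparison needs more care.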

On Wed, Jul 30, 2025 at 12:04 AM PUNEET SHARMA <puneetsharmaps...@gmail.com>
wrote:

> I ran this on
>
> openjdk version "21.0.5" 2024-10-15 LTS
>
> OpenJDK Runtime Environment (Red_Hat-21.0.5.0.11-1) (build 21.0.5+11-LTS)
>
> OpenJDK 64-Bit Server VM (Red_Hat-21.0.5.0.11-1) (build 21.0.5+11-LTS,
> mixed mode, sharing)
>
>
> Agreed that the JDK reports are fixed in Java 18, but it seems the issue is
> not yet resolved: the metric is still reporting the wrong value.
>
>
> I ran the tests on a machine with 4 CPU cores, where GCP monitoring
> shows average CPU usage of around 25-30% (no restriction on cpuQuota), and
> the metric I pulled using
>
> http://search-solr:8983/solr/admin/metrics?prefix=os.systemCpuLoad&wt=json
>
> reports a value close to 1.0, and Solr rejects the requests with 429.
>
> On Tue, Jul 29, 2025 at 7:13 PM Gus Heck <gus.h...@gmail.com> wrote:
>
>> Looks like most of those JDK reports are fixed in Java 18. What version was
>> the OP on?
>>
>> On Tue, Jul 29, 2025 at 7:49 AM Jason Gerlowski <gerlowsk...@gmail.com>
>> wrote:
>>
>> > Hi Puneet,
>> >
>> > It certainly looks like there are a lot of bugs in load-average
>> > reporting - I never realized it was so shaky in those containerized
>> > environments!  Thanks for the thorough writeup.
>> >
>> > The question is what to do about it.  On the one hand "load average"
>> > is only one of several circuit breakers that Solr offers, and it's
>> > likely still providing value for folks who happen to run in
>> > non-containerized environments.  Maybe the best thing to do is to
>> > update our docs to highlight these limitations, and suggest folks
>> > running in Kubernetes, etc. steer clear of the load-avg circuit
>> > breaker?
>> >
>> > Would you be willing to file a JIRA ticket to summarize the problem
>> > and propose how it might be addressed?
>> >
>> > > OperatingSystemMXBean.getSystemCpuLoad() consistently reports values
>> > > *close to 1.0 (100%)*.
>> >
>> > You may know this already, but to highlight it for others: a CPU Load
>> > of 1.0 doesn't imply utilization of 100%.
>> >
>> > CPU Load, or load-average, is a measure of how many processes are
>> > currently using or waiting for a CPU.  It's a distinct metric from CPU
>> > utilization, which measures what percentage of time your CPUs are
>> > utilized.
>> >
>> > So having a CPU load of 1.0 and utilization of 20-30% isn't necessarily
>> > wrong or contradictory.  It may be correct.  (I would say "Those
>> > values are correct", if not for all of the issue-tracker links you
>> > shared above, which make a compelling theoretical case.)
>> >
>> > Best,
>> >
>> > Jason
>> >
>> > On Mon, Jul 28, 2025 at 2:47 PM PUNEET SHARMA
>> > <puneetsharmaps...@gmail.com> wrote:
>> > >
>> > > Hi Team,
>> > >
>> > > Currently Solr's CPU circuit breaker mechanism relies on CPU load
>> > > metrics obtained from the Java OperatingSystemMXBean. However, in some
>> > > environments (notably when running on cloud platforms like Google Cloud
>> > > Platform - GCP), this metric inaccurately reports CPU usage, causing the
>> > > circuit breaker to trip unnecessarily. Here is the observed issue, root
>> > > cause, supporting references, and a diagnostic utility used to investigate
>> > > the problem.
>> > >
>> > > Solr’s CPU circuit breaker uses
>> > > com.sun.management.OperatingSystemMXBean.getSystemCpuLoad() to monitor CPU
>> > > usage. This metric has been observed to return misleading values:
>> > >
>> > >    - GCP monitoring shows average Solr CPU usage around *25-30%*.
>> > >    - OperatingSystemMXBean.getSystemCpuLoad() consistently reports values
>> > >      *close to 1.0 (100%)*.
>> > >    - As a result, Solr’s CPU circuit breaker falsely assumes high load and
>> > >      prematurely *trips*, potentially impacting service availability or
>> > >      throttling requests unnecessarily.
>> > >
>> > > This discrepancy arises from a change in how CPU metrics are calculated in
>> > > the JDK.
>> > >
>> > > cgroup configs
>> > >
>> > > CPUUsageNSec=378033177304000
>> > > CPUAccounting=yes
>> > > CPUWeight=[not set]
>> > > StartupCPUWeight=[not set]
>> > > CPUShares=[not set]
>> > > StartupCPUShares=[not set]
>> > > CPUQuotaPerSecUSec=infinity
>> > > CPUQuotaPeriodUSec=infinity
>> > > LimitCPU=infinity
>> > > LimitCPUSoft=infinity
>> > > CPUSchedulingPolicy=0
>> > > CPUSchedulingPriority=0
>> > > CPUAffinityFromNUMA=no
>> > > CPUSchedulingResetOnFork=no
>> > >
>> > > *Relevant JDK Bugs and Fixes*
>> > >
>> > > *JDK-8248215*
>> > >
>> > >    - *Title*: Improve OperatingSystemMXBean API to report CPU load
>> > >      information for containers
>> > >    - *Link*: JDK-8248215 <https://bugs.openjdk.org/browse/JDK-8248215>
>> > >    - *Summary*: Introduced enhancements to better support reporting of CPU
>> > >      metrics inside containerized environments.
>> > >
>> > > *JDK-8269851*
>> > >
>> > >    - *Title*: OperatingSystemMXBean getSystemCpuLoad reports incorrect value
>> > >      inside a container
>> > >    - *Link*: JDK-8269851 <https://bugs.openjdk.org/browse/JDK-8269851>
>> > >    - *Commit*: GitHub commit
>> > >      <https://github.com/openjdk/jdk/commit/25f00d787cf56f6cdca6949115d04e7d8e675554#diff-2bc4c3408fc6fae6e133b8ffd644b933dcbe372cf249547d4c49ed94444c9735R45-R282>
>> > >    - *Impact*: Introduced changes that affect the internal behavior of
>> > >      getSystemCpuLoad() and getProcessCpuLoad(). After this change, the
>> > >      reported CPU usage may not correctly reflect real CPU usage inside
>> > >      containers.
>> > >
>> > >
>> > > To verify the discrepancy, I added a class within Solr to print out real-time
>> > > CPU load metrics as seen by the JVM.
>> > >
>> > > *MonitorCpu.java*
>> > >
>> > > // To compile:
>> > > //   javac /path/to/solr/core/src/java/org/apache/solr/util/circuitbreaker/MonitorCpu.java
>> > > // To run:
>> > > //   java -cp /path/to/solr/core/src/java org.apache.solr.util.circuitbreaker.MonitorCpu
>> > >
>> > > package org.apache.solr.util.circuitbreaker;
>> > >
>> > > import com.sun.management.OperatingSystemMXBean;
>> > > import java.lang.management.ManagementFactory;
>> > >
>> > > public class MonitorCpu {
>> > >     public static void main(String[] args) {
>> > >         // Cast to the com.sun.management variant, which exposes the CPU load getters.
>> > >         OperatingSystemMXBean osBean =
>> > >             (OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
>> > >
>> > >         while (true) {
>> > >             // Value in [0.0, 1.0], or negative if unavailable.
>> > >             // getProcessCpuLoad() is the per-JVM variant.
>> > >             double cpuLoad = osBean.getSystemCpuLoad();
>> > >             System.out.printf("Current CPU load: %.2f%n", cpuLoad);
>> > >
>> > >             try {
>> > >                 Thread.sleep(1000); // Pause to reduce output rate
>> > >             } catch (InterruptedException e) {
>> > >                 // Restore the interrupt flag and stop; otherwise sleep() would
>> > >                 // throw immediately on every iteration and the loop would spin.
>> > >                 Thread.currentThread().interrupt();
>> > >                 return;
>> > >             }
>> > >         }
>> > >     }
>> > > }
>> > >
>> > > *Observations from Execution*
>> > >
>> > >    - The printed cpuLoad value often fluctuates near *1.0*, despite actual
>> > >      CPU load being far lower.
>> > >    - This confirms the mismatch between Java-reported CPU metrics and actual
>> > >      usage observed via system tools or GCP monitoring.
>> > >
>> > > *Implications for Solr*
>> > >
>> > >    - Solr's CPU circuit breaker, relying on these metrics, is *misled into
>> > >      believing the node is under high load*.
>> > >    - This can cause *premature degradation* or *request throttling*, even when
>> > >      system resources are sufficient.
>> > >    - This is especially critical in *containerized* or *cloud-native* deployments
>> > >      (e.g., Kubernetes, GKE), where resource quotas and visibility differ from
>> > >      traditional environments.
>> > >
>> > >
>> > >
>> > > Is anyone else facing this issue with the Solr CPU circuit breaker?
>> > >
>> > > Should we change the metric used in Solr circuit breakers?
>> > >
>> > > Can we divide the current metric by the number of available processors
>> > > (Runtime.getRuntime().availableProcessors()) to get the correct value?
>> >
>>
>> --
>> http://www.needhamsoftware.com (work)
>> https://a.co/d/b2sZLD9 (my fantasy fiction book)
>>
>
