Hi Team,

Currently, Solr's CPU circuit breaker relies on CPU load metrics obtained
from the Java OperatingSystemMXBean. However, in some environments (notably
on cloud platforms such as Google Cloud Platform, GCP), this metric reports
CPU usage inaccurately, causing the circuit breaker to trip unnecessarily.
Below are the observed issue, the root cause, supporting references, and a
diagnostic utility used to investigate the problem.

Solr’s CPU circuit breaker uses
com.sun.management.OperatingSystemMXBean.getSystemCpuLoad() to monitor CPU
usage. This metric has been observed to return misleading values:

   - GCP monitoring shows average Solr CPU usage around *25-30%*.
   - OperatingSystemMXBean.getSystemCpuLoad() consistently reports values
     *close to 1.0 (100%)*.
   - As a result, Solr’s CPU circuit breaker falsely assumes high load and
     prematurely *trips*, potentially impacting service availability or
     throttling requests unnecessarily.
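
To make the effect concrete, here is a minimal sketch of a threshold-style check. It is not Solr's actual circuit breaker code: the class name, the wouldTrip helper, and the 75% threshold are all assumptions chosen for illustration. It only shows why a reading near 1.0 trips a threshold that a 25-30% reading would never reach.

```java
import com.sun.management.OperatingSystemMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical sketch of a threshold-based CPU check; not Solr's actual code. */
final class CpuThresholdSketch {

    /** Returns true when the reported load (0.0 to 1.0) exceeds the threshold percentage. */
    static boolean wouldTrip(double reportedLoad, double thresholdPercent) {
        // The bean returns a negative value when the load is unavailable;
        // this sketch treats that as "do not trip".
        if (reportedLoad < 0) {
            return false;
        }
        return reportedLoad * 100.0 > thresholdPercent;
    }

    public static void main(String[] args) {
        OperatingSystemMXBean osBean =
            (OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
        double threshold = 75.0; // assumed threshold, for illustration only

        double reported = osBean.getSystemCpuLoad();
        System.out.printf("reported=%.2f wouldTrip=%b%n",
            reported, wouldTrip(reported, threshold));

        // The two figures from this report: what the JVM says vs. what GCP shows.
        System.out.println("JVM reading 0.98 trips: " + wouldTrip(0.98, threshold));
        System.out.println("GCP reading 0.28 trips: " + wouldTrip(0.28, threshold));
    }
}
```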

This discrepancy arises from a change in how CPU metrics are calculated in
the JDK.

*cgroup configs*

CPUUsageNSec=378033177304000
CPUAccounting=yes
CPUWeight=[not set]
StartupCPUWeight=[not set]
CPUShares=[not set]
StartupCPUShares=[not set]
CPUQuotaPerSecUSec=infinity
CPUQuotaPeriodUSec=infinity
LimitCPU=infinity
LimitCPUSoft=infinity
CPUSchedulingPolicy=0
CPUSchedulingPriority=0
CPUAffinityFromNUMA=no
CPUSchedulingResetOnFork=no
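
The values above show no CPU quota on the service (CPUQuotaPerSecUSec=infinity). For anyone who wants to cross-check this from inside the JVM's own environment, the following is a rough sketch that reads the cgroup v2 cpu.max file. The path /sys/fs/cgroup/cpu.max is an assumption: it applies to cgroup v2 only, and the file may be absent (for example in the root cgroup or on cgroup v1 hosts). The class and method names are made up for this example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.OptionalDouble;

/** Hypothetical helper: reports the CPU limit implied by a cgroup v2 cpu.max line. */
final class CgroupCpuQuotaSketch {

    /**
     * Parses a cpu.max line of the form "<quota|max> <period>".
     * Returns the limit in CPUs (quota / period), or empty when the quota is
     * "max" (unlimited). A malformed line is also treated as "no known limit".
     */
    static OptionalDouble cpuLimit(String cpuMaxLine) {
        String[] parts = cpuMaxLine.trim().split("\\s+");
        if (parts.length != 2 || parts[0].equals("max")) {
            return OptionalDouble.empty();
        }
        double quotaMicros = Double.parseDouble(parts[0]);
        double periodMicros = Double.parseDouble(parts[1]);
        return OptionalDouble.of(quotaMicros / periodMicros);
    }

    public static void main(String[] args) throws IOException {
        Path cpuMax = Path.of("/sys/fs/cgroup/cpu.max"); // assumed cgroup v2 location
        if (Files.isReadable(cpuMax)) {
            OptionalDouble limit = cpuLimit(Files.readString(cpuMax));
            System.out.println(limit.isPresent()
                ? "CPU limit: " + limit.getAsDouble() + " CPUs"
                : "CPU limit: none (quota is max)");
        } else {
            System.out.println("cpu.max not readable; cgroup v1 or root cgroup?");
        }
        System.out.println("availableProcessors="
            + Runtime.getRuntime().availableProcessors());
    }
}
```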

*Relevant JDK Bugs and Fixes*

*JDK-8248215*

   - *Title*: Improve OperatingSystemMXBean API to report CPU load
     information for containers
   - *Link*: JDK-8248215 <https://bugs.openjdk.org/browse/JDK-8248215>
   - *Summary*: Introduced enhancements to better support reporting of CPU
     metrics inside containerized environments.

*JDK-8269851*

   - *Title*: OperatingSystemMXBean getSystemCpuLoad reports incorrect value
     inside a container
   - *Link*: JDK-8269851 <https://bugs.openjdk.org/browse/JDK-8269851>
   - *Commit*: GitHub commit
     <https://github.com/openjdk/jdk/commit/25f00d787cf56f6cdca6949115d04e7d8e675554#diff-2bc4c3408fc6fae6e133b8ffd644b933dcbe372cf249547d4c49ed94444c9735R45-R282>
   - *Impact*: Changed the internal behavior of getSystemCpuLoad() and
     getProcessCpuLoad(). After this change, the reported CPU usage may not
     correctly reflect real CPU usage inside containers.


To verify the discrepancy, I added a class within Solr that prints the
real-time CPU load metrics as seen by the JVM.

*MonitorCpu.java*

// To compile:
//   javac /path/to/solr/core/src/java/org/apache/solr/util/circuitbreaker/MonitorCpu.java
// To run:
//   java -cp /path/to/solr/core/src/java org.apache.solr.util.circuitbreaker.MonitorCpu

package org.apache.solr.util.circuitbreaker;

import com.sun.management.OperatingSystemMXBean;
import java.lang.management.ManagementFactory;

public class MonitorCpu {
    public static void main(String[] args) {
        // Cast to the com.sun.management extension, which exposes the CPU load methods.
        OperatingSystemMXBean osBean =
            (OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();

        while (true) {
            // Swap in getProcessCpuLoad() to see the JVM's own share instead.
            double cpuLoad = osBean.getSystemCpuLoad();
            System.out.printf("Current CPU load: %.2f%n", cpuLoad);

            try {
                Thread.sleep(1000); // Pause to reduce output rate
            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop; otherwise sleep() would
                // throw again immediately and the loop would spin.
                Thread.currentThread().interrupt();
                return;
            }
        }
    }
}

*Observations from Execution*

   - The printed cpuLoad value often fluctuates near *1.0*, despite actual
     CPU load being far lower.
   - This confirms the mismatch between Java-reported CPU metrics and the
     actual usage observed via system tools or GCP monitoring.
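
As an independent reference point for the comparison above, one option is to compute utilization directly from two samples of the aggregate "cpu" line in /proc/stat. The sketch below is only an illustration under assumptions: it is Linux-only, the class and method names are made up, and it counts idle plus iowait as idle time. Inside a container, /proc/stat usually reflects the host's counters, so the result is host-wide rather than per-container.

```java
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: CPU utilization from two /proc/stat samples (Linux only). */
final class ProcStatSketch {

    /** Parses the aggregate "cpu" line into {busy, total} jiffies. */
    static long[] busyAndTotal(String cpuLine) {
        String[] f = cpuLine.trim().split("\\s+");
        // f[0] is "cpu"; f[1..8] are user nice system idle iowait irq softirq steal.
        // guest and guest_nice (f[9], f[10]) are already included in user/nice.
        long total = 0;
        for (int i = 1; i <= 8 && i < f.length; i++) {
            total += Long.parseLong(f[i]);
        }
        long idle = Long.parseLong(f[4]) + (f.length > 5 ? Long.parseLong(f[5]) : 0);
        return new long[] {total - idle, total};
    }

    /** Fraction of non-idle time between two samples, in the range 0.0 to 1.0. */
    static double utilization(String before, String after) {
        long[] a = busyAndTotal(before);
        long[] b = busyAndTotal(after);
        long totalDelta = b[1] - a[1];
        return totalDelta == 0 ? 0.0 : (double) (b[0] - a[0]) / totalDelta;
    }

    public static void main(String[] args) throws Exception {
        Path stat = Path.of("/proc/stat");
        if (!Files.isReadable(stat)) {
            System.out.println("/proc/stat not readable; this sketch is Linux-only");
            return;
        }
        String first = Files.readAllLines(stat).get(0);
        Thread.sleep(1000); // sampling interval
        String second = Files.readAllLines(stat).get(0);
        System.out.printf("CPU utilization from /proc/stat: %.2f%n",
            utilization(first, second));
    }
}
```

Running this alongside MonitorCpu on the same node gives two numbers sampled over the same interval, which makes the gap easier to quantify.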

*Implications for Solr*

   - Solr's CPU circuit breaker, relying on these metrics, is *misled into
     believing the node is under high load*.
   - This can cause *premature degradation* or *request throttling*, even
     when system resources are sufficient.
   - It is especially critical in *containerized* or *cloud-native*
     deployments (e.g., Kubernetes, GKE), where resource quotas and
     visibility differ from traditional environments.



Is anyone else seeing this issue with Solr's CPU circuit breaker?

Should we change the metric used in Solr's circuit breakers?

Can we divide the current metric by the number of available processors
(Runtime.getRuntime().availableProcessors()) to get the correct value?
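
One caveat on the last question: the Javadoc describes getSystemCpuLoad() as already returning a fraction in the [0.0, 1.0] interval, so dividing it by the processor count would shrink a value that is already normalized. Per-processor division fits a metric like getSystemLoadAverage(), which is not bounded by 1. The sketch below only illustrates that arithmetic; the class and the perProcessor helper are hypothetical, not a proposed patch.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

/** Hypothetical sketch: normalizing the system load average by processor count. */
final class LoadNormalizationSketch {

    /**
     * Divides a load average by the processor count. A negative load average
     * (meaning "unavailable") or a non-positive count is returned unchanged.
     */
    static double perProcessor(double loadAverage, int processors) {
        if (loadAverage < 0 || processors <= 0) {
            return loadAverage;
        }
        return loadAverage / processors;
    }

    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        int cpus = Runtime.getRuntime().availableProcessors();
        // One-minute load average; negative if the platform does not provide it.
        double loadAvg = os.getSystemLoadAverage();
        System.out.printf("loadAverage=%.2f cpus=%d perProcessor=%.2f%n",
            loadAvg, cpus, perProcessor(loadAvg, cpus));
    }
}
```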
