sigram commented on a change in pull request #1737:
URL: https://github.com/apache/lucene-solr/pull/1737#discussion_r472060120



##########
File path: solr/core/src/java/org/apache/solr/core/SolrConfig.java
##########
@@ -229,9 +229,13 @@ private SolrConfig(SolrResourceLoader loader, String name, 
boolean isConfigsetTr
     enableLazyFieldLoading = getBool("query/enableLazyFieldLoading", false);
 
     useCircuitBreakers = getBool("circuitBreaker/useCircuitBreakers", false);
+    cpuCircuitBreakerEnabled = 
getBool("circuitBreaker/cpuCircuitBreakerEnabled", false);

Review comment:
       Can we please simplify these names? they are awfully verbose and 
repeating the parts that are already unique and obvious.

##########
File path: 
solr/core/src/java/org/apache/solr/util/circuitbreaker/MemoryCircuitBreaker.java
##########
@@ -88,13 +95,19 @@ public boolean isTripped() {
 
   @Override
   public String getDebugInfo() {
-    if (seenMemory.get() == 0L || allowedMemory.get() == 0L) {
+    if (seenMemory.get() == 0.0 || allowedMemory.get() == 0.0) {
       log.warn("MemoryCircuitBreaker's monitored values (seenMemory, 
allowedMemory) not set");
     }
 
     return "seenMemory=" + seenMemory.get() + " allowedMemory=" + 
allowedMemory.get();
   }
 
+  @Override
+  public String getErrorMessage() {
+    return "Memory Circuit Breaker Triggered. Seen JVM heap memory usage " + 
seenMemory.get() + " and allocated threshold " +

Review comment:
       Similarly, "greater than allocated threshold" ?

##########
File path: 
solr/core/src/java/org/apache/solr/util/circuitbreaker/CPUCircuitBreaker.java
##########
@@ -0,0 +1,108 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.solr.util.circuitbreaker;
+
+import java.lang.invoke.MethodHandles;
+import java.lang.management.ManagementFactory;
+import java.lang.management.OperatingSystemMXBean;
+
+import org.apache.solr.core.SolrConfig;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * <p>
+ * Tracks current CPU usage and triggers if the specified threshold is 
breached.
+ *
+ * This circuit breaker gets the average CPU load over the last minute and uses
+ * that data to take a decision. Ideally, we should be able to cache the value
+ * locally and only query once the minute has elapsed. However, that will 
introduce
+ * more complexity than the current structure and might not get us major 
performance
+ * wins. If this ever becomes a performance bottleneck, that can be considered.

Review comment:
       Uh ... I see there's still some misunderstanding about this. The call 
itself is directly passed to the native method that invokes stdlib 
`getloadavg`, which in turn reads these values from the /proc pseudo-fs. So, 
the cost is truly minimal and the call doesn't block - if it turns out that 
it's still too costly to call for every request then we can introduce some 
timeout-based caching.
   
   These averages are so called exponentially weighted moving averages, so 
indeed a 1-min average has traces of past load values from outside the 1-min 
window, which helps in smoothing it. This may turn out to be sufficient to 
avoid false positives due to short-term spikes (such as large merges). Linux 
loadavg represents to some degree a combined CPU + disk IO load, so indeed 
intensive IO operations will affect it.
   
   We always have an option to use Codahale Meter to easily calculate 5- and 
15-min EWMAs if it turns out that we're getting too many false positives. Until 
then users can configure higher thresholds, thus reducing the number of false 
positives at the cost of higher contention.

##########
File path: solr/solr-ref-guide/src/circuit-breakers.adoc
##########
@@ -35,33 +35,68 @@ will be disabled globally. Per circuit breaker 
configurations are specified in t
 <useCircuitBreakers>false</useCircuitBreakers>
 ----
 
+This flag acts as the highest authority and global controller of circuit 
breakers. For using specific circuit breakers, each one
+needs to be individually enabled in addition to this flag being enabled.

Review comment:
       If we change the format of the config this section needs to be updated 
too.

##########
File path: 
solr/core/src/java/org/apache/solr/util/circuitbreaker/CPUCircuitBreaker.java
##########
@@ -0,0 +1,116 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.solr.util.circuitbreaker;
+
+import java.lang.invoke.MethodHandles;
+import java.lang.management.ManagementFactory;
+import java.lang.management.OperatingSystemMXBean;
+
+import org.apache.solr.core.SolrConfig;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * <p>
+ * Tracks current CPU usage and triggers if the specified threshold is 
breached.
+ *
+ * This circuit breaker gets the average CPU load over the last minute and uses
+ * that data to take a decision. Ideally, we should be able to cache the value
+ * locally and only query once the minute has elapsed. However, that will 
introduce
+ * more complexity than the current structure and might not get us major 
performance
+ * wins. If this ever becomes a performance bottleneck, that can be considered.
+ * </p>
+ *
+ * <p>
+ * The configuration to define which mode to use and the trigger threshold are 
defined in
+ * solrconfig.xml
+ * </p>
+ */
+public class CPUCircuitBreaker extends CircuitBreaker {
+  private static final Logger log = 
LoggerFactory.getLogger(MethodHandles.lookup().lookupClass());
+  private static final OperatingSystemMXBean operatingSystemMXBean = 
ManagementFactory.getOperatingSystemMXBean();
+
+  private final boolean enabled;
+  private final double cpuUsageThreshold;
+
+  // Assumption -- the value of these parameters will be set correctly before 
invoking getDebugInfo()
+  private static final ThreadLocal<Double> seenCPUUsage = 
ThreadLocal.withInitial(() -> 0.0);
+
+  private static final ThreadLocal<Double> allowedCPUUsage = 
ThreadLocal.withInitial(() -> 0.0);
+
+  public CPUCircuitBreaker(SolrConfig solrConfig) {
+    super(solrConfig);
+
+    this.enabled = solrConfig.cpuCircuitBreakerEnabled;
+    this.cpuUsageThreshold = solrConfig.cpuCircuitBreakerThreshold;
+  }
+
+  @Override
+  public boolean isTripped() {
+    if (!isEnabled()) {
+      return false;
+    }
+
+    if (!enabled) {
+      return false;
+    }
+
+    double localAllowedCPUUsage = getCpuUsageThreshold();
+    double localSeenCPUUsage = calculateLiveCPUUsage();
+
+    if (localSeenCPUUsage < 0) {
+      if (log.isWarnEnabled()) {
+        String msg = "Unable to get CPU usage";
+
+        log.warn(msg);
+      }
+
+      return false;
+    }
+
+    allowedCPUUsage.set(localAllowedCPUUsage);
+
+    seenCPUUsage.set(localSeenCPUUsage);
+
+    return (localSeenCPUUsage >= localAllowedCPUUsage);
+  }
+
+  @Override
+  public String getDebugInfo() {
+
+    if (seenCPUUsage.get() == 0.0 || seenCPUUsage.get() == 0.0) {
+      log.warn("CPUCircuitBreaker's monitored values (seenCPUUSage, 
allowedCPUUsage) not set");
+    }
+
+    return "seenCPUUSage=" + seenCPUUsage.get() + " allowedCPUUsage=" + 
allowedCPUUsage.get();
+  }
+
+  @Override
+  public String getErrorMessage() {
+    return "CPU Circuit Breaker Triggered. Seen CPU usage " + 
seenCPUUsage.get() + " and allocated threshold " +

Review comment:
       Maybe "greater than allocated threshold" ?

##########
File path: 
solr/core/src/test-files/solr/collection1/conf/solrconfig-memory-circuitbreaker.xml
##########
@@ -82,8 +82,14 @@
 
     <useCircuitBreakers>true</useCircuitBreakers>

Review comment:
       I hate typing too much :)  and overly verbose names are often easier to 
mistype.
   
   What do you think about this (note the plural `circuitBreakers` - we have 
more than one now!):
   ```
   <circuitBreakers enabled="true">
    <cpuBreaker enabled="true" threshold="75"/>
    <memBreaker enabled="true" threshold="75"/>
   ...
   </circuitBreakers>
   ```

##########
File path: 
solr/core/src/java/org/apache/solr/util/circuitbreaker/MemoryCircuitBreaker.java
##########
@@ -76,6 +79,10 @@ public boolean isTripped() {
       return false;
     }
 
+    if (!enabled) {
+      return false;
+    }
+
     long localAllowedMemory = getCurrentMemoryThreshold();

Review comment:
       The same comment applies as in the CPU breaker regarding caching vs. 
calling this for every request. The cost should be minimal, it's just making a 
fast native call to read a value.

##########
File path: solr/solr-ref-guide/src/circuit-breakers.adoc
##########
@@ -35,33 +35,68 @@ will be disabled globally. Per circuit breaker 
configurations are specified in t
 <useCircuitBreakers>false</useCircuitBreakers>
 ----
 
+This flag acts as the highest authority and global controller of circuit 
breakers. For using specific circuit breakers, each one
+needs to be individually enabled in addition to this flag being enabled.
+
 == Currently Supported Circuit Breakers
 
 === JVM Heap Usage Based Circuit Breaker
 This circuit breaker tracks JVM heap memory usage and rejects incoming search 
requests with a 503 error code if the heap usage
 exceeds a configured percentage of maximum heap allocated to the JVM (-Xmx). 
The main configuration for this circuit breaker is
 controlling the threshold percentage at which the breaker will trip.
 
-It does not logically make sense to have a threshold below 50% and above 95% 
of the max heap allocated to the JVM. Hence, the range
-of valid values for this parameter is [50, 95], both inclusive.
+To enable/disable JVM heap usage based circuit breaker, use the following 
configuration:
+
+[source,xml]
+----
+<memoryCircuitBreakerEnabled>true</memoryCircuitBreakerEnabled>
+----
+
+Note that this configuration will be overridden by the global circuit breaker 
flag -- if circuit breakers are disabled, this flag
+will not help you.
+
+To set the triggering threshold as a percentage of the max heap allocated to 
the JVM, use the following parameter.
 
 [source,xml]
 ----
 <memoryCircuitBreakerThresholdPct>75</memoryCircuitBreakerThresholdPct>
 ----
+It does not logically make sense to have a threshold below 50% and above 95% 
of the max heap allocated to the JVM. Hence, the range
+of valid values for this parameter is [50, 95], both inclusive.
 
 Consider the following example:
 
 JVM has been allocated a maximum heap of 5GB (-Xmx) and 
memoryCircuitBreakerThresholdPct is set to 75. In this scenario, the heap usage
 at which the circuit breaker will trip is 3.75GB.
 
-Note that this circuit breaker is checked for each incoming search request and 
considers the current heap usage of the node, i.e every search
-request will get the live heap usage and compare it against the set memory 
threshold. The check does not impact performance,
-but any performance regressions that are suspected to be caused by this 
feature should be reported to the dev list.
 
+=== CPU Utilization Based Circuit Breaker
+This circuit breaker tracks CPU utilization and triggers if the average CPU 
utilization over the last one minute
+exceeds a configurable threshold. Note that the value used in computation is 
over the last one minute -- so a sudden
+spike in traffic that goes down might still cause the circuit breaker to 
trigger for a short while before it resolves
+and updates the value. For more details of the calculation, please see 
https://en.wikipedia.org/wiki/Load_(computing)
+
+To enable/disable CPU utilization based circuit breaker, use the following 
configuration:
+
+[source,xml]
+----
+<cpuCircuitBreakerEnabled>true</cpuCircuitBreakerEnabled>
+----
+
+Note that this configuration will be overridden by the global circuit breaker 
flag -- if circuit breakers are disabled, this flag
+will not help you.
+
+To set the triggering threshold in units of CPU utilization, use the following 
parameter.
+[source,xml]
+----
+<cpuCircuitBreakerThreshold>75</cpuCircuitBreakerThreshold>

Review comment:
       The value of 75 would rarely make sense for a production system, maybe 
change this to 20?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to