sigram commented on a change in pull request #1737: URL: https://github.com/apache/lucene-solr/pull/1737#discussion_r472060120
########## File path: solr/core/src/java/org/apache/solr/core/SolrConfig.java ########## @@ -229,9 +229,13 @@ private SolrConfig(SolrResourceLoader loader, String name, boolean isConfigsetTr enableLazyFieldLoading = getBool("query/enableLazyFieldLoading", false); useCircuitBreakers = getBool("circuitBreaker/useCircuitBreakers", false); + cpuCircuitBreakerEnabled = getBool("circuitBreaker/cpuCircuitBreakerEnabled", false); Review comment: Can we please simplify these names? they are awfully verbose and repeating the parts that are already unique and obvious. ########## File path: solr/core/src/java/org/apache/solr/util/circuitbreaker/MemoryCircuitBreaker.java ########## @@ -88,13 +95,19 @@ public boolean isTripped() { @Override public String getDebugInfo() { - if (seenMemory.get() == 0L || allowedMemory.get() == 0L) { + if (seenMemory.get() == 0.0 || allowedMemory.get() == 0.0) { log.warn("MemoryCircuitBreaker's monitored values (seenMemory, allowedMemory) not set"); } return "seenMemory=" + seenMemory.get() + " allowedMemory=" + allowedMemory.get(); } + @Override + public String getErrorMessage() { + return "Memory Circuit Breaker Triggered. Seen JVM heap memory usage " + seenMemory.get() + " and allocated threshold " + Review comment: Similarly, "greater than allocated threshold" ? ########## File path: solr/core/src/java/org/apache/solr/util/circuitbreaker/CPUCircuitBreaker.java ########## @@ -0,0 +1,108 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.solr.util.circuitbreaker; + +import java.lang.invoke.MethodHandles; +import java.lang.management.ManagementFactory; +import java.lang.management.OperatingSystemMXBean; + +import org.apache.solr.core.SolrConfig; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +/** + * <p> + * Tracks current CPU usage and triggers if the specified threshold is breached. + * + * This circuit breaker gets the average CPU load over the last minute and uses + * that data to take a decision. Ideally, we should be able to cache the value + * locally and only query once the minute has elapsed. However, that will introduce + * more complexity than the current structure and might not get us major performance + * wins. If this ever becomes a performance bottleneck, that can be considered. Review comment: Uh ... I see there's still some misunderstanding about this. The call itself is directly passed to the native method that invokes stdlib `getloadavg`, which in turn reads these values from the /proc pseudo-fs. So, the cost is truly minimal and the call doesn't block - if it turns out that it's still too costly to call for every request then we can introduce some timeout-based caching. These averages are so called exponentially weighted moving averages, so indeed a 1-min average has traces of past load values from outside the 1-min window, which helps in smoothing it. This may turn out to be sufficient to avoid false positives due to short-term spikes (such as large merges). Linux loadavg represents to some degree a combined CPU + disk IO load, so indeed intensive IO operations will affect it. We always have an option to use Codahale Meter to easily calculate 5- and 15-min EWMAs if it turns out that we're getting too many false positives. Until then users can configure higher thresholds, thus reducing the number of false positives at the cost of higher contention. ########## File path: solr/solr-ref-guide/src/circuit-breakers.adoc ########## @@ -35,33 +35,68 @@ will be disabled globally. Per circuit breaker configurations are specified in t <useCircuitBreakers>false</useCircuitBreakers> ---- +This flag acts as the highest authority and global controller of circuit breakers. For using specific circuit breakers, each one +needs to be individually enabled in addition to this flag being enabled. Review comment: If we change the format of the config this section needs to be updated too. ########## File path: solr/core/src/java/org/apache/solr/util/circuitbreaker/CPUCircuitBreaker.java ########## @@ -0,0 +1,116 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.solr.util.circuitbreaker; + +import java.lang.invoke.MethodHandles; +import java.lang.management.ManagementFactory; +import java.lang.management.OperatingSystemMXBean; + +import org.apache.solr.core.SolrConfig; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +/** + * <p> + * Tracks current CPU usage and triggers if the specified threshold is breached. + * + * This circuit breaker gets the average CPU load over the last minute and uses + * that data to take a decision. Ideally, we should be able to cache the value + * locally and only query once the minute has elapsed. However, that will introduce + * more complexity than the current structure and might not get us major performance + * wins. If this ever becomes a performance bottleneck, that can be considered. + * </p> + * + * <p> + * The configuration to define which mode to use and the trigger threshold are defined in + * solrconfig.xml + * </p> + */ +public class CPUCircuitBreaker extends CircuitBreaker { + private static final Logger log = LoggerFactory.getLogger(MethodHandles.lookup().lookupClass()); + private static final OperatingSystemMXBean operatingSystemMXBean = ManagementFactory.getOperatingSystemMXBean(); + + private final boolean enabled; + private final double cpuUsageThreshold; + + // Assumption -- the value of these parameters will be set correctly before invoking getDebugInfo() + private static final ThreadLocal<Double> seenCPUUsage = ThreadLocal.withInitial(() -> 0.0); + + private static final ThreadLocal<Double> allowedCPUUsage = ThreadLocal.withInitial(() -> 0.0); + + public CPUCircuitBreaker(SolrConfig solrConfig) { + super(solrConfig); + + this.enabled = solrConfig.cpuCircuitBreakerEnabled; + this.cpuUsageThreshold = solrConfig.cpuCircuitBreakerThreshold; + } + + @Override + public boolean isTripped() { + if (!isEnabled()) { + return false; + } + + if (!enabled) { + return false; + } + + double localAllowedCPUUsage = getCpuUsageThreshold(); + double localSeenCPUUsage = calculateLiveCPUUsage(); + + if (localSeenCPUUsage < 0) { + if (log.isWarnEnabled()) { + String msg = "Unable to get CPU usage"; + + log.warn(msg); + } + + return false; + } + + allowedCPUUsage.set(localAllowedCPUUsage); + + seenCPUUsage.set(localSeenCPUUsage); + + return (localSeenCPUUsage >= localAllowedCPUUsage); + } + + @Override + public String getDebugInfo() { + + if (seenCPUUsage.get() == 0.0 || seenCPUUsage.get() == 0.0) { + log.warn("CPUCircuitBreaker's monitored values (seenCPUUSage, allowedCPUUsage) not set"); + } + + return "seenCPUUSage=" + seenCPUUsage.get() + " allowedCPUUsage=" + allowedCPUUsage.get(); + } + + @Override + public String getErrorMessage() { + return "CPU Circuit Breaker Triggered. Seen CPU usage " + seenCPUUsage.get() + " and allocated threshold " + Review comment: Maybe "greater than allocated threshold" ? ########## File path: solr/core/src/test-files/solr/collection1/conf/solrconfig-memory-circuitbreaker.xml ########## @@ -82,8 +82,14 @@ <useCircuitBreakers>true</useCircuitBreakers> Review comment: I hate typing too much :) and overly verbose names are often easier to mistype. What do you think about this (note the plural `circuitBreakers` - we have more than one now!): ``` <circuitBreakers enabled="true"> <cpuBreaker enabled="true" threshold="75"/> <memBreaker enabled="true" threshold="75"/> ... </circuitBreakers> ``` ########## File path: solr/core/src/java/org/apache/solr/util/circuitbreaker/MemoryCircuitBreaker.java ########## @@ -76,6 +79,10 @@ public boolean isTripped() { return false; } + if (!enabled) { + return false; + } + long localAllowedMemory = getCurrentMemoryThreshold(); Review comment: The same comment applies as in the CPU breaker regarding caching vs. calling this for every request. The cost should be minimal, it's just making a fast native call to read a value. ########## File path: solr/solr-ref-guide/src/circuit-breakers.adoc ########## @@ -35,33 +35,68 @@ will be disabled globally. Per circuit breaker configurations are specified in t <useCircuitBreakers>false</useCircuitBreakers> ---- +This flag acts as the highest authority and global controller of circuit breakers. For using specific circuit breakers, each one +needs to be individually enabled in addition to this flag being enabled. + == Currently Supported Circuit Breakers === JVM Heap Usage Based Circuit Breaker This circuit breaker tracks JVM heap memory usage and rejects incoming search requests with a 503 error code if the heap usage exceeds a configured percentage of maximum heap allocated to the JVM (-Xmx). The main configuration for this circuit breaker is controlling the threshold percentage at which the breaker will trip. -It does not logically make sense to have a threshold below 50% and above 95% of the max heap allocated to the JVM. Hence, the range -of valid values for this parameter is [50, 95], both inclusive. +To enable/disable JVM heap usage based circuit breaker, use the following configuration: + +[source,xml] +---- +<memoryCircuitBreakerEnabled>true</memoryCircuitBreakerEnabled> +---- + +Note that this configuration will be overridden by the global circuit breaker flag -- if circuit breakers are disabled, this flag +will not help you. + +To set the triggering threshold as a percentage of the max heap allocated to the JVM, use the following parameter. [source,xml] ---- <memoryCircuitBreakerThresholdPct>75</memoryCircuitBreakerThresholdPct> ---- +It does not logically make sense to have a threshold below 50% and above 95% of the max heap allocated to the JVM. Hence, the range +of valid values for this parameter is [50, 95], both inclusive. Consider the following example: JVM has been allocated a maximum heap of 5GB (-Xmx) and memoryCircuitBreakerThresholdPct is set to 75. In this scenario, the heap usage at which the circuit breaker will trip is 3.75GB. -Note that this circuit breaker is checked for each incoming search request and considers the current heap usage of the node, i.e every search -request will get the live heap usage and compare it against the set memory threshold. The check does not impact performance, -but any performance regressions that are suspected to be caused by this feature should be reported to the dev list. +=== CPU Utilization Based Circuit Breaker +This circuit breaker tracks CPU utilization and triggers if the average CPU utilization over the last one minute +exceeds a configurable threshold. Note that the value used in computation is over the last one minute -- so a sudden +spike in traffic that goes down might still cause the circuit breaker to trigger for a short while before it resolves +and updates the value. For more details of the calculation, please see https://en.wikipedia.org/wiki/Load_(computing) + +To enable/disable CPU utilization based circuit breaker, use the following configuration: + +[source,xml] +---- +<cpuCircuitBreakerEnabled>true</cpuCircuitBreakerEnabled> +---- + +Note that this configuration will be overridden by the global circuit breaker flag -- if circuit breakers are disabled, this flag +will not help you. + +To set the triggering threshold in units of CPU utilization, use the following parameter. +[source,xml] +---- +<cpuCircuitBreakerThreshold>75</cpuCircuitBreakerThreshold> Review comment: The value of 75 would rarely make sense for a production system, maybe change this to 20? ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org